DPDK patches and discussions
* [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements.
@ 2015-01-20 18:40 Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 01/18] fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y Konstantin Ananyev
                   ` (20 more replies)
  0 siblings, 21 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

v3 changes:
Applied review comments from Thomas:
- fix spelling errors reported by codespell.
- split the last patch into two:
    first to remove unused macros,
    second to add some comments about the ACL internal layout.

v2 changes:
- When built with compilers that don't support AVX2 instructions,
make rte_acl_classify_avx2() do nothing and return an error
(see the stub sketch below).
- Remove unneeded 'ifdef __AVX2__' in acl_run_avx2.*.
- Reorder the patches in the set, to keep RTE_LIBRTE_ACL_STANDALONE=y
always buildable.
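
For reference, the fallback mentioned above could look like the sketch
below. This is illustrative only: the guard macro name (CC_AVX2_SUPPORT)
and the exact signature are assumptions, not necessarily what the patch
does.

#ifndef CC_AVX2_SUPPORT
/*
 * The compiler cannot generate AVX2 code: provide a dummy
 * implementation that reports failure instead of the real
 * AVX2 classify method.
 */
int
rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
	__rte_unused const uint8_t **data,
	__rte_unused uint32_t *results,
	__rte_unused uint32_t num,
	__rte_unused uint32_t categories)
{
	return -ENOTSUP;
}
#endif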

This patch series contains several fixes and enhancements for the ACL library.
See the complete list below.
Two main changes are externally visible:
- Introduce a new classify method: RTE_ACL_CLASSIFY_AVX2.
It uses AVX2 instructions and 256-bit wide data types
to perform the internal trie traversal.
That helps to increase classify() throughput.
This method is selected as the default one on CPUs that support AVX2.
- Introduce a new field in the build config structure: max_size.
It specifies the maximum size that the internal run-time (RT) structure
for a given context can reach.
The purpose is to let the user decide on the space/performance trade-off
(faster classify() vs less space for the internal RT structures)
for each given set of rules; a usage sketch follows below.
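
To illustrate how the two changes above can be used from the application
side, here is a minimal sketch. It assumes the max_size field and the
RTE_ACL_CLASSIFY_AVX2 enum value introduced by this series, plus the
existing rte_acl_set_ctx_classify() API; the numeric limit is made up.

#include <rte_acl.h>

static int
build_and_tune(struct rte_acl_ctx *acx, struct rte_acl_config *cfg)
{
	int ret;

	/*
	 * Cap the internal RT structures for this context at 4MB:
	 * trade some classify() speed for a smaller memory footprint.
	 */
	cfg->max_size = 4 * 1024 * 1024;

	ret = rte_acl_build(acx, cfg);
	if (ret != 0)
		return ret;

	/*
	 * Optionally force the new method instead of the default choice;
	 * expected to fail on CPUs without AVX2 support.
	 */
	return rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_AVX2);
}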

Konstantin Ananyev (18):
  fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y
  app/test: few small fixes for test_acl.c
  librte_acl: make data_indexes long enough to survive idle transitions.
  librte_acl: remove build phase heuristic with negative performance
    effect.
  librte_acl: fix a bug at build phase that can cause matches being
    overwritten.
  librte_acl: introduce DFA nodes compression (group64) for identical
    entries.
  librte_acl: build/gen phase - simplify the way match nodes are
    allocated.
  librte_acl: make scalar RT code to be more similar to vector one.
  librte_acl: a bit of RT code deduplication.
  EAL: introduce rte_ymm and relatives in rte_common_vect.h.
  librte_acl: add AVX2 as new rte_acl_classify() method
  test-acl: add ability to manually select RT method.
  librte_acl: Remove search_sse_2 and relatives.
  librte_acl: move lo/hi dwords shuffle out from calc_addr
  librte_acl: make calc_addr a define to deduplicate the code.
  librte_acl: introduce max_size into rte_acl_config.
  librte_acl: remove unused macros.
  librte_acl: add some comments about ACL internal layout.

 app/test-acl/main.c                             | 126 +++--
 app/test/test_acl.c                             |   8 +-
 examples/l3fwd-acl/main.c                       |   3 +-
 examples/l3fwd/main.c                           |   2 +-
 lib/librte_acl/Makefile                         |  18 +
 lib/librte_acl/acl.h                            |  58 ++-
 lib/librte_acl/acl_bld.c                        | 392 +++++++---------
 lib/librte_acl/acl_gen.c                        | 268 +++++++----
 lib/librte_acl/acl_run.h                        |   7 +-
 lib/librte_acl/acl_run_avx2.c                   |  54 +++
 lib/librte_acl/acl_run_avx2.h                   | 284 ++++++++++++
 lib/librte_acl/acl_run_scalar.c                 |  65 ++-
 lib/librte_acl/acl_run_sse.c                    | 585 +-----------------------
 lib/librte_acl/acl_run_sse.h                    | 357 +++++++++++++++
 lib/librte_acl/acl_vect.h                       | 132 +++---
 lib/librte_acl/rte_acl.c                        |  47 +-
 lib/librte_acl/rte_acl.h                        |   4 +
 lib/librte_acl/rte_acl_osdep_alone.h            |  47 +-
 lib/librte_eal/common/include/rte_common_vect.h |  39 +-
 lib/librte_lpm/rte_lpm.h                        |   2 +-
 20 files changed, 1444 insertions(+), 1054 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx2.c
 create mode 100644 lib/librte_acl/acl_run_avx2.h
 create mode 100644 lib/librte_acl/acl_run_sse.h

-- 
1.8.5.3


* [dpdk-dev] [PATCH v3 01/18] fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 02/18] app/test: few small fixes for test_acl.c Konstantin Ananyev
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

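In short (summarizing the diff below, not the author's wording): the
standalone environment lacks a few definitions that the library now
relies on, so provide minimal stand-ins: a struct rte_tailq_entry, an
rte_zmalloc() wrapper around rte_zmalloc_socket(), a default alignment
for a zero 'align' argument, and an rte_cpu_get_flag_enabled() stub
that always reports the flag as absent.
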
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl_osdep_alone.h | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/lib/librte_acl/rte_acl_osdep_alone.h b/lib/librte_acl/rte_acl_osdep_alone.h
index a84b6f9..2a99860 100644
--- a/lib/librte_acl/rte_acl_osdep_alone.h
+++ b/lib/librte_acl/rte_acl_osdep_alone.h
@@ -214,6 +214,13 @@ rte_rdtsc(void)
 /*
  * rte_tailq related.
  */
+
+struct rte_tailq_entry {
+	TAILQ_ENTRY(rte_tailq_entry) next; /**< Pointer entries for a tailq list
+ */
+	void *data; /**< Pointer to the data referenced by this tailq entry */
+};
+
 static inline void *
 rte_dummy_tailq(void)
 {
@@ -248,6 +255,7 @@ rte_zmalloc_socket(__rte_unused const char *type, size_t size, unsigned align,
 	void *ptr;
 	int rc;
 
+	align = (align != 0) ? align : RTE_CACHE_LINE_SIZE;
 	rc = posix_memalign(&ptr, align, size);
 	if (rc != 0) {
 		rte_errno = rc;
@@ -258,6 +266,8 @@ rte_zmalloc_socket(__rte_unused const char *type, size_t size, unsigned align,
 	return ptr;
 }
 
+#define	rte_zmalloc(type, sz, align)	rte_zmalloc_socket(type, sz, align, 0)
+
 /*
  * rte_debug related
  */
@@ -271,6 +281,8 @@ rte_zmalloc_socket(__rte_unused const char *type, size_t size, unsigned align,
 	exit(err);                           \
 } while (0)
 
+#define	rte_cpu_get_flag_enabled(x)	(0)
+
 #ifdef __cplusplus
 }
 #endif
-- 
1.8.5.3


* [dpdk-dev] [PATCH v3 02/18] app/test: few small fixes for test_acl.c
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 01/18] fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 03/18] librte_acl: make data_indexes long enough to survive idle transitions Konstantin Ananyev
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

Make sure that test_acl does not ignore error conditions.
Run classify() with every possible burst size (num = 0 up to the full
test-data set).

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test/test_acl.c | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 356d620..7119ad3 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -111,7 +111,7 @@ test_classify_run(struct rte_acl_ctx *acx)
 	 * these will run quite a few times, it's necessary to test code paths
 	 * from num=0 to num>8
 	 */
-	for (count = 0; count < RTE_DIM(acl_test_data); count++) {
+	for (count = 0; count <= RTE_DIM(acl_test_data); count++) {
 		ret = rte_acl_classify(acx, data, results,
 				count, RTE_ACL_MAX_CATEGORIES);
 		if (ret != 0) {
@@ -128,6 +128,7 @@ test_classify_run(struct rte_acl_ctx *acx)
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, acl_test_data[i].allow,
 					result);
+				ret = -EINVAL;
 				goto err;
 			}
 		}
@@ -140,6 +141,7 @@ test_classify_run(struct rte_acl_ctx *acx)
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, acl_test_data[i].deny,
 					result);
+				ret = -EINVAL;
 				goto err;
 			}
 		}
@@ -150,7 +152,7 @@ test_classify_run(struct rte_acl_ctx *acx)
 			RTE_DIM(acl_test_data), RTE_ACL_MAX_CATEGORIES,
 			RTE_ACL_CLASSIFY_SCALAR);
 	if (ret != 0) {
-		printf("Line %i: SSE classify failed!\n", __LINE__);
+		printf("Line %i: scalar classify failed!\n", __LINE__);
 		goto err;
 	}
 
@@ -162,6 +164,7 @@ test_classify_run(struct rte_acl_ctx *acx)
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, acl_test_data[i].allow,
 					result);
+			ret = -EINVAL;
 			goto err;
 		}
 	}
@@ -174,6 +177,7 @@ test_classify_run(struct rte_acl_ctx *acx)
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, acl_test_data[i].deny,
 					result);
+			ret = -EINVAL;
 			goto err;
 		}
 	}
-- 
1.8.5.3


* [dpdk-dev] [PATCH v3 03/18] librte_acl: make data_indexes long enough to survive idle transitions.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 01/18] fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 02/18] app/test: few small fixes for test_acl.c Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 04/18] librte_acl: remove build phase heuristic with negative performance effect Konstantin Ananyev
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

Make data_indexes long enough to survive idle transitions.
That allows simplifying the match processing code.
Also fix incorrect size calculations for the data indexes (see the
note below).
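
In short (arithmetic taken from the diff below): each trie now reserves
a fixed stride of RTE_ACL_MAX_FIELDS entries in ctx->data_indexes
(ofs += RTE_ACL_MAX_FIELDS), so the total area passed to rte_acl_gen()
becomes RTE_ACL_MAX_FIELDS * RTE_DIM(bcx.tries) *
sizeof(ctx->data_indexes[0]) bytes; the old expression
(RTE_ACL_IPV4VLAN_NUM * RTE_DIM(bcx.tries)) was an element count, not
a byte size.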

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_bld.c | 5 +++--
 lib/librte_acl/acl_run.h | 4 ----
 2 files changed, 3 insertions(+), 6 deletions(-)

diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index d6e0c45..c5a674a 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -1948,7 +1948,7 @@ acl_set_data_indexes(struct rte_acl_ctx *ctx)
 		memcpy(ctx->data_indexes + ofs, ctx->trie[i].data_index,
 			n * sizeof(ctx->data_indexes[0]));
 		ctx->trie[i].data_index = ctx->data_indexes + ofs;
-		ofs += n;
+		ofs += RTE_ACL_MAX_FIELDS;
 	}
 }
 
@@ -1988,7 +1988,8 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 		/* allocate and fill run-time  structures. */
 		rc = rte_acl_gen(ctx, bcx.tries, bcx.bld_tries,
 				bcx.num_tries, bcx.cfg.num_categories,
-				RTE_ACL_IPV4VLAN_NUM * RTE_DIM(bcx.tries),
+				RTE_ACL_MAX_FIELDS * RTE_DIM(bcx.tries) *
+				sizeof(ctx->data_indexes[0]),
 				bcx.num_build_rules);
 		if (rc == 0) {
 
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
index c191053..4c843c1 100644
--- a/lib/librte_acl/acl_run.h
+++ b/lib/librte_acl/acl_run.h
@@ -256,10 +256,6 @@ acl_match_check(uint64_t transition, int slot,
 
 		/* Fill the slot with the next trie or idle trie */
 		transition = acl_start_next_trie(flows, parms, slot, ctx);
-
-	} else if (transition == ctx->idle) {
-		/* reset indirection table for idle slots */
-		parms[slot].data_index = idle;
 	}
 
 	return transition;
-- 
1.8.5.3


* [dpdk-dev] [PATCH v3 04/18] librte_acl: remove build phase heuristic with negative performance effect.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (2 preceding siblings ...)
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 03/18] librte_acl: make data_indexes long enough to survive idle transitions Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches being overwritten Konstantin Ananyev
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

The current rule-wildness based heuristic can cause unnecessary splits
of the rule set.
That might have a negative performance effect:
more tries to traverse, bigger RT tables.
After removing it, a ~50% speedup was observed on some test cases with
big rule sets (~10K rules).
No difference for smaller rule sets.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_bld.c | 277 +++++++++++++++++------------------------------
 1 file changed, 97 insertions(+), 180 deletions(-)

diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index c5a674a..8bf4a54 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -1539,11 +1539,9 @@ acl_calc_wildness(struct rte_acl_build_rule *head,
 	return 0;
 }
 
-static int
-acl_rule_stats(struct rte_acl_build_rule *head, struct rte_acl_config *config,
-	uint32_t *wild_limit)
+static void
+acl_rule_stats(struct rte_acl_build_rule *head, struct rte_acl_config *config)
 {
-	int min;
 	struct rte_acl_build_rule *rule;
 	uint32_t n, m, fields_deactivated = 0;
 	uint32_t start = 0, deactivate = 0;
@@ -1604,129 +1602,58 @@ acl_rule_stats(struct rte_acl_build_rule *head, struct rte_acl_config *config,
 
 		for (k = 0; k < config->num_fields; k++) {
 			if (tally[k][TALLY_DEACTIVATED] == 0) {
-				memcpy(&tally[l][0], &tally[k][0],
+				memmove(&tally[l][0], &tally[k][0],
 					TALLY_NUM * sizeof(tally[0][0]));
-				memcpy(&config->defs[l++],
+				memmove(&config->defs[l++],
 					&config->defs[k],
 					sizeof(struct rte_acl_field_def));
 			}
 		}
 		config->num_fields = l;
 	}
-
-	min = RTE_ACL_SINGLE_TRIE_SIZE;
-	if (config->num_fields == 2)
-		min *= 4;
-	else if (config->num_fields == 3)
-		min *= 3;
-	else if (config->num_fields == 4)
-		min *= 2;
-
-	if (tally[0][TALLY_0] < min)
-		return 0;
-	for (n = 0; n < config->num_fields; n++)
-		wild_limit[n] = 0;
-
-	/*
-	 * If trailing fields are 100% wild, group those together.
-	 * This allows the search length of the trie to be shortened.
-	 */
-	for (n = 1; n < config->num_fields; n++) {
-
-		double rule_percentage = (double)tally[n][TALLY_DEPTH] /
-			tally[n][0];
-
-		if (rule_percentage > RULE_PERCENTAGE) {
-			/* if it crosses an input boundary then round up */
-			while (config->defs[n - 1].input_index ==
-					config->defs[n].input_index)
-				n++;
-
-			/* set the limit for selecting rules */
-			while (n < config->num_fields)
-				wild_limit[n++] = 100;
-
-			if (wild_limit[n - 1] == 100)
-				return 1;
-		}
-	}
-
-	/* look for the most wild that's 40% or more of the rules */
-	for (n = 1; n < config->num_fields; n++) {
-		for (m = TALLY_100; m > 0; m--) {
-
-			double rule_percentage = (double)tally[n][m] /
-				tally[n][0];
-
-			if (tally[n][TALLY_DEACTIVATED] == 0 &&
-					tally[n][TALLY_0] >
-					RTE_ACL_SINGLE_TRIE_SIZE &&
-					rule_percentage > NODE_PERCENTAGE &&
-					rule_percentage < 0.80) {
-				wild_limit[n] = wild_limits[m];
-				return 1;
-			}
-		}
-	}
-	return 0;
 }
 
 static int
-order(struct rte_acl_build_rule **insert, struct rte_acl_build_rule *rule)
+rule_cmp_wildness(struct rte_acl_build_rule *r1, struct rte_acl_build_rule *r2)
 {
 	uint32_t n;
-	struct rte_acl_build_rule *left = *insert;
-
-	if (left == NULL)
-		return 0;
 
-	for (n = 1; n < left->config->num_fields; n++) {
-		int field_index = left->config->defs[n].field_index;
+	for (n = 1; n < r1->config->num_fields; n++) {
+		int field_index = r1->config->defs[n].field_index;
 
-		if (left->wildness[field_index] != rule->wildness[field_index])
-			return (left->wildness[field_index] >=
-				rule->wildness[field_index]);
+		if (r1->wildness[field_index] != r2->wildness[field_index])
+			return (r1->wildness[field_index] -
+				r2->wildness[field_index]);
 	}
 	return 0;
 }
 
 static struct rte_acl_build_rule *
-ordered_insert_rule(struct rte_acl_build_rule *head,
-	struct rte_acl_build_rule *rule)
-{
-	struct rte_acl_build_rule **insert;
-
-	if (rule == NULL)
-		return head;
-
-	rule->next = head;
-	if (head == NULL)
-		return rule;
-
-	insert = &head;
-	while (order(insert, rule))
-		insert = &(*insert)->next;
-
-	rule->next = *insert;
-	*insert = rule;
-	return head;
-}
-
-static struct rte_acl_build_rule *
 sort_rules(struct rte_acl_build_rule *head)
 {
-	struct rte_acl_build_rule *rule, *reordered_head = NULL;
-	struct rte_acl_build_rule *last_rule = NULL;
-
-	for (rule = head; rule != NULL; rule = rule->next) {
-		reordered_head = ordered_insert_rule(reordered_head, last_rule);
-		last_rule = rule;
+	struct rte_acl_build_rule *new_head;
+	struct rte_acl_build_rule *l, *r, **p;
+
+	new_head = NULL;
+	while (head != NULL) {
+		r = head;
+		head = r->next;
+		r->next = NULL;
+		if (new_head == NULL) {
+			new_head = r;
+		} else {
+			for (p = &new_head;
+					(l = *p) != NULL &&
+					rule_cmp_wildness(l, r) >= 0;
+					p = &l->next)
+				;
+
+			r->next = *p;
+			*p = r;
+		}
 	}
 
-	if (last_rule != reordered_head)
-		reordered_head = ordered_insert_rule(reordered_head, last_rule);
-
-	return reordered_head;
+	return new_head;
 }
 
 static uint32_t
@@ -1748,21 +1675,44 @@ acl_build_index(const struct rte_acl_config *config, uint32_t *data_index)
 	return m;
 }
 
+static struct rte_acl_build_rule *
+build_one_trie(struct acl_build_context *context,
+	struct rte_acl_build_rule *rule_sets[RTE_ACL_MAX_TRIES],
+	uint32_t n)
+{
+	struct rte_acl_build_rule *last;
+	struct rte_acl_config *config;
+
+	config = rule_sets[n]->config;
+
+	acl_rule_stats(rule_sets[n], config);
+	rule_sets[n] = sort_rules(rule_sets[n]);
+
+	context->tries[n].type = RTE_ACL_FULL_TRIE;
+	context->tries[n].count = 0;
+
+	context->tries[n].num_data_indexes = acl_build_index(config,
+		context->data_indexes[n]);
+	context->tries[n].data_index = context->data_indexes[n];
+
+	context->bld_tries[n].trie = build_trie(context, rule_sets[n],
+		&last, &context->tries[n].count);
+
+	return last;
+}
+
 static int
 acl_build_tries(struct acl_build_context *context,
 	struct rte_acl_build_rule *head)
 {
 	int32_t rc;
-	uint32_t n, m, num_tries;
+	uint32_t n, num_tries;
 	struct rte_acl_config *config;
-	struct rte_acl_build_rule *last, *rule;
-	uint32_t wild_limit[RTE_ACL_MAX_LEVELS];
+	struct rte_acl_build_rule *last;
 	struct rte_acl_build_rule *rule_sets[RTE_ACL_MAX_TRIES];
 
 	config = head->config;
-	rule = head;
 	rule_sets[0] = head;
-	num_tries = 1;
 
 	/* initialize tries */
 	for (n = 0; n < RTE_DIM(context->tries); n++) {
@@ -1779,91 +1729,55 @@ acl_build_tries(struct acl_build_context *context,
 	if (rc != 0)
 		return rc;
 
-	n = acl_rule_stats(head, config, &wild_limit[0]);
-
-	/* put all rules that fit the wildness criteria into a seperate trie */
-	while (n > 0 && num_tries < RTE_ACL_MAX_TRIES) {
+	for (n = 0;; n = num_tries) {
 
-		struct rte_acl_config *new_config;
-		struct rte_acl_build_rule **prev = &rule_sets[num_tries - 1];
-		struct rte_acl_build_rule *next = head->next;
+		num_tries = n + 1;
 
-		new_config = acl_build_alloc(context, 1, sizeof(*new_config));
-		if (new_config == NULL) {
-			RTE_LOG(ERR, ACL,
-				"Failed to get space for new config\n");
+		last = build_one_trie(context, rule_sets, n);
+		if (context->bld_tries[n].trie == NULL) {
+			RTE_LOG(ERR, ACL, "Build of %u-th trie failed\n", n);
 			return -ENOMEM;
 		}
 
-		memcpy(new_config, config, sizeof(*new_config));
-		config = new_config;
-		rule_sets[num_tries] = NULL;
-
-		for (rule = head; rule != NULL; rule = next) {
+		/* Build of the last trie completed. */
+		if (last == NULL)
+			break;
 
-			int move = 1;
+		if (num_tries == RTE_DIM(context->tries)) {
+			RTE_LOG(ERR, ACL,
+				"Exceeded max number of tries: %u\n",
+				num_tries);
+			return -ENOMEM;
+		}
 
-			next = rule->next;
-			for (m = 0; m < config->num_fields; m++) {
-				int x = config->defs[m].field_index;
-				if (rule->wildness[x] < wild_limit[m]) {
-					move = 0;
-					break;
-				}
-			}
+		/* Trie is getting too big, split remaining rule set. */
+		rule_sets[num_tries] = last->next;
+		last->next = NULL;
+		acl_free_node(context, context->bld_tries[n].trie);
 
-			if (move) {
-				rule->config = new_config;
-				rule->next = rule_sets[num_tries];
-				rule_sets[num_tries] = rule;
-				*prev = next;
-			} else
-				prev = &rule->next;
+		/* Create a new copy of config for remaining rules. */
+		config = acl_build_alloc(context, 1, sizeof(*config));
+		if (config == NULL) {
+			RTE_LOG(ERR, ACL,
+				"New config allocation for %u-th "
+				"trie failed\n", num_tries);
+			return -ENOMEM;
 		}
 
-		head = rule_sets[num_tries];
-		n = acl_rule_stats(rule_sets[num_tries], config,
-			&wild_limit[0]);
-		num_tries++;
-	}
-
-	if (n > 0)
-		RTE_LOG(DEBUG, ACL,
-			"Number of tries(%d) exceeded.\n", RTE_ACL_MAX_TRIES);
+		memcpy(config, rule_sets[n]->config, sizeof(*config));
 
-	for (n = 0; n < num_tries; n++) {
+		/* Make remaining rules use new config. */
+		for (head = rule_sets[num_tries]; head != NULL;
+				head = head->next)
+			head->config = config;
 
-		rule_sets[n] = sort_rules(rule_sets[n]);
-		context->tries[n].type = RTE_ACL_FULL_TRIE;
-		context->tries[n].count = 0;
-		context->tries[n].num_data_indexes =
-			acl_build_index(rule_sets[n]->config,
-			context->data_indexes[n]);
-		context->tries[n].data_index = context->data_indexes[n];
-
-		context->bld_tries[n].trie =
-				build_trie(context, rule_sets[n],
-				&last, &context->tries[n].count);
-		if (context->bld_tries[n].trie == NULL) {
+		/* Rebuild the trie for the reduced rule-set. */
+		last = build_one_trie(context, rule_sets, n);
+		if (context->bld_tries[n].trie == NULL || last != NULL) {
 			RTE_LOG(ERR, ACL, "Build of %u-th trie failed\n", n);
 			return -ENOMEM;
 		}
 
-		if (last != NULL) {
-			rule_sets[num_tries++] = last->next;
-			last->next = NULL;
-			acl_free_node(context, context->bld_tries[n].trie);
-			context->tries[n].count = 0;
-
-			context->bld_tries[n].trie =
-					build_trie(context, rule_sets[n],
-					&last, &context->tries[n].count);
-			if (context->bld_tries[n].trie == NULL) {
-				RTE_LOG(ERR, ACL,
-					"Build of %u-th trie failed\n", n);
-					return -ENOMEM;
-			}
-		}
 	}
 
 	context->num_tries = num_tries;
@@ -1876,15 +1790,18 @@ acl_build_log(const struct acl_build_context *ctx)
 	uint32_t n;
 
 	RTE_LOG(DEBUG, ACL, "Build phase for ACL \"%s\":\n"
+		"nodes created: %u\n"
 		"memory consumed: %zu\n",
 		ctx->acx->name,
+		ctx->num_nodes,
 		ctx->pool.alloc);
 
 	for (n = 0; n < RTE_DIM(ctx->tries); n++) {
 		if (ctx->tries[n].count != 0)
 			RTE_LOG(DEBUG, ACL,
-				"trie %u: number of rules: %u\n",
-				n, ctx->tries[n].count);
+				"trie %u: number of rules: %u, indexes: %u\n",
+				n, ctx->tries[n].count,
+				ctx->tries[n].num_data_indexes);
 	}
 }
 
-- 
1.8.5.3


* [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches being overwritten.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (3 preceding siblings ...)
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 04/18] librte_acl: remove build phase heuristic with negative performance effect Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-25 17:34   ` Neil Horman
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 06/18] librte_acl: introduce DFA nodes compression (group64) for identical entries Konstantin Ananyev
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

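A note on the fix (my reading of the one-line change below, not the
author's wording): match result entries are effectively indexed from 1,
with entry 0 reserved for the NOMATCH result (see the "NOMATCH result
at index 0" comment added later in this series), so the match-results
array needs num_build_rules + 1 slots; sized by num_build_rules alone,
the last match could end up overwriting another entry.
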
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_bld.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 8bf4a54..22f7934 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -1907,7 +1907,7 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 				bcx.num_tries, bcx.cfg.num_categories,
 				RTE_ACL_MAX_FIELDS * RTE_DIM(bcx.tries) *
 				sizeof(ctx->data_indexes[0]),
-				bcx.num_build_rules);
+				bcx.num_build_rules + 1);
 		if (rc == 0) {
 
 			/* set data indexes. */
-- 
1.8.5.3


* [dpdk-dev] [PATCH v3 06/18] librte_acl: introduce DFA nodes compression (group64) for identical entries.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (4 preceding siblings ...)
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches being overwritten Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 07/18] librte_acl: build/gen phase - simplify the way match nodes are allocated Konstantin Ananyev
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

Introduce division of the whole set of 256 child transition entries
into 4 sub-groups (64 children per group).
So 2 groups within the same node that have identical children
can share one set of transition entries.
That allows compacting some DFA nodes and saving space in the RT table,
without any negative performance impact.
From what I've seen, the average space saving is ~20%.
A worked illustration follows below.
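
As a worked illustration (example numbers, not measurements): a plain
DFA node stores RTE_ACL_DFA_SIZE = 256 transition entries of 8 bytes
each, i.e. 2KB. With 64-entry grouping, a node whose four groups hold
only two distinct entry patterns:

	group 0 (input 0x00-0x3f) -> entry set A
	group 1 (input 0x40-0x7f) -> entry set B
	group 2 (input 0x80-0xbf) -> entry set B (shares group 1 entries)
	group 3 (input 0xc0-0xff) -> entry set B (shares group 1 entries)

needs only 2 * 64 * 8 = 1KB in the RT table, a 50% saving for that node.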

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h            |  12 ++-
 lib/librte_acl/acl_gen.c        | 195 ++++++++++++++++++++++++++++------------
 lib/librte_acl/acl_run_scalar.c |  38 ++++----
 lib/librte_acl/acl_run_sse.c    |  99 ++++++--------------
 4 files changed, 196 insertions(+), 148 deletions(-)

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 102fa51..3f6ac79 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -47,6 +47,11 @@ extern"C" {
 #define RTE_ACL_DFA_MAX		UINT8_MAX
 #define RTE_ACL_DFA_SIZE	(UINT8_MAX + 1)
 
+#define	RTE_ACL_DFA_GR64_SIZE	64
+#define	RTE_ACL_DFA_GR64_NUM	(RTE_ACL_DFA_SIZE / RTE_ACL_DFA_GR64_SIZE)
+#define	RTE_ACL_DFA_GR64_BIT	\
+	(CHAR_BIT * sizeof(uint32_t) / RTE_ACL_DFA_GR64_NUM)
+
 typedef int bits_t;
 
 #define	RTE_ACL_BIT_SET_SIZE	((UINT8_MAX + 1) / (sizeof(bits_t) * CHAR_BIT))
@@ -100,8 +105,11 @@ struct rte_acl_node {
 	/* number of ranges (transitions w/ consecutive bits) */
 	int32_t                 id;
 	struct rte_acl_match_results *mrt; /* only valid when match_flag != 0 */
-	char                         transitions[RTE_ACL_QUAD_SIZE];
-	/* boundaries for ranged node */
+	union {
+		char            transitions[RTE_ACL_QUAD_SIZE];
+		/* boundaries for ranged node */
+		uint8_t         dfa_gr64[RTE_ACL_DFA_GR64_NUM];
+	};
 	struct rte_acl_node     *next;
 	/* free list link or pointer to duplicate node during merge */
 	struct rte_acl_node     *prev;
diff --git a/lib/librte_acl/acl_gen.c b/lib/librte_acl/acl_gen.c
index b1f766b..c9b7839 100644
--- a/lib/librte_acl/acl_gen.c
+++ b/lib/librte_acl/acl_gen.c
@@ -43,13 +43,14 @@
 } while (0)
 
 struct acl_node_counters {
-	int                match;
-	int                match_used;
-	int                single;
-	int                quad;
-	int                quad_vectors;
-	int                dfa;
-	int                smallest_match;
+	int32_t match;
+	int32_t match_used;
+	int32_t single;
+	int32_t quad;
+	int32_t quad_vectors;
+	int32_t dfa;
+	int32_t dfa_gr64;
+	int32_t smallest_match;
 };
 
 struct rte_acl_indices {
@@ -61,24 +62,118 @@ struct rte_acl_indices {
 
 static void
 acl_gen_log_stats(const struct rte_acl_ctx *ctx,
-	const struct acl_node_counters *counts)
+	const struct acl_node_counters *counts,
+	const struct rte_acl_indices *indices)
 {
 	RTE_LOG(DEBUG, ACL, "Gen phase for ACL \"%s\":\n"
 		"runtime memory footprint on socket %d:\n"
 		"single nodes/bytes used: %d/%zu\n"
-		"quad nodes/bytes used: %d/%zu\n"
-		"DFA nodes/bytes used: %d/%zu\n"
+		"quad nodes/vectors/bytes used: %d/%d/%zu\n"
+		"DFA nodes/group64/bytes used: %d/%d/%zu\n"
 		"match nodes/bytes used: %d/%zu\n"
 		"total: %zu bytes\n",
 		ctx->name, ctx->socket_id,
 		counts->single, counts->single * sizeof(uint64_t),
-		counts->quad, counts->quad_vectors * sizeof(uint64_t),
-		counts->dfa, counts->dfa * RTE_ACL_DFA_SIZE * sizeof(uint64_t),
+		counts->quad, counts->quad_vectors,
+		(indices->quad_index - indices->dfa_index) * sizeof(uint64_t),
+		counts->dfa, counts->dfa_gr64,
+		indices->dfa_index * sizeof(uint64_t),
 		counts->match,
 		counts->match * sizeof(struct rte_acl_match_results),
 		ctx->mem_sz);
 }
 
+static uint64_t
+acl_dfa_gen_idx(const struct rte_acl_node *node, uint32_t index)
+{
+	uint64_t idx;
+	uint32_t i;
+
+	idx = 0;
+	for (i = 0; i != RTE_DIM(node->dfa_gr64); i++) {
+		RTE_ACL_VERIFY(node->dfa_gr64[i] < RTE_ACL_DFA_GR64_NUM);
+		RTE_ACL_VERIFY(node->dfa_gr64[i] < node->fanout);
+		idx |= (i - node->dfa_gr64[i]) <<
+			(6 + RTE_ACL_DFA_GR64_BIT * i);
+	}
+
+	return idx << (CHAR_BIT * sizeof(index)) | index | node->node_type;
+}
+
+static void
+acl_dfa_fill_gr64(const struct rte_acl_node *node,
+	const uint64_t src[RTE_ACL_DFA_SIZE], uint64_t dst[RTE_ACL_DFA_SIZE])
+{
+	uint32_t i;
+
+	for (i = 0; i != RTE_DIM(node->dfa_gr64); i++) {
+		memcpy(dst + node->dfa_gr64[i] * RTE_ACL_DFA_GR64_SIZE,
+			src + i * RTE_ACL_DFA_GR64_SIZE,
+			RTE_ACL_DFA_GR64_SIZE * sizeof(dst[0]));
+	}
+}
+
+static uint32_t
+acl_dfa_count_gr64(const uint64_t array_ptr[RTE_ACL_DFA_SIZE],
+	uint8_t gr64[RTE_ACL_DFA_GR64_NUM])
+{
+	uint32_t i, j, k;
+
+	k = 0;
+	for (i = 0; i != RTE_ACL_DFA_GR64_NUM; i++) {
+		gr64[i] = i;
+		for (j = 0; j != i; j++) {
+			if (memcmp(array_ptr + i * RTE_ACL_DFA_GR64_SIZE,
+					array_ptr + j * RTE_ACL_DFA_GR64_SIZE,
+					RTE_ACL_DFA_GR64_SIZE *
+					sizeof(array_ptr[0])) == 0)
+				break;
+		}
+		gr64[i] = (j != i) ? gr64[j] : k++;
+	}
+
+	return k;
+}
+
+static uint32_t
+acl_node_fill_dfa(const struct rte_acl_node *node,
+	uint64_t dfa[RTE_ACL_DFA_SIZE], uint64_t no_match, int32_t resolved)
+{
+	uint32_t n, x;
+	uint32_t ranges, last_bit;
+	struct rte_acl_node *child;
+	struct rte_acl_bitset *bits;
+
+	ranges = 0;
+	last_bit = 0;
+
+	for (n = 0; n < RTE_ACL_DFA_SIZE; n++)
+		dfa[n] = no_match;
+
+	for (x = 0; x < node->num_ptrs; x++) {
+
+		child = node->ptrs[x].ptr;
+		if (child == NULL)
+			continue;
+
+		bits = &node->ptrs[x].values;
+		for (n = 0; n < RTE_ACL_DFA_SIZE; n++) {
+
+			if (bits->bits[n / (sizeof(bits_t) * CHAR_BIT)] &
+				(1 << (n % (sizeof(bits_t) * CHAR_BIT)))) {
+
+				dfa[n] = resolved ? child->node_index : x;
+				ranges += (last_bit == 0);
+				last_bit = 1;
+			} else {
+				last_bit = 0;
+			}
+		}
+	}
+
+	return ranges;
+}
+
 /*
 *  Counts the number of groups of sequential bits that are
 *  either 0 or 1, as specified by the zero_one parameter. This is used to
@@ -150,10 +245,11 @@ acl_count_fanout(struct rte_acl_node *node)
  */
 static int
 acl_count_trie_types(struct acl_node_counters *counts,
-	struct rte_acl_node *node, int match, int force_dfa)
+	struct rte_acl_node *node, uint64_t no_match, int match, int force_dfa)
 {
 	uint32_t n;
 	int num_ptrs;
+	uint64_t dfa[RTE_ACL_DFA_SIZE];
 
 	/* skip if this node has been counted */
 	if (node->node_type != (uint32_t)RTE_ACL_NODE_UNDEFINED)
@@ -186,6 +282,16 @@ acl_count_trie_types(struct acl_node_counters *counts,
 	} else {
 		counts->dfa++;
 		node->node_type = RTE_ACL_NODE_DFA;
+		if (force_dfa != 0) {
+			/* always expand to a max number of nodes. */
+			for (n = 0; n != RTE_DIM(node->dfa_gr64); n++)
+				node->dfa_gr64[n] = n;
+			node->fanout = n;
+		} else {
+			acl_node_fill_dfa(node, dfa, no_match, 0);
+			node->fanout = acl_dfa_count_gr64(dfa, node->dfa_gr64);
+		}
+		counts->dfa_gr64 += node->fanout;
 	}
 
 	/*
@@ -194,7 +300,7 @@ acl_count_trie_types(struct acl_node_counters *counts,
 	for (n = 0; n < node->num_ptrs; n++) {
 		if (node->ptrs[n].ptr != NULL)
 			match = acl_count_trie_types(counts, node->ptrs[n].ptr,
-				match, 0);
+				no_match, match, 0);
 	}
 
 	return match;
@@ -204,38 +310,11 @@ static void
 acl_add_ptrs(struct rte_acl_node *node, uint64_t *node_array, uint64_t no_match,
 	int resolved)
 {
-	uint32_t n, x;
-	int m, ranges, last_bit;
-	struct rte_acl_node *child;
-	struct rte_acl_bitset *bits;
+	uint32_t x;
+	int32_t m;
 	uint64_t *node_a, index, dfa[RTE_ACL_DFA_SIZE];
 
-	ranges = 0;
-	last_bit = 0;
-
-	for (n = 0; n < RTE_DIM(dfa); n++)
-		dfa[n] = no_match;
-
-	for (x = 0; x < node->num_ptrs; x++) {
-
-		child = node->ptrs[x].ptr;
-		if (child == NULL)
-			continue;
-
-		bits = &node->ptrs[x].values;
-		for (n = 0; n < RTE_DIM(dfa); n++) {
-
-			if (bits->bits[n / (sizeof(bits_t) * CHAR_BIT)] &
-				(1 << (n % (sizeof(bits_t) * CHAR_BIT)))) {
-
-				dfa[n] = resolved ? child->node_index : x;
-				ranges += (last_bit == 0);
-				last_bit = 1;
-			} else {
-				last_bit = 0;
-			}
-		}
-	}
+	acl_node_fill_dfa(node, dfa, no_match, resolved);
 
 	/*
 	 * Rather than going from 0 to 256, the range count and
@@ -272,8 +351,7 @@ acl_add_ptrs(struct rte_acl_node *node, uint64_t *node_array, uint64_t no_match,
 		RTE_ACL_VERIFY(m <= RTE_ACL_QUAD_SIZE);
 
 	} else if (node->node_type == RTE_ACL_NODE_DFA && resolved) {
-		for (n = 0; n < RTE_DIM(dfa); n++)
-			node_array[n] = dfa[n];
+		acl_dfa_fill_gr64(node, dfa, node_array);
 	}
 }
 
@@ -286,7 +364,7 @@ static void
 acl_gen_node(struct rte_acl_node *node, uint64_t *node_array,
 	uint64_t no_match, struct rte_acl_indices *index, int num_categories)
 {
-	uint32_t n, *qtrp;
+	uint32_t n, sz, *qtrp;
 	uint64_t *array_ptr;
 	struct rte_acl_match_results *match;
 
@@ -297,10 +375,11 @@ acl_gen_node(struct rte_acl_node *node, uint64_t *node_array,
 
 	switch (node->node_type) {
 	case RTE_ACL_NODE_DFA:
-		node->node_index = index->dfa_index | node->node_type;
 		array_ptr = &node_array[index->dfa_index];
-		index->dfa_index += RTE_ACL_DFA_SIZE;
-		for (n = 0; n < RTE_ACL_DFA_SIZE; n++)
+		node->node_index = acl_dfa_gen_idx(node, index->dfa_index);
+		sz = node->fanout * RTE_ACL_DFA_GR64_SIZE;
+		index->dfa_index += sz;
+		for (n = 0; n < sz; n++)
 			array_ptr[n] = no_match;
 		break;
 	case RTE_ACL_NODE_SINGLE:
@@ -312,7 +391,7 @@ acl_gen_node(struct rte_acl_node *node, uint64_t *node_array,
 		break;
 	case RTE_ACL_NODE_QRANGE:
 		array_ptr = &node_array[index->quad_index];
-		acl_add_ptrs(node, array_ptr, no_match,  0);
+		acl_add_ptrs(node, array_ptr, no_match, 0);
 		qtrp = (uint32_t *)node->transitions;
 		node->node_index = qtrp[0];
 		node->node_index <<= sizeof(index->quad_index) * CHAR_BIT;
@@ -368,7 +447,7 @@ static int
 acl_calc_counts_indices(struct acl_node_counters *counts,
 	struct rte_acl_indices *indices, struct rte_acl_trie *trie,
 	struct rte_acl_bld_trie *node_bld_trie, uint32_t num_tries,
-	int match_num)
+	int match_num, uint64_t no_match)
 {
 	uint32_t n;
 
@@ -379,13 +458,13 @@ acl_calc_counts_indices(struct acl_node_counters *counts,
 	for (n = 0; n < num_tries; n++) {
 		counts->smallest_match = INT32_MAX;
 		match_num = acl_count_trie_types(counts, node_bld_trie[n].trie,
-			match_num, 1);
+			no_match, match_num, 1);
 		trie[n].smallest = counts->smallest_match;
 	}
 
 	indices->dfa_index = RTE_ACL_DFA_SIZE + 1;
 	indices->quad_index = indices->dfa_index +
-		counts->dfa * RTE_ACL_DFA_SIZE;
+		counts->dfa_gr64 * RTE_ACL_DFA_GR64_SIZE;
 	indices->single_index = indices->quad_index + counts->quad_vectors;
 	indices->match_index = indices->single_index + counts->single + 1;
 	indices->match_index = RTE_ALIGN(indices->match_index,
@@ -410,9 +489,11 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	struct acl_node_counters counts;
 	struct rte_acl_indices indices;
 
+	no_match = RTE_ACL_NODE_MATCH;
+
 	/* Fill counts and indices arrays from the nodes. */
 	match_num = acl_calc_counts_indices(&counts, &indices, trie,
-		node_bld_trie, num_tries, match_num);
+		node_bld_trie, num_tries, match_num, no_match);
 
 	/* Allocate runtime memory (align to cache boundary) */
 	total_size = RTE_ALIGN(data_index_sz, RTE_CACHE_LINE_SIZE) +
@@ -440,11 +521,11 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	 */
 
 	node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE;
-	no_match = RTE_ACL_NODE_MATCH;
 
 	for (n = 0; n < RTE_ACL_DFA_SIZE; n++)
 		node_array[n] = no_match;
 
+	/* NOMATCH result at index 0 */
 	match = ((struct rte_acl_match_results *)(node_array + match_index));
 	memset(match, 0, sizeof(*match));
 
@@ -470,6 +551,6 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	ctx->trans_table = node_array;
 	memcpy(ctx->trie, trie, sizeof(ctx->trie));
 
-	acl_gen_log_stats(ctx, &counts);
+	acl_gen_log_stats(ctx, &counts, &indices);
 	return 0;
 }
diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
index 43c8fc3..40691ce 100644
--- a/lib/librte_acl/acl_run_scalar.c
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -94,15 +94,6 @@ resolve_priority_scalar(uint64_t transition, int n,
 	}
 }
 
-/*
- * When processing the transition, rather than using if/else
- * construct, the offset is calculated for DFA and QRANGE and
- * then conditionally added to the address based on node type.
- * This is done to avoid branch mis-predictions. Since the
- * offset is rather simple calculation it is more efficient
- * to do the calculation and do a condition move rather than
- * a conditional branch to determine which calculation to do.
- */
 static inline uint32_t
 scan_forward(uint32_t input, uint32_t max)
 {
@@ -117,18 +108,27 @@ scalar_transition(const uint64_t *trans_table, uint64_t transition,
 
 	/* break transition into component parts */
 	ranges = transition >> (sizeof(index) * CHAR_BIT);
-
-	/* calc address for a QRANGE node */
-	c = input * SCALAR_QRANGE_MULT;
-	a = ranges | SCALAR_QRANGE_MIN;
 	index = transition & ~RTE_ACL_NODE_INDEX;
-	a -= (c & SCALAR_QRANGE_MASK);
-	b = c & SCALAR_QRANGE_MIN;
 	addr = transition ^ index;
-	a &= SCALAR_QRANGE_MIN;
-	a ^= (ranges ^ b) & (a ^ b);
-	x = scan_forward(a, 32) >> 3;
-	addr += (index == RTE_ACL_NODE_DFA) ? input : x;
+
+	if (index != RTE_ACL_NODE_DFA) {
+		/* calc address for a QRANGE/SINGLE node */
+		c = (uint32_t)input * SCALAR_QRANGE_MULT;
+		a = ranges | SCALAR_QRANGE_MIN;
+		a -= (c & SCALAR_QRANGE_MASK);
+		b = c & SCALAR_QRANGE_MIN;
+		a &= SCALAR_QRANGE_MIN;
+		a ^= (ranges ^ b) & (a ^ b);
+		x = scan_forward(a, 32) >> 3;
+	} else {
+		/* calc address for a DFA node */
+		x = ranges >> (input /
+			RTE_ACL_DFA_GR64_SIZE * RTE_ACL_DFA_GR64_BIT);
+		x &= UINT8_MAX;
+		x = input - x;
+	}
+
+	addr += x;
 
 	/* pickup next transition */
 	transition = *(trans_table + addr);
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
index 69a9d77..576c92b 100644
--- a/lib/librte_acl/acl_run_sse.c
+++ b/lib/librte_acl/acl_run_sse.c
@@ -40,24 +40,6 @@ enum {
 	SHUFFLE32_SWAP64 = 0x4e,
 };
 
-static const rte_xmm_t mm_type_quad_range = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-	},
-};
-
-static const rte_xmm_t mm_type_quad_range64 = {
-	.u32 = {
-		RTE_ACL_NODE_QRANGE,
-		RTE_ACL_NODE_QRANGE,
-		0,
-		0,
-	},
-};
-
 static const rte_xmm_t mm_shuffle_input = {
 	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
 };
@@ -70,14 +52,6 @@ static const rte_xmm_t mm_ones_16 = {
 	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
 };
 
-static const rte_xmm_t mm_bytes = {
-	.u32 = {UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX},
-};
-
-static const rte_xmm_t mm_bytes64 = {
-	.u32 = {UINT8_MAX, UINT8_MAX, 0, 0},
-};
-
 static const rte_xmm_t mm_match_mask = {
 	.u32 = {
 		RTE_ACL_NODE_MATCH,
@@ -236,10 +210,14 @@ acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
  */
 static inline xmm_t
 acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	xmm_t *indices1, xmm_t *indices2)
+	xmm_t ones_16, xmm_t indices1, xmm_t indices2)
 {
-	xmm_t addr, node_types, temp;
+	xmm_t addr, node_types, range, temp;
+	xmm_t dfa_msk, dfa_ofs, quad_ofs;
+	xmm_t in, r, t;
+
+	const xmm_t range_base = _mm_set_epi32(0xffffff0c, 0xffffff08,
+		0xffffff04, 0xffffff00);
 
 	/*
 	 * Note that no transition is done for a match
@@ -248,10 +226,13 @@ acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
 	 */
 
 	/* Shuffle low 32 into temp and high 32 into indices2 */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1, (__m128)*indices2,
-		0x88);
-	*indices2 = (xmm_t)MM_SHUFFLEPS((__m128)*indices1,
-		(__m128)*indices2, 0xdd);
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0x88);
+	range = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0xdd);
+
+	t = MM_XOR(index_mask, index_mask);
+
+	/* shuffle input byte to all 4 positions of 32 bit value */
+	in = MM_SHUFFLE8(next_input, shuffle_input);
 
 	/* Calc node type and node addr */
 	node_types = MM_ANDNOT(index_mask, temp);
@@ -262,17 +243,15 @@ acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
 	 */
 
 	/* mask for DFA type (0) nodes */
-	temp = MM_CMPEQ32(node_types, MM_XOR(node_types, node_types));
+	dfa_msk = MM_CMPEQ32(node_types, t);
 
-	/* add input byte to DFA position */
-	temp = MM_AND(temp, bytes);
-	temp = MM_AND(temp, next_input);
-	addr = MM_ADD32(addr, temp);
+	r = _mm_srli_epi32(in, 30);
+	r = _mm_add_epi8(r, range_base);
 
-	/*
-	 * Calc addr for Range nodes -> range_index + range(input)
-	 */
-	node_types = MM_CMPEQ32(node_types, type_quad_range);
+	t = _mm_srli_epi32(in, 24);
+	r = _mm_shuffle_epi8(range, r);
+
+	dfa_ofs = _mm_sub_epi32(t, r);
 
 	/*
 	 * Calculate number of range boundaries that are less than the
@@ -282,11 +261,8 @@ acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
 	 * input byte.
 	 */
 
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	temp = MM_SHUFFLE8(next_input, shuffle_input);
-
 	/* check ranges */
-	temp = MM_CMPGT8(temp, *indices2);
+	temp = MM_CMPGT8(in, range);
 
 	/* convert -1 to 1 (bytes greater than input byte */
 	temp = MM_SIGN8(temp, temp);
@@ -295,10 +271,10 @@ acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
 	temp = MM_MADD8(temp, temp);
 
 	/* horizontal add pairs of words into dwords */
-	temp = MM_MADD16(temp, ones_16);
+	quad_ofs = MM_MADD16(temp, ones_16);
 
 	/* mask to range type nodes */
-	temp = MM_AND(temp, node_types);
+	temp = _mm_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
 
 	/* add index into node position */
 	return MM_ADD32(addr, temp);
@@ -309,8 +285,8 @@ acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
  */
 static inline xmm_t
 transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indices1, xmm_t *indices2)
+	xmm_t ones_16, const uint64_t *trans,
+	xmm_t *indices1, xmm_t *indices2)
 {
 	xmm_t addr;
 	uint64_t trans0, trans2;
@@ -318,7 +294,7 @@ transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
 	 /* Calculate the address (array index) for all 4 transitions. */
 
 	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indices1, indices2);
+		*indices1, *indices2);
 
 	 /* Gather 64 bit transitions and pack back into 2 registers. */
 
@@ -408,42 +384,34 @@ search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 		input0 = transition4(mm_index_mask.m, input0,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices1, &indices2);
 
 		input1 = transition4(mm_index_mask.m, input1,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices3, &indices4);
 
 		input0 = transition4(mm_index_mask.m, input0,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices1, &indices2);
 
 		input1 = transition4(mm_index_mask.m, input1,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices3, &indices4);
 
 		input0 = transition4(mm_index_mask.m, input0,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices1, &indices2);
 
 		input1 = transition4(mm_index_mask.m, input1,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices3, &indices4);
 
 		input0 = transition4(mm_index_mask.m, input0,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices1, &indices2);
 
 		input1 = transition4(mm_index_mask.m, input1,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices3, &indices4);
 
 		 /* Check for any matches. */
@@ -496,22 +464,18 @@ search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		/* Process the 4 bytes of input on each stream. */
 		input = transition4(mm_index_mask.m, input,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices1, &indices2);
 
 		 input = transition4(mm_index_mask.m, input,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices1, &indices2);
 
 		 input = transition4(mm_index_mask.m, input,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices1, &indices2);
 
 		 input = transition4(mm_index_mask.m, input,
 			mm_shuffle_input.m, mm_ones_16.m,
-			mm_bytes.m, mm_type_quad_range.m,
 			flows.trans, &indices1, &indices2);
 
 		/* Check for any matches. */
@@ -524,8 +488,7 @@ search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 static inline xmm_t
 transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t bytes, xmm_t type_quad_range,
-	const uint64_t *trans, xmm_t *indices1)
+	xmm_t ones_16, const uint64_t *trans, xmm_t *indices1)
 {
 	uint64_t t;
 	xmm_t addr, indices2;
@@ -533,7 +496,7 @@ transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
 	indices2 = MM_XOR(ones_16, ones_16);
 
 	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		bytes, type_quad_range, indices1, &indices2);
+		*indices1, indices2);
 
 	/* Gather 64 bit transitions and pack 2 per register. */
 
@@ -583,22 +546,18 @@ search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 		input = transition2(mm_index_mask64.m, input,
 			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
 			flows.trans, &indices);
 
 		input = transition2(mm_index_mask64.m, input,
 			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
 			flows.trans, &indices);
 
 		input = transition2(mm_index_mask64.m, input,
 			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
 			flows.trans, &indices);
 
 		input = transition2(mm_index_mask64.m, input,
 			mm_shuffle_input64.m, mm_ones_16.m,
-			mm_bytes64.m, mm_type_quad_range64.m,
 			flows.trans, &indices);
 
 		/* Check for any matches. */
-- 
1.8.5.3


* [dpdk-dev] [PATCH v3 07/18] librte_acl: build/gen phase - simplify the way match nodes are allocated.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (5 preceding siblings ...)
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 06/18] librte_acl: introduce DFA nodes compression (group64) for identical entries Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 08/18] librte_acl: make scalar RT code to be more similar to vector one Konstantin Ananyev
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

Right now we allocate indexes for all types of nodes, except MATCH,
at the 'gen final RT table' stage.
For MATCH type nodes we do it at the temporary-tree build stage.
This is totally unnecessary and makes the code more complex and error
prone.
Rework the code so that MATCH indexes are allocated at the same stage
as all the others.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h     |  3 +--
 lib/librte_acl/acl_bld.c |  4 +--
 lib/librte_acl/acl_gen.c | 69 ++++++++++++++++++++++--------------------------
 3 files changed, 34 insertions(+), 42 deletions(-)

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 3f6ac79..96bb318 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -146,7 +146,6 @@ enum {
 struct rte_acl_trie {
 	uint32_t        type;
 	uint32_t        count;
-	int32_t         smallest;  /* smallest rule in this trie */
 	uint32_t        root_index;
 	const uint32_t *data_index;
 	uint32_t        num_data_indexes;
@@ -181,7 +180,7 @@ struct rte_acl_ctx {
 
 int rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	struct rte_acl_bld_trie *node_bld_trie, uint32_t num_tries,
-	uint32_t num_categories, uint32_t data_index_sz, int match_num);
+	uint32_t num_categories, uint32_t data_index_sz);
 
 typedef int (*rte_acl_classify_t)
 (const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 22f7934..1fd59ee 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -1719,7 +1719,6 @@ acl_build_tries(struct acl_build_context *context,
 		context->tries[n].type = RTE_ACL_UNUSED_TRIE;
 		context->bld_tries[n].trie = NULL;
 		context->tries[n].count = 0;
-		context->tries[n].smallest = INT32_MAX;
 	}
 
 	context->tries[0].type = RTE_ACL_FULL_TRIE;
@@ -1906,8 +1905,7 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 		rc = rte_acl_gen(ctx, bcx.tries, bcx.bld_tries,
 				bcx.num_tries, bcx.cfg.num_categories,
 				RTE_ACL_MAX_FIELDS * RTE_DIM(bcx.tries) *
-				sizeof(ctx->data_indexes[0]),
-				bcx.num_build_rules + 1);
+				sizeof(ctx->data_indexes[0]));
 		if (rc == 0) {
 
 			/* set data indexes. */
diff --git a/lib/librte_acl/acl_gen.c b/lib/librte_acl/acl_gen.c
index c9b7839..d3def66 100644
--- a/lib/librte_acl/acl_gen.c
+++ b/lib/librte_acl/acl_gen.c
@@ -50,14 +50,14 @@ struct acl_node_counters {
 	int32_t quad_vectors;
 	int32_t dfa;
 	int32_t dfa_gr64;
-	int32_t smallest_match;
 };
 
 struct rte_acl_indices {
-	int                dfa_index;
-	int                quad_index;
-	int                single_index;
-	int                match_index;
+	int32_t dfa_index;
+	int32_t quad_index;
+	int32_t single_index;
+	int32_t match_index;
+	int32_t match_start;
 };
 
 static void
@@ -243,9 +243,9 @@ acl_count_fanout(struct rte_acl_node *node)
 /*
  * Determine the type of nodes and count each type
  */
-static int
+static void
 acl_count_trie_types(struct acl_node_counters *counts,
-	struct rte_acl_node *node, uint64_t no_match, int match, int force_dfa)
+	struct rte_acl_node *node, uint64_t no_match, int force_dfa)
 {
 	uint32_t n;
 	int num_ptrs;
@@ -253,16 +253,12 @@ acl_count_trie_types(struct acl_node_counters *counts,
 
 	/* skip if this node has been counted */
 	if (node->node_type != (uint32_t)RTE_ACL_NODE_UNDEFINED)
-		return match;
+		return;
 
 	if (node->match_flag != 0 || node->num_ptrs == 0) {
 		counts->match++;
-		if (node->match_flag == -1)
-			node->match_flag = match++;
 		node->node_type = RTE_ACL_NODE_MATCH;
-		if (counts->smallest_match > node->match_flag)
-			counts->smallest_match = node->match_flag;
-		return match;
+		return;
 	}
 
 	num_ptrs = acl_count_fanout(node);
@@ -299,11 +295,9 @@ acl_count_trie_types(struct acl_node_counters *counts,
 	 */
 	for (n = 0; n < node->num_ptrs; n++) {
 		if (node->ptrs[n].ptr != NULL)
-			match = acl_count_trie_types(counts, node->ptrs[n].ptr,
-				no_match, match, 0);
+			acl_count_trie_types(counts, node->ptrs[n].ptr,
+				no_match, 0);
 	}
-
-	return match;
 }
 
 static void
@@ -400,9 +394,13 @@ acl_gen_node(struct rte_acl_node *node, uint64_t *node_array,
 		break;
 	case RTE_ACL_NODE_MATCH:
 		match = ((struct rte_acl_match_results *)
-			(node_array + index->match_index));
-		memcpy(match + node->match_flag, node->mrt, sizeof(*node->mrt));
-		node->node_index = node->match_flag | node->node_type;
+			(node_array + index->match_start));
+		for (n = 0; n != RTE_DIM(match->results); n++)
+			RTE_ACL_VERIFY(match->results[0] == 0);
+		memcpy(match + index->match_index, node->mrt,
+			sizeof(*node->mrt));
+		node->node_index = index->match_index | node->node_type;
+		index->match_index += 1;
 		break;
 	case RTE_ACL_NODE_UNDEFINED:
 		RTE_ACL_VERIFY(node->node_type !=
@@ -443,11 +441,11 @@ acl_gen_node(struct rte_acl_node *node, uint64_t *node_array,
 	}
 }
 
-static int
+static void
 acl_calc_counts_indices(struct acl_node_counters *counts,
-	struct rte_acl_indices *indices, struct rte_acl_trie *trie,
+	struct rte_acl_indices *indices,
 	struct rte_acl_bld_trie *node_bld_trie, uint32_t num_tries,
-	int match_num, uint64_t no_match)
+	uint64_t no_match)
 {
 	uint32_t n;
 
@@ -456,21 +454,18 @@ acl_calc_counts_indices(struct acl_node_counters *counts,
 
 	/* Get stats on nodes */
 	for (n = 0; n < num_tries; n++) {
-		counts->smallest_match = INT32_MAX;
-		match_num = acl_count_trie_types(counts, node_bld_trie[n].trie,
-			no_match, match_num, 1);
-		trie[n].smallest = counts->smallest_match;
+		acl_count_trie_types(counts, node_bld_trie[n].trie,
+			no_match, 1);
 	}
 
 	indices->dfa_index = RTE_ACL_DFA_SIZE + 1;
 	indices->quad_index = indices->dfa_index +
 		counts->dfa_gr64 * RTE_ACL_DFA_GR64_SIZE;
 	indices->single_index = indices->quad_index + counts->quad_vectors;
-	indices->match_index = indices->single_index + counts->single + 1;
-	indices->match_index = RTE_ALIGN(indices->match_index,
+	indices->match_start = indices->single_index + counts->single + 1;
+	indices->match_start = RTE_ALIGN(indices->match_start,
 		(XMM_SIZE / sizeof(uint64_t)));
-
-	return match_num;
+	indices->match_index = 1;
 }
 
 /*
@@ -479,7 +474,7 @@ acl_calc_counts_indices(struct acl_node_counters *counts,
 int
 rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	struct rte_acl_bld_trie *node_bld_trie, uint32_t num_tries,
-	uint32_t num_categories, uint32_t data_index_sz, int match_num)
+	uint32_t num_categories, uint32_t data_index_sz)
 {
 	void *mem;
 	size_t total_size;
@@ -492,13 +487,13 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	no_match = RTE_ACL_NODE_MATCH;
 
 	/* Fill counts and indices arrays from the nodes. */
-	match_num = acl_calc_counts_indices(&counts, &indices, trie,
-		node_bld_trie, num_tries, match_num, no_match);
+	acl_calc_counts_indices(&counts, &indices,
+		node_bld_trie, num_tries, no_match);
 
 	/* Allocate runtime memory (align to cache boundary) */
 	total_size = RTE_ALIGN(data_index_sz, RTE_CACHE_LINE_SIZE) +
-		indices.match_index * sizeof(uint64_t) +
-		(match_num + 2) * sizeof(struct rte_acl_match_results) +
+		indices.match_start * sizeof(uint64_t) +
+		(counts.match + 1) * sizeof(struct rte_acl_match_results) +
 		XMM_SIZE;
 
 	mem = rte_zmalloc_socket(ctx->name, total_size, RTE_CACHE_LINE_SIZE,
@@ -511,7 +506,7 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	}
 
 	/* Fill the runtime structure */
-	match_index = indices.match_index;
+	match_index = indices.match_start;
 	node_array = (uint64_t *)((uintptr_t)mem +
 		RTE_ALIGN(data_index_sz, RTE_CACHE_LINE_SIZE));
 
-- 
1.8.5.3


* [dpdk-dev] [PATCH v3 08/18] librte_acl: make scalar RT code to be more similar to vector one.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (6 preceding siblings ...)
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 07/18] librte_acl: build/gen phase - simplify the way match nodes are allocated Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 09/18] librte_acl: a bit of RT code deduplication Konstantin Ananyev
                   ` (12 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

Make classify_scalar behave in the same way as its vector counterpart:
move the match check out of the inner loop, etc.
That makes the scalar and vector code look more alike,
and it also improves scalar code performance.
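
For illustration, here is a minimal self-contained C sketch of the
control-flow change (a toy state machine with hypothetical
toy_transition()/toy_match_check() helpers, not the real ACL types):
the per-byte match guard is dropped from the inner loop, and any
pending matches are drained once per 4-byte chunk instead.

/*
 * Toy sketch only: mirrors the new loop structure, not the actual
 * rte_acl transition semantics.
 */
#include <stdint.h>
#include <stddef.h>

#define TOY_MATCH 0x80000000u /* stands in for RTE_ACL_NODE_MATCH */

static uint32_t
toy_transition(uint32_t state, uint8_t in)
{
	/* hypothetical per-byte transition */
	return ((state * 33 + in) & ~TOY_MATCH) |
		(in == 0xff ? TOY_MATCH : 0);
}

static uint32_t
toy_match_check(uint32_t state)
{
	/* hypothetical resolver: consume the match, clear the flag */
	return state & ~TOY_MATCH;
}

void
toy_classify(const uint8_t *data, size_t len)
{
	uint32_t state = 0;
	size_t i, n;

	for (i = 0; i + 4 <= len; i += 4) {
		/* the inner loop now transitions unconditionally... */
		for (n = 0; n != 4; n++)
			state = toy_transition(state, data[i + n]);
		/* ...and matches are drained once per 4-byte chunk */
		while (state & TOY_MATCH)
			state = toy_match_check(state);
	}
}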

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_run_scalar.c | 23 +++++++++++++----------
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
index 40691ce..9935125 100644
--- a/lib/librte_acl/acl_run_scalar.c
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -162,31 +162,34 @@ rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	transition0 = index_array[0];
 	transition1 = index_array[1];
 
+	while ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
+		transition0 = acl_match_check(transition0,
+			0, ctx, parms, &flows, resolve_priority_scalar);
+		transition1 = acl_match_check(transition1,
+			1, ctx, parms, &flows, resolve_priority_scalar);
+	}
+
 	while (flows.started > 0) {
 
 		input0 = GET_NEXT_4BYTES(parms, 0);
 		input1 = GET_NEXT_4BYTES(parms, 1);
 
 		for (n = 0; n < 4; n++) {
-			if (likely((transition0 & RTE_ACL_NODE_MATCH) == 0))
-				transition0 = scalar_transition(flows.trans,
-					transition0, (uint8_t)input0);
 
+			transition0 = scalar_transition(flows.trans,
+				transition0, (uint8_t)input0);
 			input0 >>= CHAR_BIT;
 
-			if (likely((transition1 & RTE_ACL_NODE_MATCH) == 0))
-				transition1 = scalar_transition(flows.trans,
-					transition1, (uint8_t)input1);
-
+			transition1 = scalar_transition(flows.trans,
+				transition1, (uint8_t)input1);
 			input1 >>= CHAR_BIT;
-
 		}
-		if ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
+
+		while ((transition0 | transition1) & RTE_ACL_NODE_MATCH) {
 			transition0 = acl_match_check(transition0,
 				0, ctx, parms, &flows, resolve_priority_scalar);
 			transition1 = acl_match_check(transition1,
 				1, ctx, parms, &flows, resolve_priority_scalar);
-
 		}
 	}
 	return 0;
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 09/18] librte_acl: a bit of RT code deduplication.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (7 preceding siblings ...)
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 08/18] librte_acl: make scalar RT code to be more similar to vector one Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 10/18] EAL: introduce rte_ymm and relatives in rte_common_vect.h Konstantin Ananyev
                   ` (11 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

Move the common check of input parameters up into rte_acl_classify_alg().
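
A short usage sketch (classify_one_category() is a hypothetical
wrapper; it assumes ctx is a built ACL context and data/results are
sized for num lookups): since the check now lives in
rte_acl_classify_alg(), an invalid category count is rejected before
any algorithm-specific code runs, whichever method is requested.

#include <rte_acl.h>

static int
classify_one_category(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num)
{
	/*
	 * categories must be 1 or a multiple of
	 * RTE_ACL_RESULTS_MULTIPLIER; anything else now fails with
	 * -EINVAL here, for scalar and vector methods alike.
	 */
	return rte_acl_classify_alg(ctx, data, results, num, 1,
		RTE_ACL_CLASSIFY_SCALAR);
}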

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_run_scalar.c |  4 ----
 lib/librte_acl/acl_run_sse.c    |  4 ----
 lib/librte_acl/rte_acl.c        | 19 ++++++++++++-------
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/lib/librte_acl/acl_run_scalar.c b/lib/librte_acl/acl_run_scalar.c
index 9935125..5be216c 100644
--- a/lib/librte_acl/acl_run_scalar.c
+++ b/lib/librte_acl/acl_run_scalar.c
@@ -147,10 +147,6 @@ rte_acl_classify_scalar(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	struct completion cmplt[MAX_SEARCHES_SCALAR];
 	struct parms parms[MAX_SEARCHES_SCALAR];
 
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
 	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results, num,
 		categories, ctx->trans_table);
 
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
index 576c92b..09e32be 100644
--- a/lib/librte_acl/acl_run_sse.c
+++ b/lib/librte_acl/acl_run_sse.c
@@ -572,10 +572,6 @@ int
 rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
-	if (categories != 1 &&
-		((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
-		return -EINVAL;
-
 	if (likely(num >= MAX_SEARCHES_SSE8))
 		return search_sse_8(ctx, data, results, num, categories);
 	else if (num >= MAX_SEARCHES_SSE4)
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 547e6da..a16c4a4 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -76,20 +76,25 @@ rte_acl_init(void)
 }
 
 int
-rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t num, uint32_t categories)
-{
-	return classify_fns[ctx->alg](ctx, data, results, num, categories);
-}
-
-int
 rte_acl_classify_alg(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories,
 	enum rte_acl_classify_alg alg)
 {
+	if (categories != 1 &&
+			((RTE_ACL_RESULTS_MULTIPLIER - 1) & categories) != 0)
+		return -EINVAL;
+
 	return classify_fns[alg](ctx, data, results, num, categories);
 }
 
+int
+rte_acl_classify(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	return rte_acl_classify_alg(ctx, data, results, num, categories,
+		ctx->alg);
+}
+
 struct rte_acl_ctx *
 rte_acl_find_existing(const char *name)
 {
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 10/18] EAL: introduce rte_ymm and relatives in rte_common_vect.h.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (8 preceding siblings ...)
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 09/18] librte_acl: a bit of RT code deduplication Konstantin Ananyev
@ 2015-01-20 18:40 ` Konstantin Ananyev
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 11/18] librte_acl: add AVX2 as new rte_acl_classify() method Konstantin Ananyev
                   ` (10 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:40 UTC (permalink / raw)
  To: dev

Introduce a new data type to manipulate 256-bit AVX values.
Rename the field in rte_xmm to keep the naming consistent across
SSE/AVX fields.
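
A minimal usage sketch of the new union (sum_lanes() is a hypothetical
helper; it assumes an AVX-capable build, e.g. -mavx): the same 256-bit
value can be viewed as a single ymm register, as xmm halves, or as
arrays of scalar lanes.

#include <stdint.h>
#include <rte_common_vect.h>

#ifdef __AVX__
static uint32_t
sum_lanes(ymm_t v)
{
	rte_ymm_t t;
	uint32_t i, sum;

	t.y = v;	/* 256-bit register view */
	for (i = 0, sum = 0; i != YMM_SIZE / sizeof(uint32_t); i++)
		sum += t.u32[i];	/* 32-bit lane view */
	return sum;
}
#endif /* __AVX__ */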

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 examples/l3fwd/main.c                           |  2 +-
 lib/librte_acl/acl_run_sse.c                    | 88 ++++++++++++-------------
 lib/librte_acl/rte_acl_osdep_alone.h            | 35 +++++++++-
 lib/librte_eal/common/include/rte_common_vect.h | 27 +++++++-
 lib/librte_lpm/rte_lpm.h                        |  2 +-
 5 files changed, 104 insertions(+), 50 deletions(-)

diff --git a/examples/l3fwd/main.c b/examples/l3fwd/main.c
index 918f2cb..6f7d7d4 100644
--- a/examples/l3fwd/main.c
+++ b/examples/l3fwd/main.c
@@ -1170,7 +1170,7 @@ processx4_step2(const struct lcore_conf *qconf, __m128i dip, uint32_t flag,
 	if (likely(flag != 0)) {
 		rte_lpm_lookupx4(qconf->ipv4_lookup_struct, dip, dprt, portid);
 	} else {
-		dst.m = dip;
+		dst.x = dip;
 		dprt[0] = get_dst_port(qconf, pkt[0], dst.u32[0], portid);
 		dprt[1] = get_dst_port(qconf, pkt[1], dst.u32[1], portid);
 		dprt[2] = get_dst_port(qconf, pkt[2], dst.u32[2], portid);
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
index 09e32be..4605b58 100644
--- a/lib/librte_acl/acl_run_sse.c
+++ b/lib/librte_acl/acl_run_sse.c
@@ -359,16 +359,16 @@ search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	 /* Check for any matches. */
 	acl_match_check_x4(0, ctx, parms, &flows,
-		&indices1, &indices2, mm_match_mask.m);
+		&indices1, &indices2, mm_match_mask.x);
 	acl_match_check_x4(4, ctx, parms, &flows,
-		&indices3, &indices4, mm_match_mask.m);
+		&indices3, &indices4, mm_match_mask.x);
 
 	while (flows.started > 0) {
 
 		/* Gather 4 bytes of input data for each stream. */
-		input0 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0),
+		input0 = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 0),
 			0);
-		input1 = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 4),
+		input1 = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 4),
 			0);
 
 		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
@@ -382,43 +382,43 @@ search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 		 /* Process the 4 bytes of input on each stream. */
 
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
+		input0 = transition4(mm_index_mask.x, input0,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices1, &indices2);
 
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
+		input1 = transition4(mm_index_mask.x, input1,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices3, &indices4);
 
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
+		input0 = transition4(mm_index_mask.x, input0,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices1, &indices2);
 
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
+		input1 = transition4(mm_index_mask.x, input1,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices3, &indices4);
 
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
+		input0 = transition4(mm_index_mask.x, input0,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices1, &indices2);
 
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
+		input1 = transition4(mm_index_mask.x, input1,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices3, &indices4);
 
-		input0 = transition4(mm_index_mask.m, input0,
-			mm_shuffle_input.m, mm_ones_16.m,
+		input0 = transition4(mm_index_mask.x, input0,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices1, &indices2);
 
-		input1 = transition4(mm_index_mask.m, input1,
-			mm_shuffle_input.m, mm_ones_16.m,
+		input1 = transition4(mm_index_mask.x, input1,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices3, &indices4);
 
 		 /* Check for any matches. */
 		acl_match_check_x4(0, ctx, parms, &flows,
-			&indices1, &indices2, mm_match_mask.m);
+			&indices1, &indices2, mm_match_mask.x);
 		acl_match_check_x4(4, ctx, parms, &flows,
-			&indices3, &indices4, mm_match_mask.m);
+			&indices3, &indices4, mm_match_mask.x);
 	}
 
 	return 0;
@@ -451,36 +451,36 @@ search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	/* Check for any matches. */
 	acl_match_check_x4(0, ctx, parms, &flows,
-		&indices1, &indices2, mm_match_mask.m);
+		&indices1, &indices2, mm_match_mask.x);
 
 	while (flows.started > 0) {
 
 		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 0), 0);
 		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
 		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
 		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
 
 		/* Process the 4 bytes of input on each stream. */
-		input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
+		input = transition4(mm_index_mask.x, input,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices1, &indices2);
 
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
+		 input = transition4(mm_index_mask.x, input,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices1, &indices2);
 
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
+		 input = transition4(mm_index_mask.x, input,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices1, &indices2);
 
-		 input = transition4(mm_index_mask.m, input,
-			mm_shuffle_input.m, mm_ones_16.m,
+		 input = transition4(mm_index_mask.x, input,
+			mm_shuffle_input.x, mm_ones_16.x,
 			flows.trans, &indices1, &indices2);
 
 		/* Check for any matches. */
 		acl_match_check_x4(0, ctx, parms, &flows,
-			&indices1, &indices2, mm_match_mask.m);
+			&indices1, &indices2, mm_match_mask.x);
 	}
 
 	return 0;
@@ -534,35 +534,35 @@ search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	indices = MM_LOADU((xmm_t *) &index_array[0]);
 
 	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indices, mm_match_mask64.m);
+	acl_match_check_x2(0, ctx, parms, &flows, &indices, mm_match_mask64.x);
 
 	while (flows.started > 0) {
 
 		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.m, GET_NEXT_4BYTES(parms, 0), 0);
+		input = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 0), 0);
 		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
 
 		/* Process the 4 bytes of input on each stream. */
 
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
+		input = transition2(mm_index_mask64.x, input,
+			mm_shuffle_input64.x, mm_ones_16.x,
 			flows.trans, &indices);
 
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
+		input = transition2(mm_index_mask64.x, input,
+			mm_shuffle_input64.x, mm_ones_16.x,
 			flows.trans, &indices);
 
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
+		input = transition2(mm_index_mask64.x, input,
+			mm_shuffle_input64.x, mm_ones_16.x,
 			flows.trans, &indices);
 
-		input = transition2(mm_index_mask64.m, input,
-			mm_shuffle_input64.m, mm_ones_16.m,
+		input = transition2(mm_index_mask64.x, input,
+			mm_shuffle_input64.x, mm_ones_16.x,
 			flows.trans, &indices);
 
 		/* Check for any matches. */
 		acl_match_check_x2(0, ctx, parms, &flows, &indices,
-			mm_match_mask64.m);
+			mm_match_mask64.x);
 	}
 
 	return 0;
diff --git a/lib/librte_acl/rte_acl_osdep_alone.h b/lib/librte_acl/rte_acl_osdep_alone.h
index 2a99860..58c4f6a 100644
--- a/lib/librte_acl/rte_acl_osdep_alone.h
+++ b/lib/librte_acl/rte_acl_osdep_alone.h
@@ -57,6 +57,10 @@
 #include <smmintrin.h>
 #endif
 
+#if defined(__AVX__)
+#include <immintrin.h>
+#endif
+
 #else
 
 #include <x86intrin.h>
@@ -128,8 +132,8 @@ typedef __m128i xmm_t;
 #define	XMM_SIZE	(sizeof(xmm_t))
 #define	XMM_MASK	(XMM_SIZE - 1)
 
-typedef union rte_mmsse {
-	xmm_t    m;
+typedef union rte_xmm {
+	xmm_t    x;
 	uint8_t  u8[XMM_SIZE / sizeof(uint8_t)];
 	uint16_t u16[XMM_SIZE / sizeof(uint16_t)];
 	uint32_t u32[XMM_SIZE / sizeof(uint32_t)];
@@ -137,6 +141,33 @@ typedef union rte_mmsse {
 	double   pd[XMM_SIZE / sizeof(double)];
 } rte_xmm_t;
 
+#ifdef __AVX__
+
+typedef __m256i ymm_t;
+
+#define	YMM_SIZE	(sizeof(ymm_t))
+#define	YMM_MASK	(YMM_SIZE - 1)
+
+typedef union rte_ymm {
+	ymm_t    y;
+	xmm_t    x[YMM_SIZE / sizeof(xmm_t)];
+	uint8_t  u8[YMM_SIZE / sizeof(uint8_t)];
+	uint16_t u16[YMM_SIZE / sizeof(uint16_t)];
+	uint32_t u32[YMM_SIZE / sizeof(uint32_t)];
+	uint64_t u64[YMM_SIZE / sizeof(uint64_t)];
+	double   pd[YMM_SIZE / sizeof(double)];
+} rte_ymm_t;
+
+#endif /* __AVX__ */
+
+#ifdef RTE_ARCH_I686
+#define _mm_cvtsi128_si64(a) ({ \
+	rte_xmm_t m;            \
+	m.x = (a);              \
+	(m.u64[0]);             \
+})
+#endif
+
 /*
  * rte_cycles related.
  */
diff --git a/lib/librte_eal/common/include/rte_common_vect.h b/lib/librte_eal/common/include/rte_common_vect.h
index 95bf4b1..617470b 100644
--- a/lib/librte_eal/common/include/rte_common_vect.h
+++ b/lib/librte_eal/common/include/rte_common_vect.h
@@ -54,6 +54,10 @@
 #include <smmintrin.h>
 #endif
 
+#if defined(__AVX__)
+#include <immintrin.h>
+#endif
+
 #else
 
 #include <x86intrin.h>
@@ -70,7 +74,7 @@ typedef __m128i xmm_t;
 #define	XMM_MASK	(XMM_SIZE - 1)
 
 typedef union rte_xmm {
-	xmm_t    m;
+	xmm_t    x;
 	uint8_t  u8[XMM_SIZE / sizeof(uint8_t)];
 	uint16_t u16[XMM_SIZE / sizeof(uint16_t)];
 	uint32_t u32[XMM_SIZE / sizeof(uint32_t)];
@@ -78,10 +82,29 @@ typedef union rte_xmm {
 	double   pd[XMM_SIZE / sizeof(double)];
 } rte_xmm_t;
 
+#ifdef __AVX__
+
+typedef __m256i ymm_t;
+
+#define	YMM_SIZE	(sizeof(ymm_t))
+#define	YMM_MASK	(YMM_SIZE - 1)
+
+typedef union rte_ymm {
+	ymm_t    y;
+	xmm_t    x[YMM_SIZE / sizeof(xmm_t)];
+	uint8_t  u8[YMM_SIZE / sizeof(uint8_t)];
+	uint16_t u16[YMM_SIZE / sizeof(uint16_t)];
+	uint32_t u32[YMM_SIZE / sizeof(uint32_t)];
+	uint64_t u64[YMM_SIZE / sizeof(uint64_t)];
+	double   pd[YMM_SIZE / sizeof(double)];
+} rte_ymm_t;
+
+#endif /* __AVX__ */
+
 #ifdef RTE_ARCH_I686
 #define _mm_cvtsi128_si64(a) ({ \
 	rte_xmm_t m;            \
-	m.m = (a);              \
+	m.x = (a);              \
 	(m.u64[0]);             \
 })
 #endif
diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h
index 62d7736..586300b 100644
--- a/lib/librte_lpm/rte_lpm.h
+++ b/lib/librte_lpm/rte_lpm.h
@@ -420,7 +420,7 @@ rte_lpm_lookupx4(const struct rte_lpm *lpm, __m128i ip, uint16_t hop[4],
 	tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32];
 
 	/* get 4 indexes for tbl8[]. */
-	i8.m = _mm_and_si128(ip, mask8);
+	i8.x = _mm_and_si128(ip, mask8);
 
 	pt = (uint64_t)tbl[0] |
 		(uint64_t)tbl[1] << 16 |
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 11/18] librte_acl: add AVX2 as new rte_acl_classify() method
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (9 preceding siblings ...)
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 10/18] EAL: introduce rte_ymm and relatives in rte_common_vect.h Konstantin Ananyev
@ 2015-01-20 18:41 ` Konstantin Ananyev
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 12/18] test-acl: add ability to manually select RT method Konstantin Ananyev
                   ` (9 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:41 UTC (permalink / raw)
  To: dev

v2 changes:
When built with compilers that don't support AVX2 instructions,
make rte_acl_classify_avx2() do nothing and return an error.
Remove unneeded 'ifdef __AVX2__' in acl_run_avx2.*.

Introduce a new classify() method that uses AVX2 instructions.
From my measurements:
on HSW boards, when processing >= 16 packets per call,
the AVX2 method outperforms its SSE counterpart by 10-25%
(depending on the ruleset).
At runtime, if librte_acl was built with a compiler that supports AVX2,
this method is selected as the default one on HW that supports AVX2.
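
A selection sketch (classify_prefer_avx2() is a hypothetical wrapper;
it assumes ctx was built with rte_acl_build()): request the new
RTE_ACL_CLASSIFY_AVX2 method explicitly and fall back to the context's
default method when this build's AVX2 stub reports an error.

#include <rte_acl.h>

static int
classify_prefer_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	int ret = rte_acl_classify_alg(ctx, data, results, num,
		categories, RTE_ACL_CLASSIFY_AVX2);

	/* a non-AVX2 build returns an error here;
	 * fall back to the per-context default method. */
	if (ret != 0)
		ret = rte_acl_classify(ctx, data, results, num, categories);
	return ret;
}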

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/Makefile       |  18 ++
 lib/librte_acl/acl.h          |   4 +
 lib/librte_acl/acl_run.h      |   2 +-
 lib/librte_acl/acl_run_avx2.c |  54 +++++
 lib/librte_acl/acl_run_avx2.h | 301 +++++++++++++++++++++++
 lib/librte_acl/acl_run_sse.c  | 537 +-----------------------------------------
 lib/librte_acl/acl_run_sse.h  | 533 +++++++++++++++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c      |  27 +++
 lib/librte_acl/rte_acl.h      |   2 +
 9 files changed, 941 insertions(+), 537 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx2.c
 create mode 100644 lib/librte_acl/acl_run_avx2.h
 create mode 100644 lib/librte_acl/acl_run_sse.h

diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index 65e566d..6b74dc9 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -48,6 +48,24 @@ SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_sse.c
 
 CFLAGS_acl_run_sse.o += -msse4.1
 
+#
+# If the compiler supports AVX2 instructions,
+# then add support for AVX2 classify method.
+#
+
+CC_AVX2_SUPPORT=$(shell $(CC) -march=core-avx2 -dM -E - </dev/null 2>&1 | \
+grep -q AVX2 && echo 1)
+
+ifeq ($(CC_AVX2_SUPPORT), 1)
+	SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx2.c
+	CFLAGS_rte_acl.o += -DCC_AVX2_SUPPORT
+	ifeq ($(CC), icc)
+	CFLAGS_acl_run_avx2.o += -march=core-avx2
+	else
+	CFLAGS_acl_run_avx2.o += -mavx2
+	endif
+endif
+
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include += rte_acl.h
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 96bb318..d33d7ad 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -196,6 +196,10 @@ int
 rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
+int
+rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
 #ifdef __cplusplus
 }
 #endif /* __cplusplus */
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
index 4c843c1..850bc81 100644
--- a/lib/librte_acl/acl_run.h
+++ b/lib/librte_acl/acl_run.h
@@ -35,9 +35,9 @@
 #define	_ACL_RUN_H_
 
 #include <rte_acl.h>
-#include "acl_vect.h"
 #include "acl.h"
 
+#define MAX_SEARCHES_AVX16	16
 #define MAX_SEARCHES_SSE8	8
 #define MAX_SEARCHES_SSE4	4
 #define MAX_SEARCHES_SSE2	2
diff --git a/lib/librte_acl/acl_run_avx2.c b/lib/librte_acl/acl_run_avx2.c
new file mode 100644
index 0000000..0a42f72
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx2.c
@@ -0,0 +1,54 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+
+#include "acl_run_avx2.h"
+
+/*
+ * Note that, to be able to use the AVX2 classify method,
+ * both the compiler and the target CPU have to support AVX2 instructions.
+ */
+int
+rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (likely(num >= MAX_SEARCHES_AVX16))
+		return search_avx2x16(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE8)
+		return search_sse_8(ctx, data, results, num, categories);
+	else if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+	else
+		return search_sse_2(ctx, data, results, num,
+			categories);
+}
diff --git a/lib/librte_acl/acl_run_avx2.h b/lib/librte_acl/acl_run_avx2.h
new file mode 100644
index 0000000..1688c50
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx2.h
@@ -0,0 +1,301 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run_sse.h"
+
+static const rte_ymm_t ymm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_ymm_t ymm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_ymm_t ymm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const rte_ymm_t ymm_ones_16 = {
+	.u16 = {
+		1, 1, 1, 1, 1, 1, 1, 1,
+		1, 1, 1, 1, 1, 1, 1, 1,
+	},
+};
+
+static inline __attribute__((always_inline)) ymm_t
+calc_addr_avx2(ymm_t index_mask, ymm_t next_input, ymm_t shuffle_input,
+	ymm_t ones_16, ymm_t tr_lo, ymm_t tr_hi)
+{
+	ymm_t in, node_type, r, t;
+	ymm_t dfa_msk, dfa_ofs, quad_ofs;
+	ymm_t addr;
+
+	const ymm_t range_base = _mm256_set_epi32(
+		0xffffff0c, 0xffffff08, 0xffffff04, 0xffffff00,
+		0xffffff0c, 0xffffff08, 0xffffff04, 0xffffff00);
+
+	t = _mm256_xor_si256(index_mask, index_mask);
+	in = _mm256_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm256_andnot_si256(index_mask, tr_lo);
+	addr = _mm256_and_si256(index_mask, tr_lo);
+
+	/* DFA calculations. */
+
+	dfa_msk = _mm256_cmpeq_epi32(node_type, t);
+
+	r = _mm256_srli_epi32(in, 30);
+	r = _mm256_add_epi8(r, range_base);
+
+	t = _mm256_srli_epi32(in, 24);
+	r = _mm256_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm256_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+
+	t = _mm256_cmpgt_epi8(in, tr_hi);
+	t = _mm256_sign_epi8(t, t);
+	t = _mm256_maddubs_epi16(t, t);
+	quad_ofs = _mm256_madd_epi16(t, ones_16);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm256_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
+
+	addr = _mm256_add_epi32(addr, t);
+	return addr;
+}
+
+static inline __attribute__((always_inline)) ymm_t
+transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	addr = calc_addr_avx2(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
+		ymm_ones_16.y, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 8 transitions at once. */
+	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
+
+	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 8 transitions at once. */
+	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
+
+	return next_input;
+}
+
+static inline void
+acl_process_matches_avx2x8(const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows, uint32_t slot,
+	ymm_t matches, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	ymm_t t0, t1;
+	ymm_t lo, hi;
+	xmm_t l0, l1;
+	uint32_t i;
+	uint64_t tr[MAX_SEARCHES_SSE8];
+
+	l1 = _mm256_extracti128_si256(*tr_lo, 1);
+	l0 = _mm256_castsi256_si128(*tr_lo);
+
+	for (i = 0; i != RTE_DIM(tr) / 2; i++) {
+		tr[i] = (uint32_t)_mm_cvtsi128_si32(l0);
+		tr[i + 4] = (uint32_t)_mm_cvtsi128_si32(l1);
+
+		l0 = _mm_srli_si128(l0, sizeof(uint32_t));
+		l1 = _mm_srli_si128(l1, sizeof(uint32_t));
+
+		tr[i] = acl_match_check(tr[i], slot + i,
+			ctx, parms, flows, resolve_priority_sse);
+		tr[i + 4] = acl_match_check(tr[i + 4], slot + i + 4,
+			ctx, parms, flows, resolve_priority_sse);
+	}
+
+	t0 = _mm256_set_epi64x(tr[5], tr[4], tr[1], tr[0]);
+	t1 = _mm256_set_epi64x(tr[7], tr[6], tr[3], tr[2]);
+
+	lo = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0x88);
+	hi = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0xdd);
+
+	*tr_lo = _mm256_blendv_epi8(*tr_lo, lo, matches);
+	*tr_hi = _mm256_blendv_epi8(*tr_hi, hi, matches);
+}
+
+static inline void
+acl_match_check_avx2x8(const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, uint32_t slot,
+	ymm_t *tr_lo, ymm_t *tr_hi, ymm_t match_mask)
+{
+	uint32_t msk;
+	ymm_t matches, temp;
+
+	/* test for match node */
+	temp = _mm256_and_si256(match_mask, *tr_lo);
+	matches = _mm256_cmpeq_epi32(temp, match_mask);
+	msk = _mm256_movemask_epi8(matches);
+
+	while (msk != 0) {
+
+		acl_process_matches_avx2x8(ctx, parms, flows, slot,
+			matches, tr_lo, tr_hi);
+		temp = _mm256_and_si256(match_mask, *tr_lo);
+		matches = _mm256_cmpeq_epi32(temp, match_mask);
+		msk = _mm256_movemask_epi8(matches);
+	}
+}
+
+static inline int
+search_avx2x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_AVX16];
+	struct completion cmplt[MAX_SEARCHES_AVX16];
+	struct parms parms[MAX_SEARCHES_AVX16];
+	ymm_t input[2], tr_lo[2], tr_hi[2];
+	ymm_t t0, t1;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < RTE_DIM(cmplt); n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	t0 = _mm256_set_epi64x(index_array[5], index_array[4],
+		index_array[1], index_array[0]);
+	t1 = _mm256_set_epi64x(index_array[7], index_array[6],
+		index_array[3], index_array[2]);
+
+	tr_lo[0] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0x88);
+	tr_hi[0] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0xdd);
+
+	t0 = _mm256_set_epi64x(index_array[13], index_array[12],
+		index_array[9], index_array[8]);
+	t1 = _mm256_set_epi64x(index_array[15], index_array[14],
+		index_array[11], index_array[10]);
+
+	tr_lo[1] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0x88);
+	tr_hi[1] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0xdd);
+
+	 /* Check for any matches. */
+	acl_match_check_avx2x8(ctx, parms, &flows, 0, &tr_lo[0], &tr_hi[0],
+		ymm_match_mask.y);
+	acl_match_check_avx2x8(ctx, parms, &flows, 8, &tr_lo[1], &tr_hi[1],
+		ymm_match_mask.y);
+
+	while (flows.started > 0) {
+
+		uint32_t in[MAX_SEARCHES_SSE8];
+
+		/* Gather 4 bytes of input data for first 8 flows. */
+		in[0] = GET_NEXT_4BYTES(parms, 0);
+		in[4] = GET_NEXT_4BYTES(parms, 4);
+		in[1] = GET_NEXT_4BYTES(parms, 1);
+		in[5] = GET_NEXT_4BYTES(parms, 5);
+		in[2] = GET_NEXT_4BYTES(parms, 2);
+		in[6] = GET_NEXT_4BYTES(parms, 6);
+		in[3] = GET_NEXT_4BYTES(parms, 3);
+		in[7] = GET_NEXT_4BYTES(parms, 7);
+		input[0] = _mm256_set_epi32(in[7], in[6], in[5], in[4],
+			in[3], in[2], in[1], in[0]);
+
+		/* Gather 4 bytes of input data for last 8 flows. */
+		in[0] = GET_NEXT_4BYTES(parms, 8);
+		in[4] = GET_NEXT_4BYTES(parms, 12);
+		in[1] = GET_NEXT_4BYTES(parms, 9);
+		in[5] = GET_NEXT_4BYTES(parms, 13);
+		in[2] = GET_NEXT_4BYTES(parms, 10);
+		in[6] = GET_NEXT_4BYTES(parms, 14);
+		in[3] = GET_NEXT_4BYTES(parms, 11);
+		in[7] = GET_NEXT_4BYTES(parms, 15);
+		input[1] = _mm256_set_epi32(in[7], in[6], in[5], in[4],
+			in[3], in[2], in[1], in[0]);
+
+		input[0] = transition8(input[0], flows.trans,
+			&tr_lo[0], &tr_hi[0]);
+		input[1] = transition8(input[1], flows.trans,
+			&tr_lo[1], &tr_hi[1]);
+
+		input[0] = transition8(input[0], flows.trans,
+			&tr_lo[0], &tr_hi[0]);
+		input[1] = transition8(input[1], flows.trans,
+			&tr_lo[1], &tr_hi[1]);
+
+		input[0] = transition8(input[0], flows.trans,
+			&tr_lo[0], &tr_hi[0]);
+		input[1] = transition8(input[1], flows.trans,
+			&tr_lo[1], &tr_hi[1]);
+
+		input[0] = transition8(input[0], flows.trans,
+			&tr_lo[0], &tr_hi[0]);
+		input[1] = transition8(input[1], flows.trans,
+			&tr_lo[1], &tr_hi[1]);
+
+		 /* Check for any matches. */
+		acl_match_check_avx2x8(ctx, parms, &flows, 0,
+			&tr_lo[0], &tr_hi[0], ymm_match_mask.y);
+		acl_match_check_avx2x8(ctx, parms, &flows, 8,
+			&tr_lo[1], &tr_hi[1], ymm_match_mask.y);
+	}
+
+	return 0;
+}
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
index 4605b58..77b32b3 100644
--- a/lib/librte_acl/acl_run_sse.c
+++ b/lib/librte_acl/acl_run_sse.c
@@ -31,542 +31,7 @@
  *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
  */
 
-#include "acl_run.h"
-
-enum {
-	SHUFFLE32_SLOT1 = 0xe5,
-	SHUFFLE32_SLOT2 = 0xe6,
-	SHUFFLE32_SLOT3 = 0xe7,
-	SHUFFLE32_SWAP64 = 0x4e,
-};
-
-static const rte_xmm_t mm_shuffle_input = {
-	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
-};
-
-static const rte_xmm_t mm_shuffle_input64 = {
-	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
-static const rte_xmm_t mm_ones_16 = {
-	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
-};
-
-static const rte_xmm_t mm_match_mask = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-		RTE_ACL_NODE_MATCH,
-	},
-};
-
-static const rte_xmm_t mm_match_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		0,
-		RTE_ACL_NODE_MATCH,
-		0,
-	},
-};
-
-static const rte_xmm_t mm_index_mask = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-	},
-};
-
-static const rte_xmm_t mm_index_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		0,
-		0,
-	},
-};
-
-
-/*
- * Resolve priority for multiple results (sse version).
- * This consists comparing the priority of the current traversal with the
- * running set of results for the packet.
- * For each result, keep a running array of the result (rule number) and
- * its priority for each category.
- */
-static inline void
-resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
-	struct parms *parms, const struct rte_acl_match_results *p,
-	uint32_t categories)
-{
-	uint32_t x;
-	xmm_t results, priority, results1, priority1, selector;
-	xmm_t *saved_results, *saved_priority;
-
-	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
-
-		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
-		saved_priority =
-			(xmm_t *)(&parms[n].cmplt->priority[x]);
-
-		/* get results and priorities for completed trie */
-		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
-		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
-
-		/* if this is not the first completed trie */
-		if (parms[n].cmplt->count != ctx->num_tries) {
-
-			/* get running best results and their priorities */
-			results1 = MM_LOADU(saved_results);
-			priority1 = MM_LOADU(saved_priority);
-
-			/* select results that are highest priority */
-			selector = MM_CMPGT32(priority1, priority);
-			results = MM_BLENDV8(results, results1, selector);
-			priority = MM_BLENDV8(priority, priority1, selector);
-		}
-
-		/* save running best results and their priorities */
-		MM_STOREU(saved_results, results);
-		MM_STOREU(saved_priority, priority);
-	}
-}
-
-/*
- * Extract transitions from an XMM register and check for any matches
- */
-static void
-acl_process_matches(xmm_t *indices, int slot, const struct rte_acl_ctx *ctx,
-	struct parms *parms, struct acl_flow_data *flows)
-{
-	uint64_t transition1, transition2;
-
-	/* extract transition from low 64 bits. */
-	transition1 = MM_CVT64(*indices);
-
-	/* extract transition from high 64 bits. */
-	*indices = MM_SHUFFLE32(*indices, SHUFFLE32_SWAP64);
-	transition2 = MM_CVT64(*indices);
-
-	transition1 = acl_match_check(transition1, slot, ctx,
-		parms, flows, resolve_priority_sse);
-	transition2 = acl_match_check(transition2, slot + 1, ctx,
-		parms, flows, resolve_priority_sse);
-
-	/* update indices with new transitions. */
-	*indices = MM_SET64(transition2, transition1);
-}
-
-/*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indices, xmm_t match_mask)
-{
-	xmm_t temp;
-
-	temp = MM_AND(match_mask, *indices);
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indices, slot, ctx, parms, flows);
-		temp = MM_AND(match_mask, *indices);
-	}
-}
-
-/*
- * Check for any match in 4 transitions (contained in 2 SSE registers)
- */
-static inline void
-acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indices1, xmm_t *indices2,
-	xmm_t match_mask)
-{
-	xmm_t temp;
-
-	/* put low 32 bits of each transition into one register */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1, (__m128)*indices2,
-		0x88);
-	/* test for match node */
-	temp = MM_AND(match_mask, temp);
-
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indices1, slot, ctx, parms, flows);
-		acl_process_matches(indices2, slot + 2, ctx, parms, flows);
-
-		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1,
-					(__m128)*indices2,
-					0x88);
-		temp = MM_AND(match_mask, temp);
-	}
-}
-
-/*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline xmm_t
-acl_calc_addr(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t indices1, xmm_t indices2)
-{
-	xmm_t addr, node_types, range, temp;
-	xmm_t dfa_msk, dfa_ofs, quad_ofs;
-	xmm_t in, r, t;
-
-	const xmm_t range_base = _mm_set_epi32(0xffffff0c, 0xffffff08,
-		0xffffff04, 0xffffff00);
-
-	/*
-	 * Note that no transition is done for a match
-	 * node and therefore a stream freezes when
-	 * it reaches a match.
-	 */
-
-	/* Shuffle low 32 into temp and high 32 into indices2 */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0x88);
-	range = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0xdd);
-
-	t = MM_XOR(index_mask, index_mask);
-
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	in = MM_SHUFFLE8(next_input, shuffle_input);
-
-	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, temp);
-	addr = MM_AND(index_mask, temp);
-
-	/*
-	 * Calc addr for DFAs - addr = dfa_index + input_byte
-	 */
-
-	/* mask for DFA type (0) nodes */
-	dfa_msk = MM_CMPEQ32(node_types, t);
-
-	r = _mm_srli_epi32(in, 30);
-	r = _mm_add_epi8(r, range_base);
-
-	t = _mm_srli_epi32(in, 24);
-	r = _mm_shuffle_epi8(range, r);
-
-	dfa_ofs = _mm_sub_epi32(t, r);
-
-	/*
-	 * Calculate number of range boundaries that are less than the
-	 * input value. Range boundaries for each node are in signed 8 bit,
-	 * ordered from -128 to 127 in the indices2 register.
-	 * This is effectively a popcnt of bytes that are greater than the
-	 * input byte.
-	 */
-
-	/* check ranges */
-	temp = MM_CMPGT8(in, range);
-
-	/* convert -1 to 1 (bytes greater than input byte */
-	temp = MM_SIGN8(temp, temp);
-
-	/* horizontal add pairs of bytes into words */
-	temp = MM_MADD8(temp, temp);
-
-	/* horizontal add pairs of words into dwords */
-	quad_ofs = MM_MADD16(temp, ones_16);
-
-	/* mask to range type nodes */
-	temp = _mm_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
-
-	/* add index into node position */
-	return MM_ADD32(addr, temp);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
- */
-static inline xmm_t
-transition4(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, const uint64_t *trans,
-	xmm_t *indices1, xmm_t *indices2)
-{
-	xmm_t addr;
-	uint64_t trans0, trans2;
-
-	 /* Calculate the address (array index) for all 4 transitions. */
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		*indices1, *indices2);
-
-	 /* Gather 64 bit transitions and pack back into 2 registers. */
-
-	trans0 = trans[MM_CVT32(addr)];
-
-	/* get slot 2 */
-
-	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
-	trans2 = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-
-	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indices1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
-
-	/* get slot 3 */
-
-	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
-	*indices2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 8 traversals in parallel
- */
-static inline int
-search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE8];
-	struct completion cmplt[MAX_SEARCHES_SSE8];
-	struct parms parms[MAX_SEARCHES_SSE8];
-	xmm_t input0, input1;
-	xmm_t indices1, indices2, indices3, indices4;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	/*
-	 * indices1 contains index_array[0,1]
-	 * indices2 contains index_array[2,3]
-	 * indices3 contains index_array[4,5]
-	 * indices4 contains index_array[6,7]
-	 */
-
-	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	indices3 = MM_LOADU((xmm_t *) &index_array[4]);
-	indices4 = MM_LOADU((xmm_t *) &index_array[6]);
-
-	 /* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indices1, &indices2, mm_match_mask.x);
-	acl_match_check_x4(4, ctx, parms, &flows,
-		&indices3, &indices4, mm_match_mask.x);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input0 = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 0),
-			0);
-		input1 = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 4),
-			0);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
-
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
-
-		 /* Process the 4 bytes of input on each stream. */
-
-		input0 = transition4(mm_index_mask.x, input0,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		input1 = transition4(mm_index_mask.x, input1,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices3, &indices4);
-
-		input0 = transition4(mm_index_mask.x, input0,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		input1 = transition4(mm_index_mask.x, input1,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices3, &indices4);
-
-		input0 = transition4(mm_index_mask.x, input0,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		input1 = transition4(mm_index_mask.x, input1,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices3, &indices4);
-
-		input0 = transition4(mm_index_mask.x, input0,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		input1 = transition4(mm_index_mask.x, input1,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices3, &indices4);
-
-		 /* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indices1, &indices2, mm_match_mask.x);
-		acl_match_check_x4(4, ctx, parms, &flows,
-			&indices3, &indices4, mm_match_mask.x);
-	}
-
-	return 0;
-}
-
-/*
- * Execute trie traversal with 4 traversals in parallel
- */
-static inline int
-search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	 uint32_t *results, int total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE4];
-	struct completion cmplt[MAX_SEARCHES_SSE4];
-	struct parms parms[MAX_SEARCHES_SSE4];
-	xmm_t input, indices1, indices2;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
-
-	/* Check for any matches. */
-	acl_match_check_x4(0, ctx, parms, &flows,
-		&indices1, &indices2, mm_match_mask.x);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
-
-		/* Process the 4 bytes of input on each stream. */
-		input = transition4(mm_index_mask.x, input,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		 input = transition4(mm_index_mask.x, input,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		 input = transition4(mm_index_mask.x, input,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		 input = transition4(mm_index_mask.x, input,
-			mm_shuffle_input.x, mm_ones_16.x,
-			flows.trans, &indices1, &indices2);
-
-		/* Check for any matches. */
-		acl_match_check_x4(0, ctx, parms, &flows,
-			&indices1, &indices2, mm_match_mask.x);
-	}
-
-	return 0;
-}
-
-static inline xmm_t
-transition2(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, const uint64_t *trans, xmm_t *indices1)
-{
-	uint64_t t;
-	xmm_t addr, indices2;
-
-	indices2 = MM_XOR(ones_16, ones_16);
-
-	addr = acl_calc_addr(index_mask, next_input, shuffle_input, ones_16,
-		*indices1, indices2);
-
-	/* Gather 64 bit transitions and pack 2 per register. */
-
-	t = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indices1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
-	return MM_SRL32(next_input, 8);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline int
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE2];
-	struct completion cmplt[MAX_SEARCHES_SSE2];
-	struct parms parms[MAX_SEARCHES_SSE2];
-	xmm_t input, indices;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indices = MM_LOADU((xmm_t *) &index_array[0]);
-
-	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indices, mm_match_mask64.x);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = MM_INSERT32(mm_ones_16.x, GET_NEXT_4BYTES(parms, 0), 0);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
-		/* Process the 4 bytes of input on each stream. */
-
-		input = transition2(mm_index_mask64.x, input,
-			mm_shuffle_input64.x, mm_ones_16.x,
-			flows.trans, &indices);
-
-		input = transition2(mm_index_mask64.x, input,
-			mm_shuffle_input64.x, mm_ones_16.x,
-			flows.trans, &indices);
-
-		input = transition2(mm_index_mask64.x, input,
-			mm_shuffle_input64.x, mm_ones_16.x,
-			flows.trans, &indices);
-
-		input = transition2(mm_index_mask64.x, input,
-			mm_shuffle_input64.x, mm_ones_16.x,
-			flows.trans, &indices);
-
-		/* Check for any matches. */
-		acl_match_check_x2(0, ctx, parms, &flows, &indices,
-			mm_match_mask64.x);
-	}
-
-	return 0;
-}
+#include "acl_run_sse.h"
 
 int
 rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
diff --git a/lib/librte_acl/acl_run_sse.h b/lib/librte_acl/acl_run_sse.h
new file mode 100644
index 0000000..e33e16b
--- /dev/null
+++ b/lib/librte_acl/acl_run_sse.h
@@ -0,0 +1,533 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include "acl_run.h"
+#include "acl_vect.h"
+
+enum {
+	SHUFFLE32_SLOT1 = 0xe5,
+	SHUFFLE32_SLOT2 = 0xe6,
+	SHUFFLE32_SLOT3 = 0xe7,
+	SHUFFLE32_SWAP64 = 0x4e,
+};
+
+static const rte_xmm_t xmm_shuffle_input = {
+	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
+};
+
+static const rte_xmm_t xmm_shuffle_input64 = {
+	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
+};
+
+static const rte_xmm_t xmm_ones_16 = {
+	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
+};
+
+static const rte_xmm_t xmm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_xmm_t xmm_match_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		0,
+		RTE_ACL_NODE_MATCH,
+		0,
+	},
+};
+
+static const rte_xmm_t xmm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_xmm_t xmm_index_mask64 = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		0,
+		0,
+	},
+};
+
+
+/*
+ * Resolve priority for multiple results (sse version).
+ * This consists of comparing the priority of the current traversal with the
+ * running set of results for the packet.
+ * For each result, keep a running array of the result (rule number) and
+ * its priority for each category.
+ */
+static inline void
+resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
+	struct parms *parms, const struct rte_acl_match_results *p,
+	uint32_t categories)
+{
+	uint32_t x;
+	xmm_t results, priority, results1, priority1, selector;
+	xmm_t *saved_results, *saved_priority;
+
+	for (x = 0; x < categories; x += RTE_ACL_RESULTS_MULTIPLIER) {
+
+		saved_results = (xmm_t *)(&parms[n].cmplt->results[x]);
+		saved_priority =
+			(xmm_t *)(&parms[n].cmplt->priority[x]);
+
+		/* get results and priorities for completed trie */
+		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
+		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+
+		/* if this is not the first completed trie */
+		if (parms[n].cmplt->count != ctx->num_tries) {
+
+			/* get running best results and their priorities */
+			results1 = MM_LOADU(saved_results);
+			priority1 = MM_LOADU(saved_priority);
+
+			/* select results that are highest priority */
+			selector = MM_CMPGT32(priority1, priority);
+			results = MM_BLENDV8(results, results1, selector);
+			priority = MM_BLENDV8(priority, priority1, selector);
+		}
+
+		/* save running best results and their priorities */
+		MM_STOREU(saved_results, results);
+		MM_STOREU(saved_priority, priority);
+	}
+}
+
+/*
+ * Extract transitions from an XMM register and check for any matches
+ */
+static void
+acl_process_matches(xmm_t *indices, int slot, const struct rte_acl_ctx *ctx,
+	struct parms *parms, struct acl_flow_data *flows)
+{
+	uint64_t transition1, transition2;
+
+	/* extract transition from low 64 bits. */
+	transition1 = MM_CVT64(*indices);
+
+	/* extract transition from high 64 bits. */
+	*indices = MM_SHUFFLE32(*indices, SHUFFLE32_SWAP64);
+	transition2 = MM_CVT64(*indices);
+
+	transition1 = acl_match_check(transition1, slot, ctx,
+		parms, flows, resolve_priority_sse);
+	transition2 = acl_match_check(transition2, slot + 1, ctx,
+		parms, flows, resolve_priority_sse);
+
+	/* update indices with new transitions. */
+	*indices = MM_SET64(transition2, transition1);
+}
+
+/*
+ * Check for a match in 2 transitions (contained in SSE register)
+ */
+static inline __attribute__((always_inline)) void
+acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indices, xmm_t match_mask)
+{
+	xmm_t temp;
+
+	temp = MM_AND(match_mask, *indices);
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indices, slot, ctx, parms, flows);
+		temp = MM_AND(match_mask, *indices);
+	}
+}
+
+/*
+ * Check for any match in 4 transitions (contained in 2 SSE registers)
+ */
+static inline __attribute__((always_inline)) void
+acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
+	struct acl_flow_data *flows, xmm_t *indices1, xmm_t *indices2,
+	xmm_t match_mask)
+{
+	xmm_t temp;
+
+	/* put low 32 bits of each transition into one register */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1, (__m128)*indices2,
+		0x88);
+	/* test for match node */
+	temp = MM_AND(match_mask, temp);
+
+	while (!MM_TESTZ(temp, temp)) {
+		acl_process_matches(indices1, slot, ctx, parms, flows);
+		acl_process_matches(indices2, slot + 2, ctx, parms, flows);
+
+		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1,
+					(__m128)*indices2,
+					0x88);
+		temp = MM_AND(match_mask, temp);
+	}
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes don't move.
+ */
+static inline __attribute__((always_inline)) xmm_t
+calc_addr_sse(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
+	xmm_t ones_16, xmm_t indices1, xmm_t indices2)
+{
+	xmm_t addr, node_types, range, temp;
+	xmm_t dfa_msk, dfa_ofs, quad_ofs;
+	xmm_t in, r, t;
+
+	const xmm_t range_base = _mm_set_epi32(0xffffff0c, 0xffffff08,
+		0xffffff04, 0xffffff00);
+
+	/*
+	 * Note that no transition is done for a match
+	 * node and therefore a stream freezes when
+	 * it reaches a match.
+	 */
+
+	/* Shuffle low 32 into temp and high 32 into range */
+	temp = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0x88);
+	range = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0xdd);
+
+	t = MM_XOR(index_mask, index_mask);
+
+	/* shuffle input byte to all 4 positions of 32 bit value */
+	in = MM_SHUFFLE8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_types = MM_ANDNOT(index_mask, temp);
+	addr = MM_AND(index_mask, temp);
+
+	/*
+	 * Calc addr for DFAs - addr = dfa_index + input_byte
+	 */
+
+	/* mask for DFA type (0) nodes */
+	dfa_msk = MM_CMPEQ32(node_types, t);
+
+	r = _mm_srli_epi32(in, 30);
+	r = _mm_add_epi8(r, range_base);
+
+	t = _mm_srli_epi32(in, 24);
+	r = _mm_shuffle_epi8(range, r);
+
+	dfa_ofs = _mm_sub_epi32(t, r);
+
+	/*
+	 * Calculate the number of range boundaries that are less than the
+	 * input value. Range boundaries for each node are in signed 8 bit,
+	 * ordered from -128 to 127.
+	 * This is effectively a popcnt of bytes that are greater than the
+	 * input byte.
+	 */
+
+	/* check ranges */
+	temp = MM_CMPGT8(in, range);
+
+	/* convert -1 to 1 (bytes greater than input byte) */
+	temp = MM_SIGN8(temp, temp);
+
+	/* horizontal add pairs of bytes into words */
+	temp = MM_MADD8(temp, temp);
+
+	/* horizontal add pairs of words into dwords */
+	quad_ofs = MM_MADD16(temp, ones_16);
+
+	/* mask to range type nodes */
+	temp = _mm_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
+
+	/* add index into node position */
+	return MM_ADD32(addr, temp);
+}
+
+/*
+ * Process 4 transitions (in 2 SIMD registers) in parallel
+ */
+static inline __attribute__((always_inline)) xmm_t
+transition4(xmm_t next_input, const uint64_t *trans,
+	xmm_t *indices1, xmm_t *indices2)
+{
+	xmm_t addr;
+	uint64_t trans0, trans2;
+
+	 /* Calculate the address (array index) for all 4 transitions. */
+
+	addr = calc_addr_sse(xmm_index_mask.x, next_input, xmm_shuffle_input.x,
+		xmm_ones_16.x, *indices1, *indices2);
+
+	 /* Gather 64 bit transitions and pack back into 2 registers. */
+
+	trans0 = trans[MM_CVT32(addr)];
+
+	/* get slot 2 */
+
+	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
+	trans2 = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+
+	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indices1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+
+	/* get slot 3 */
+
+	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
+	*indices2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+
+	return MM_SRL32(next_input, CHAR_BIT);
+}
+
+/*
+ * Execute trie traversal with 8 traversals in parallel
+ */
+static inline int
+search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE8];
+	struct completion cmplt[MAX_SEARCHES_SSE8];
+	struct parms parms[MAX_SEARCHES_SSE8];
+	xmm_t input0, input1;
+	xmm_t indices1, indices2, indices3, indices4;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE8; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	/*
+	 * indices1 contains index_array[0,1]
+	 * indices2 contains index_array[2,3]
+	 * indices3 contains index_array[4,5]
+	 * indices4 contains index_array[6,7]
+	 */
+
+	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	indices3 = MM_LOADU((xmm_t *) &index_array[4]);
+	indices4 = MM_LOADU((xmm_t *) &index_array[6]);
+
+	 /* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indices1, &indices2, xmm_match_mask.x);
+	acl_match_check_x4(4, ctx, parms, &flows,
+		&indices3, &indices4, xmm_match_mask.x);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input0 = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 0));
+		input1 = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 4));
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+
+		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+
+		 /* Process the 4 bytes of input on each stream. */
+
+		input0 = transition4(input0, flows.trans,
+			&indices1, &indices2);
+		input1 = transition4(input1, flows.trans,
+			&indices3, &indices4);
+
+		input0 = transition4(input0, flows.trans,
+			&indices1, &indices2);
+		input1 = transition4(input1, flows.trans,
+			&indices3, &indices4);
+
+		input0 = transition4(input0, flows.trans,
+			&indices1, &indices2);
+		input1 = transition4(input1, flows.trans,
+			&indices3, &indices4);
+
+		input0 = transition4(input0, flows.trans,
+			&indices1, &indices2);
+		input1 = transition4(input1, flows.trans,
+			&indices3, &indices4);
+
+		 /* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indices1, &indices2, xmm_match_mask.x);
+		acl_match_check_x4(4, ctx, parms, &flows,
+			&indices3, &indices4, xmm_match_mask.x);
+	}
+
+	return 0;
+}
+
+/*
+ * Execute trie traversal with 4 traversals in parallel
+ */
+static inline int
+search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	 uint32_t *results, int total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE4];
+	struct completion cmplt[MAX_SEARCHES_SSE4];
+	struct parms parms[MAX_SEARCHES_SSE4];
+	xmm_t input, indices1, indices2;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE4; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
+	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
+
+	/* Check for any matches. */
+	acl_match_check_x4(0, ctx, parms, &flows,
+		&indices1, &indices2, xmm_match_mask.x);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 0));
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+
+		/* Process the 4 bytes of input on each stream. */
+		input = transition4(input, flows.trans, &indices1, &indices2);
+		input = transition4(input, flows.trans, &indices1, &indices2);
+		input = transition4(input, flows.trans, &indices1, &indices2);
+		input = transition4(input, flows.trans, &indices1, &indices2);
+
+		/* Check for any matches. */
+		acl_match_check_x4(0, ctx, parms, &flows,
+			&indices1, &indices2, xmm_match_mask.x);
+	}
+
+	return 0;
+}
+
+static inline __attribute__((always_inline)) xmm_t
+transition2(xmm_t next_input, const uint64_t *trans, xmm_t *indices1)
+{
+	uint64_t t;
+	xmm_t addr, indices2;
+
+	indices2 = _mm_setzero_si128();
+
+	addr = calc_addr_sse(xmm_index_mask.x, next_input, xmm_shuffle_input.x,
+		xmm_ones_16.x, *indices1, indices2);
+
+	/* Gather 64 bit transitions and pack 2 per register. */
+
+	t = trans[MM_CVT32(addr)];
+
+	/* get slot 1 */
+	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
+	*indices1 = MM_SET64(trans[MM_CVT32(addr)], t);
+
+	return MM_SRL32(next_input, CHAR_BIT);
+}
+
+/*
+ * Execute trie traversal with 2 traversals in parallel.
+ */
+static inline int
+search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	int n;
+	struct acl_flow_data flows;
+	uint64_t index_array[MAX_SEARCHES_SSE2];
+	struct completion cmplt[MAX_SEARCHES_SSE2];
+	struct parms parms[MAX_SEARCHES_SSE2];
+	xmm_t input, indices;
+
+	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
+		total_packets, categories, ctx->trans_table);
+
+	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
+		cmplt[n].count = 0;
+		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
+	}
+
+	indices = MM_LOADU((xmm_t *) &index_array[0]);
+
+	/* Check for any matches. */
+	acl_match_check_x2(0, ctx, parms, &flows, &indices,
+		xmm_match_mask64.x);
+
+	while (flows.started > 0) {
+
+		/* Gather 4 bytes of input data for each stream. */
+		input = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 0));
+		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
+
+		/* Process the 4 bytes of input on each stream. */
+
+		input = transition2(input, flows.trans, &indices);
+		input = transition2(input, flows.trans, &indices);
+		input = transition2(input, flows.trans, &indices);
+		input = transition2(input, flows.trans, &indices);
+
+		/* Check for any matches. */
+		acl_match_check_x2(0, ctx, parms, &flows, &indices,
+			xmm_match_mask64.x);
+	}
+
+	return 0;
+}
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index a16c4a4..a9cd349 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -38,10 +38,25 @@
 
 TAILQ_HEAD(rte_acl_list, rte_tailq_entry);
 
+/*
+ * If the compiler doesn't support AVX2 instructions,
+ * then this dummy one is used instead as the AVX2 classify method.
+ */
+int __attribute__ ((weak))
+rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
+	__rte_unused const uint8_t **data,
+	__rte_unused uint32_t *results,
+	__rte_unused uint32_t num,
+	__rte_unused uint32_t categories)
+{
+	return -ENOTSUP;
+}
+
 static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_DEFAULT] = rte_acl_classify_scalar,
 	[RTE_ACL_CLASSIFY_SCALAR] = rte_acl_classify_scalar,
 	[RTE_ACL_CLASSIFY_SSE] = rte_acl_classify_sse,
+	[RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
 };
 
 /* by default, use always available scalar code path. */
@@ -64,12 +79,24 @@ rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
 	return 0;
 }
 
+/*
+ * Select the highest available classify method as the default one.
+ * Note that CLASSIFY_AVX2 should be set as the default only
+ * if both conditions are met:
+ * the compiler supports AVX2 at build time and the target CPU supports AVX2.
+ */
 static void __attribute__((constructor))
 rte_acl_init(void)
 {
 	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
 
+#ifdef CC_AVX2_SUPPORT
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		alg = RTE_ACL_CLASSIFY_AVX2;
+	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+#else
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+#endif
 		alg = RTE_ACL_CLASSIFY_SSE;
 
 	rte_acl_set_default_classify(alg);
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index 0d913ee..652a234 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -265,6 +265,8 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_DEFAULT = 0,
 	RTE_ACL_CLASSIFY_SCALAR = 1,  /**< generic implementation. */
 	RTE_ACL_CLASSIFY_SSE = 2,     /**< requires SSE4.1 support. */
+	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
+	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
 };
 
 /**
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread
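
A note on the selection logic above: the constructor only picks the
process-wide default, and an application can still override the method
per context via rte_acl_set_ctx_classify(). A minimal sketch of such an
override (a hypothetical helper, not part of the patch; error handling
elided) that mirrors the same CPU-flag check:

	#include <rte_acl.h>
	#include <rte_cpuflags.h>

	/* Hypothetical helper: prefer AVX2 when the CPU supports it,
	 * otherwise fall back to the SSE method. */
	static void
	select_classify_method(struct rte_acl_ctx *acx)
	{
		enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_SSE;

		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
			alg = RTE_ACL_CLASSIFY_AVX2;

		rte_acl_set_ctx_classify(acx, alg);
	}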

* [dpdk-dev] [PATCH v3 12/18] test-acl: add ability to manually select RT method.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (10 preceding siblings ...)
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 11/18] librte_acl: add AVX2 as new rte_acl_classify() method Konstantin Ananyev
@ 2015-01-20 18:41 ` Konstantin Ananyev
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 13/18] librte_acl: Remove search_sse_2 and relatives Konstantin Ananyev
                   ` (8 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:41 UTC (permalink / raw)
  To: dev

In test-acl replace the command-line option "--scalar" with a new one:
"--alg=scalar|sse|avx2".
It allows the user to manually select the preferred classify() method.
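
For instance, forcing the new AVX2 code-path would then be done by
appending --alg=avx2 to the usual test-acl command line (other options
elided):

	./test-acl ... --alg=avx2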

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 93 ++++++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 75 insertions(+), 18 deletions(-)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index b3d4294..52f43c6 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -82,7 +82,7 @@
 #define	OPT_RULE_NUM		"rulenum"
 #define	OPT_TRACE_NUM		"tracenum"
 #define	OPT_TRACE_STEP		"tracestep"
-#define	OPT_SEARCH_SCALAR	"scalar"
+#define	OPT_SEARCH_ALG		"alg"
 #define	OPT_BLD_CATEGORIES	"bldcat"
 #define	OPT_RUN_CATEGORIES	"runcat"
 #define	OPT_ITER_NUM		"iter"
@@ -102,6 +102,26 @@ enum {
 	DUMP_MAX
 };
 
+struct acl_alg {
+	const char *name;
+	enum rte_acl_classify_alg alg;
+};
+
+static const struct acl_alg acl_alg[] = {
+	{
+		.name = "scalar",
+		.alg = RTE_ACL_CLASSIFY_SCALAR,
+	},
+	{
+		.name = "sse",
+		.alg = RTE_ACL_CLASSIFY_SSE,
+	},
+	{
+		.name = "avx2",
+		.alg = RTE_ACL_CLASSIFY_AVX2,
+	},
+};
+
 static struct {
 	const char         *prgname;
 	const char         *rule_file;
@@ -114,11 +134,11 @@ static struct {
 	uint32_t            trace_sz;
 	uint32_t            iter_num;
 	uint32_t            verbose;
-	uint32_t            scalar;
+	uint32_t            ipv6;
+	struct acl_alg      alg;
 	uint32_t            used_traces;
 	void               *traces;
 	struct rte_acl_ctx *acx;
-	uint32_t			ipv6;
 } config = {
 	.bld_categories = 3,
 	.run_categories = 1,
@@ -127,6 +147,10 @@ static struct {
 	.trace_step = TRACE_STEP_DEF,
 	.iter_num = 1,
 	.verbose = DUMP_MAX,
+	.alg = {
+		.name = "default",
+		.alg = RTE_ACL_CLASSIFY_DEFAULT,
+	},
 	.ipv6 = 0
 };
 
@@ -774,13 +798,12 @@ acx_init(void)
 	if (config.acx == NULL)
 		rte_exit(rte_errno, "failed to create ACL context\n");
 
-	/* set default classify method to scalar for this context. */
-	if (config.scalar) {
-		ret = rte_acl_set_ctx_classify(config.acx,
-			RTE_ACL_CLASSIFY_SCALAR);
+	/* set default classify method for this context. */
+	if (config.alg.alg != RTE_ACL_CLASSIFY_DEFAULT) {
+		ret = rte_acl_set_ctx_classify(config.acx, config.alg.alg);
 		if (ret != 0)
-			rte_exit(ret, "failed to setup classify method "
-				"for ACL context\n");
+			rte_exit(ret, "failed to setup %s method "
+				"for ACL context\n", config.alg.name);
 	}
 
 	/* add ACL rules. */
@@ -809,7 +832,7 @@ acx_init(void)
 }
 
 static uint32_t
-search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
+search_ip5tuples_once(uint32_t categories, uint32_t step, const char *alg)
 {
 	int ret;
 	uint32_t i, j, k, n, r;
@@ -847,7 +870,7 @@ search_ip5tuples_once(uint32_t categories, uint32_t step, int scalar)
 
 	dump_verbose(DUMP_SEARCH, stdout,
 		"%s(%u, %u, %s) returns %u\n", __func__,
-		categories, step, scalar != 0 ? "scalar" : "sse", i);
+		categories, step, alg, i);
 	return i;
 }
 
@@ -863,7 +886,7 @@ search_ip5tuples(__attribute__((unused)) void *arg)
 
 	for (i = 0; i != config.iter_num; i++) {
 		pkt += search_ip5tuples_once(config.run_categories,
-			config.trace_step, config.scalar);
+			config.trace_step, config.alg.name);
 	}
 
 	tm = rte_rdtsc() - start;
@@ -891,8 +914,40 @@ get_uint32_opt(const char *opt, const char *name, uint32_t min, uint32_t max)
 }
 
 static void
+get_alg_opt(const char *opt, const char *name)
+{
+	uint32_t i;
+
+	for (i = 0; i != RTE_DIM(acl_alg); i++) {
+		if (strcmp(opt, acl_alg[i].name) == 0) {
+			config.alg = acl_alg[i];
+			return;
+		}
+	}
+
+	rte_exit(-EINVAL, "invalid value: \"%s\" for option: %s\n",
+		opt, name);
+}
+
+static void
 print_usage(const char *prgname)
 {
+	uint32_t i, n, rc;
+	char buf[PATH_MAX];
+
+	n = 0;
+	buf[0] = 0;
+
+	for (i = 0; i < RTE_DIM(acl_alg) - 1; i++) {
+		rc = snprintf(buf + n, sizeof(buf) - n, "%s|",
+			acl_alg[i].name);
+		if (rc > sizeof(buf) - n)
+			break;
+		n += rc;
+	}
+
+	snprintf(buf + n, sizeof(buf) - n, "%s", acl_alg[i].name);
+
 	fprintf(stdout,
 		PRINT_USAGE_START
 		"--" OPT_RULE_FILE "=<rules set file>\n"
@@ -911,10 +966,11 @@ print_usage(const char *prgname)
 			"but not greater then %u]\n"
 		"[--" OPT_ITER_NUM "=<number of iterations to perform>]\n"
 		"[--" OPT_VERBOSE "=<verbose level>]\n"
-		"[--" OPT_SEARCH_SCALAR "=<use scalar version>]\n"
+		"[--" OPT_SEARCH_ALG "=%s]\n"
 		"[--" OPT_IPV6 "=<IPv6 rules and trace files>]\n",
 		prgname, RTE_ACL_RESULTS_MULTIPLIER,
-		(uint32_t)RTE_ACL_MAX_CATEGORIES);
+		(uint32_t)RTE_ACL_MAX_CATEGORIES,
+		buf);
 }
 
 static void
@@ -930,7 +986,8 @@ dump_config(FILE *f)
 	fprintf(f, "%s:%u\n", OPT_RUN_CATEGORIES, config.run_categories);
 	fprintf(f, "%s:%u\n", OPT_ITER_NUM, config.iter_num);
 	fprintf(f, "%s:%u\n", OPT_VERBOSE, config.verbose);
-	fprintf(f, "%s:%u\n", OPT_SEARCH_SCALAR, config.scalar);
+	fprintf(f, "%s:%u(%s)\n", OPT_SEARCH_ALG, config.alg.alg,
+		config.alg.name);
 	fprintf(f, "%s:%u\n", OPT_IPV6, config.ipv6);
 }
 
@@ -958,7 +1015,7 @@ get_input_opts(int argc, char **argv)
 		{OPT_RUN_CATEGORIES, 1, 0, 0},
 		{OPT_ITER_NUM, 1, 0, 0},
 		{OPT_VERBOSE, 1, 0, 0},
-		{OPT_SEARCH_SCALAR, 0, 0, 0},
+		{OPT_SEARCH_ALG, 1, 0, 0},
 		{OPT_IPV6, 0, 0, 0},
 		{NULL, 0, 0, 0}
 	};
@@ -1002,8 +1059,8 @@ get_input_opts(int argc, char **argv)
 			config.verbose = get_uint32_opt(optarg,
 				lgopts[opt_idx].name, DUMP_NONE, DUMP_MAX);
 		} else if (strcmp(lgopts[opt_idx].name,
-				OPT_SEARCH_SCALAR) == 0) {
-			config.scalar = 1;
+				OPT_SEARCH_ALG) == 0) {
+			get_alg_opt(optarg, lgopts[opt_idx].name);
 		} else if (strcmp(lgopts[opt_idx].name, OPT_IPV6) == 0) {
 			config.ipv6 = 1;
 		}
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 13/18] librte_acl: Remove search_sse_2 and relatives.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (11 preceding siblings ...)
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 12/18] test-acl: add ability to manually select RT method Konstantin Ananyev
@ 2015-01-20 18:41 ` Konstantin Ananyev
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 14/18] libter_acl: move lo/hi dwords shuffle out from calc_addr Konstantin Ananyev
                   ` (7 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:41 UTC (permalink / raw)
  To: dev

Previous improvements made the scalar method the fastest one
for a tiny bunch of packets (< 4).
That allows us to remove the specific vector code-path for small numbers
of packets (search_sse_2)
and always use the scalar method in such cases.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_run_avx2.c |   2 +-
 lib/librte_acl/acl_run_sse.c  |   3 +-
 lib/librte_acl/acl_run_sse.h  | 110 ------------------------------------------
 3 files changed, 3 insertions(+), 112 deletions(-)

diff --git a/lib/librte_acl/acl_run_avx2.c b/lib/librte_acl/acl_run_avx2.c
index 0a42f72..79ebbd6 100644
--- a/lib/librte_acl/acl_run_avx2.c
+++ b/lib/librte_acl/acl_run_avx2.c
@@ -49,6 +49,6 @@ rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	else if (num >= MAX_SEARCHES_SSE4)
 		return search_sse_4(ctx, data, results, num, categories);
 	else
-		return search_sse_2(ctx, data, results, num,
+		return rte_acl_classify_scalar(ctx, data, results, num,
 			categories);
 }
diff --git a/lib/librte_acl/acl_run_sse.c b/lib/librte_acl/acl_run_sse.c
index 77b32b3..a5a7d36 100644
--- a/lib/librte_acl/acl_run_sse.c
+++ b/lib/librte_acl/acl_run_sse.c
@@ -42,5 +42,6 @@ rte_acl_classify_sse(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	else if (num >= MAX_SEARCHES_SSE4)
 		return search_sse_4(ctx, data, results, num, categories);
 	else
-		return search_sse_2(ctx, data, results, num, categories);
+		return rte_acl_classify_scalar(ctx, data, results, num,
+			categories);
 }
diff --git a/lib/librte_acl/acl_run_sse.h b/lib/librte_acl/acl_run_sse.h
index e33e16b..1b7870e 100644
--- a/lib/librte_acl/acl_run_sse.h
+++ b/lib/librte_acl/acl_run_sse.h
@@ -45,10 +45,6 @@ static const rte_xmm_t xmm_shuffle_input = {
 	.u32 = {0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c},
 };
 
-static const rte_xmm_t xmm_shuffle_input64 = {
-	.u32 = {0x00000000, 0x04040404, 0x80808080, 0x80808080},
-};
-
 static const rte_xmm_t xmm_ones_16 = {
 	.u16 = {1, 1, 1, 1, 1, 1, 1, 1},
 };
@@ -62,15 +58,6 @@ static const rte_xmm_t xmm_match_mask = {
 	},
 };
 
-static const rte_xmm_t xmm_match_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_MATCH,
-		0,
-		RTE_ACL_NODE_MATCH,
-		0,
-	},
-};
-
 static const rte_xmm_t xmm_index_mask = {
 	.u32 = {
 		RTE_ACL_NODE_INDEX,
@@ -80,16 +67,6 @@ static const rte_xmm_t xmm_index_mask = {
 	},
 };
 
-static const rte_xmm_t xmm_index_mask64 = {
-	.u32 = {
-		RTE_ACL_NODE_INDEX,
-		RTE_ACL_NODE_INDEX,
-		0,
-		0,
-	},
-};
-
-
 /*
  * Resolve priority for multiple results (sse version).
  * This consists of comparing the priority of the current traversal with the
@@ -161,22 +138,6 @@ acl_process_matches(xmm_t *indices, int slot, const struct rte_acl_ctx *ctx,
 }
 
 /*
- * Check for a match in 2 transitions (contained in SSE register)
- */
-static inline __attribute__((always_inline)) void
-acl_match_check_x2(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
-	struct acl_flow_data *flows, xmm_t *indices, xmm_t match_mask)
-{
-	xmm_t temp;
-
-	temp = MM_AND(match_mask, *indices);
-	while (!MM_TESTZ(temp, temp)) {
-		acl_process_matches(indices, slot, ctx, parms, flows);
-		temp = MM_AND(match_mask, *indices);
-	}
-}
-
-/*
  * Check for any match in 4 transitions (contained in 2 SSE registers)
  */
 static inline __attribute__((always_inline)) void
@@ -460,74 +421,3 @@ search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	return 0;
 }
-
-static inline __attribute__((always_inline)) xmm_t
-transition2(xmm_t next_input, const uint64_t *trans, xmm_t *indices1)
-{
-	uint64_t t;
-	xmm_t addr, indices2;
-
-	indices2 = _mm_setzero_si128();
-
-	addr = calc_addr_sse(xmm_index_mask.x, next_input, xmm_shuffle_input.x,
-		xmm_ones_16.x, *indices1, indices2);
-
-	/* Gather 64 bit transitions and pack 2 per register. */
-
-	t = trans[MM_CVT32(addr)];
-
-	/* get slot 1 */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indices1 = MM_SET64(trans[MM_CVT32(addr)], t);
-
-	return MM_SRL32(next_input, CHAR_BIT);
-}
-
-/*
- * Execute trie traversal with 2 traversals in parallel.
- */
-static inline int
-search_sse_2(const struct rte_acl_ctx *ctx, const uint8_t **data,
-	uint32_t *results, uint32_t total_packets, uint32_t categories)
-{
-	int n;
-	struct acl_flow_data flows;
-	uint64_t index_array[MAX_SEARCHES_SSE2];
-	struct completion cmplt[MAX_SEARCHES_SSE2];
-	struct parms parms[MAX_SEARCHES_SSE2];
-	xmm_t input, indices;
-
-	acl_set_flow(&flows, cmplt, RTE_DIM(cmplt), data, results,
-		total_packets, categories, ctx->trans_table);
-
-	for (n = 0; n < MAX_SEARCHES_SSE2; n++) {
-		cmplt[n].count = 0;
-		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
-	}
-
-	indices = MM_LOADU((xmm_t *) &index_array[0]);
-
-	/* Check for any matches. */
-	acl_match_check_x2(0, ctx, parms, &flows, &indices,
-		xmm_match_mask64.x);
-
-	while (flows.started > 0) {
-
-		/* Gather 4 bytes of input data for each stream. */
-		input = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 0));
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-
-		/* Process the 4 bytes of input on each stream. */
-
-		input = transition2(input, flows.trans, &indices);
-		input = transition2(input, flows.trans, &indices);
-		input = transition2(input, flows.trans, &indices);
-		input = transition2(input, flows.trans, &indices);
-
-		/* Check for any matches. */
-		acl_match_check_x2(0, ctx, parms, &flows, &indices,
-			xmm_match_mask64.x);
-	}
-
-	return 0;
-}
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 14/18] libter_acl: move lo/hi dwords shuffle out from calc_addr
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (12 preceding siblings ...)
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 13/18] librte_acl: Remove search_sse_2 and relatives Konstantin Ananyev
@ 2015-01-20 18:41 ` Konstantin Ananyev
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 15/18] libte_acl: make calc_addr a define to deduplicate the code Konstantin Ananyev
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:41 UTC (permalink / raw)
  To: dev

Reorganise the SSE code-path a bit by moving the lo/hi dwords shuffle
out from calc_addr().
That allows us to make calc_addr() for SSE and AVX2 practically identical
and opens an opportunity for further code deduplication.
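
For reference, the lo/hi split relies on the immediate operand of
_mm_shuffle_ps(): 0x88 selects dwords {0, 2} from each source (the low
halves of the two 64-bit transitions per register), while 0xdd selects
dwords {1, 3} (the high halves). A standalone sketch of that split,
separate from the patch itself:

	#include <xmmintrin.h>	/* _mm_shuffle_ps */
	#include <emmintrin.h>	/* __m128i */

	/* in1 = {lo0, hi0, lo1, hi1}, in2 = {lo2, hi2, lo3, hi3} */
	static inline void
	split_hilo(__m128i in1, __m128i in2, __m128i *tr_lo, __m128i *tr_hi)
	{
		/* 0x88 = _MM_SHUFFLE(2,0,2,0): tr_lo = {lo0, lo1, lo2, lo3} */
		*tr_lo = (__m128i)_mm_shuffle_ps((__m128)in1, (__m128)in2, 0x88);
		/* 0xdd = _MM_SHUFFLE(3,1,3,1): tr_hi = {hi0, hi1, hi2, hi3} */
		*tr_hi = (__m128i)_mm_shuffle_ps((__m128)in1, (__m128)in2, 0xdd);
	}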

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_run_sse.h | 38 ++++++++++++++++++++------------------
 1 file changed, 20 insertions(+), 18 deletions(-)

diff --git a/lib/librte_acl/acl_run_sse.h b/lib/librte_acl/acl_run_sse.h
index 1b7870e..4a174e9 100644
--- a/lib/librte_acl/acl_run_sse.h
+++ b/lib/librte_acl/acl_run_sse.h
@@ -172,9 +172,9 @@ acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
  */
 static inline __attribute__((always_inline)) xmm_t
 calc_addr_sse(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t indices1, xmm_t indices2)
+	xmm_t ones_16, xmm_t tr_lo, xmm_t tr_hi)
 {
-	xmm_t addr, node_types, range, temp;
+	xmm_t addr, node_types;
 	xmm_t dfa_msk, dfa_ofs, quad_ofs;
 	xmm_t in, r, t;
 
@@ -187,18 +187,14 @@ calc_addr_sse(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
 	 * it reaches a match.
 	 */
 
-	/* Shuffle low 32 into temp and high 32 into range */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0x88);
-	range = (xmm_t)MM_SHUFFLEPS((__m128)indices1, (__m128)indices2, 0xdd);
-
 	t = MM_XOR(index_mask, index_mask);
 
 	/* shuffle input byte to all 4 positions of 32 bit value */
 	in = MM_SHUFFLE8(next_input, shuffle_input);
 
 	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, temp);
-	addr = MM_AND(index_mask, temp);
+	node_types = MM_ANDNOT(index_mask, tr_lo);
+	addr = MM_AND(index_mask, tr_lo);
 
 	/*
 	 * Calc addr for DFAs - addr = dfa_index + input_byte
@@ -211,7 +207,7 @@ calc_addr_sse(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
 	r = _mm_add_epi8(r, range_base);
 
 	t = _mm_srli_epi32(in, 24);
-	r = _mm_shuffle_epi8(range, r);
+	r = _mm_shuffle_epi8(tr_hi, r);
 
 	dfa_ofs = _mm_sub_epi32(t, r);
 
@@ -224,22 +220,22 @@ calc_addr_sse(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
 	 */
 
 	/* check ranges */
-	temp = MM_CMPGT8(in, range);
+	t = MM_CMPGT8(in, tr_hi);
 
 	/* convert -1 to 1 (bytes greater than input byte) */
-	temp = MM_SIGN8(temp, temp);
+	t = MM_SIGN8(t, t);
 
 	/* horizontal add pairs of bytes into words */
-	temp = MM_MADD8(temp, temp);
+	t = MM_MADD8(t, t);
 
 	/* horizontal add pairs of words into dwords */
-	quad_ofs = MM_MADD16(temp, ones_16);
+	quad_ofs = MM_MADD16(t, ones_16);
 
-	/* mask to range type nodes */
-	temp = _mm_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
 
 	/* add index into node position */
-	return MM_ADD32(addr, temp);
+	return MM_ADD32(addr, t);
 }
 
 /*
@@ -249,13 +245,19 @@ static inline __attribute__((always_inline)) xmm_t
 transition4(xmm_t next_input, const uint64_t *trans,
 	xmm_t *indices1, xmm_t *indices2)
 {
-	xmm_t addr;
+	xmm_t addr, tr_lo, tr_hi;
 	uint64_t trans0, trans2;
 
+	/* Shuffle low 32 into tr_lo and high 32 into tr_hi */
+	tr_lo = (xmm_t)_mm_shuffle_ps((__m128)*indices1, (__m128)*indices2,
+		0x88);
+	tr_hi = (xmm_t)_mm_shuffle_ps((__m128)*indices1, (__m128)*indices2,
+		0xdd);
+
 	 /* Calculate the address (array index) for all 4 transitions. */
 
 	addr = calc_addr_sse(xmm_index_mask.x, next_input, xmm_shuffle_input.x,
-		xmm_ones_16.x, *indices1, *indices2);
+		xmm_ones_16.x, tr_lo, tr_hi);
 
 	 /* Gather 64 bit transitions and pack back into 2 registers. */
 
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 15/18] libte_acl: make calc_addr a define to deduplicate the code.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (13 preceding siblings ...)
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 14/18] libter_acl: move lo/hi dwords shuffle out from calc_addr Konstantin Ananyev
@ 2015-01-20 18:41 ` Konstantin Ananyev
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 16/18] libte_acl: introduce max_size into rte_acl_config Konstantin Ananyev
                   ` (5 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:41 UTC (permalink / raw)
  To: dev

Vector code reorganisation/deduplication:
To avoid maintaining two nearly identical implementations of calc_addr()
(one for SSE, another for AVX2), replace it with a new macro that suits
both SSE and AVX2 code-paths.
Also remove the MM_* macros that are no longer needed.
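
To make the token-pasting concrete: instantiating the macros with P=mm
(and S=128 for ACL_TR_CALC_ADDR) emits the SSE intrinsics, while
P=mm256/S=256 emits the AVX2 ones. For example,
ACL_TR_HILO(mm, __m128, t0, t1, lo, hi) expands to roughly:

	lo = (typeof(lo))_mm_shuffle_ps((__m128)(t0), (__m128)(t1), 0x88);
	hi = (typeof(hi))_mm_shuffle_ps((__m128)(t0), (__m128)(t1), 0xdd);

and ACL_TR_HILO(mm256, __m256, ...) to the _mm256_shuffle_ps equivalents.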

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_run_avx2.h                   |  87 +++++-------
 lib/librte_acl/acl_run_sse.h                    | 178 ++++++++----------------
 lib/librte_acl/acl_vect.h                       | 132 ++++++++----------
 lib/librte_eal/common/include/rte_common_vect.h |  12 ++
 4 files changed, 160 insertions(+), 249 deletions(-)

diff --git a/lib/librte_acl/acl_run_avx2.h b/lib/librte_acl/acl_run_avx2.h
index 1688c50..b01a46a 100644
--- a/lib/librte_acl/acl_run_avx2.h
+++ b/lib/librte_acl/acl_run_avx2.h
@@ -73,51 +73,19 @@ static const rte_ymm_t ymm_ones_16 = {
 	},
 };
 
-static inline __attribute__((always_inline)) ymm_t
-calc_addr_avx2(ymm_t index_mask, ymm_t next_input, ymm_t shuffle_input,
-	ymm_t ones_16, ymm_t tr_lo, ymm_t tr_hi)
-{
-	ymm_t in, node_type, r, t;
-	ymm_t dfa_msk, dfa_ofs, quad_ofs;
-	ymm_t addr;
-
-	const ymm_t range_base = _mm256_set_epi32(
-		0xffffff0c, 0xffffff08, 0xffffff04, 0xffffff00,
-		0xffffff0c, 0xffffff08, 0xffffff04, 0xffffff00);
-
-	t = _mm256_xor_si256(index_mask, index_mask);
-	in = _mm256_shuffle_epi8(next_input, shuffle_input);
-
-	/* Calc node type and node addr */
-	node_type = _mm256_andnot_si256(index_mask, tr_lo);
-	addr = _mm256_and_si256(index_mask, tr_lo);
-
-	/* DFA calculations. */
-
-	dfa_msk = _mm256_cmpeq_epi32(node_type, t);
-
-	r = _mm256_srli_epi32(in, 30);
-	r = _mm256_add_epi8(r, range_base);
-
-	t = _mm256_srli_epi32(in, 24);
-	r = _mm256_shuffle_epi8(tr_hi, r);
-
-	dfa_ofs = _mm256_sub_epi32(t, r);
-
-	/* QUAD/SINGLE caluclations. */
-
-	t = _mm256_cmpgt_epi8(in, tr_hi);
-	t = _mm256_sign_epi8(t, t);
-	t = _mm256_maddubs_epi16(t, t);
-	quad_ofs = _mm256_madd_epi16(t, ones_16);
-
-	/* blend DFA and QUAD/SINGLE. */
-	t = _mm256_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
-
-	addr = _mm256_add_epi32(addr, t);
-	return addr;
-}
+static const rte_ymm_t ymm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
 
+/*
+ * Process 8 transitions in parallel.
+ * tr_lo contains low 32 bits for 8 transitions.
+ * tr_hi contains high 32 bits for 8 transitions.
+ * next_input contains up to 4 input bytes for 8 flows.
+ */
 static inline __attribute__((always_inline)) ymm_t
 transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi)
 {
@@ -126,8 +94,10 @@ transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi)
 
 	tr = (const int32_t *)(uintptr_t)trans;
 
-	addr = calc_addr_avx2(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
-		ymm_ones_16.y, *tr_lo, *tr_hi);
+	/* Calculate the address (array index) for all 8 transitions. */
+	ACL_TR_CALC_ADDR(mm256, 256, addr, ymm_index_mask.y, next_input,
+		ymm_shuffle_input.y, ymm_ones_16.y, ymm_range_base.y,
+		*tr_lo, *tr_hi);
 
 	/* load lower 32 bits of 8 transactions at once. */
 	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
@@ -140,6 +110,11 @@ transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi)
 	return next_input;
 }
 
+/*
+ * Process matches for 8 flows.
+ * tr_lo contains low 32 bits for 8 transitions.
+ * tr_hi contains high 32 bits for 8 transitions.
+ */
 static inline void
 acl_process_matches_avx2x8(const struct rte_acl_ctx *ctx,
 	struct parms *parms, struct acl_flow_data *flows, uint32_t slot,
@@ -155,6 +130,11 @@ acl_process_matches_avx2x8(const struct rte_acl_ctx *ctx,
 	l0 = _mm256_castsi256_si128(*tr_lo);
 
 	for (i = 0; i != RTE_DIM(tr) / 2; i++) {
+
+		/*
+		 * Extract the low 32 bits of each transition.
+		 * That's enough to process the match.
+		 */
 		tr[i] = (uint32_t)_mm_cvtsi128_si32(l0);
 		tr[i + 4] = (uint32_t)_mm_cvtsi128_si32(l1);
 
@@ -167,12 +147,14 @@ acl_process_matches_avx2x8(const struct rte_acl_ctx *ctx,
 			ctx, parms, flows, resolve_priority_sse);
 	}
 
+	/* Collect new transitions into 2 YMM registers. */
 	t0 = _mm256_set_epi64x(tr[5], tr[4], tr[1], tr[0]);
 	t1 = _mm256_set_epi64x(tr[7], tr[6], tr[3], tr[2]);
 
-	lo = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0x88);
-	hi = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0xdd);
+	/* For each transition: put low 32 into tr_lo and high 32 into tr_hi */
+	ACL_TR_HILO(mm256, __m256, t0, t1, lo, hi);
 
+	/* Keep transitions with NOMATCH intact. */
 	*tr_lo = _mm256_blendv_epi8(*tr_lo, lo, matches);
 	*tr_hi = _mm256_blendv_epi8(*tr_hi, hi, matches);
 }
@@ -200,6 +182,9 @@ acl_match_check_avx2x8(const struct rte_acl_ctx *ctx, struct parms *parms,
 	}
 }
 
+/*
+ * Execute trie traversal for up to 16 flows in parallel.
+ */
 static inline int
 search_avx2x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t total_packets, uint32_t categories)
@@ -225,16 +210,14 @@ search_avx2x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	t1 = _mm256_set_epi64x(index_array[7], index_array[6],
 		index_array[3], index_array[2]);
 
-	tr_lo[0] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0x88);
-	tr_hi[0] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0xdd);
+	ACL_TR_HILO(mm256, __m256, t0, t1, tr_lo[0], tr_hi[0]);
 
 	t0 = _mm256_set_epi64x(index_array[13], index_array[12],
 		index_array[9], index_array[8]);
 	t1 = _mm256_set_epi64x(index_array[15], index_array[14],
 		index_array[11], index_array[10]);
 
-	tr_lo[1] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0x88);
-	tr_hi[1] = (ymm_t)_mm256_shuffle_ps((__m256)t0, (__m256)t1, 0xdd);
+	ACL_TR_HILO(mm256, __m256, t0, t1, tr_lo[1], tr_hi[1]);
 
 	 /* Check for any matches. */
 	acl_match_check_avx2x8(ctx, parms, &flows, 0, &tr_lo[0], &tr_hi[0],
diff --git a/lib/librte_acl/acl_run_sse.h b/lib/librte_acl/acl_run_sse.h
index 4a174e9..ad40a67 100644
--- a/lib/librte_acl/acl_run_sse.h
+++ b/lib/librte_acl/acl_run_sse.h
@@ -67,6 +67,12 @@ static const rte_xmm_t xmm_index_mask = {
 	},
 };
 
+static const rte_xmm_t xmm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
 /*
  * Resolve priority for multiple results (sse version).
  * This consists of comparing the priority of the current traversal with the
@@ -90,25 +96,28 @@ resolve_priority_sse(uint64_t transition, int n, const struct rte_acl_ctx *ctx,
 			(xmm_t *)(&parms[n].cmplt->priority[x]);
 
 		/* get results and priorities for completed trie */
-		results = MM_LOADU((const xmm_t *)&p[transition].results[x]);
-		priority = MM_LOADU((const xmm_t *)&p[transition].priority[x]);
+		results = _mm_loadu_si128(
+			(const xmm_t *)&p[transition].results[x]);
+		priority = _mm_loadu_si128(
+			(const xmm_t *)&p[transition].priority[x]);
 
 		/* if this is not the first completed trie */
 		if (parms[n].cmplt->count != ctx->num_tries) {
 
 			/* get running best results and their priorities */
-			results1 = MM_LOADU(saved_results);
-			priority1 = MM_LOADU(saved_priority);
+			results1 = _mm_loadu_si128(saved_results);
+			priority1 = _mm_loadu_si128(saved_priority);
 
 			/* select results that are highest priority */
-			selector = MM_CMPGT32(priority1, priority);
-			results = MM_BLENDV8(results, results1, selector);
-			priority = MM_BLENDV8(priority, priority1, selector);
+			selector = _mm_cmpgt_epi32(priority1, priority);
+			results = _mm_blendv_epi8(results, results1, selector);
+			priority = _mm_blendv_epi8(priority, priority1,
+				selector);
 		}
 
 		/* save running best results and their priorities */
-		MM_STOREU(saved_results, results);
-		MM_STOREU(saved_priority, priority);
+		_mm_storeu_si128(saved_results, results);
+		_mm_storeu_si128(saved_priority, priority);
 	}
 }
 
@@ -122,11 +131,11 @@ acl_process_matches(xmm_t *indices, int slot, const struct rte_acl_ctx *ctx,
 	uint64_t transition1, transition2;
 
 	/* extract transition from low 64 bits. */
-	transition1 = MM_CVT64(*indices);
+	transition1 = _mm_cvtsi128_si64(*indices);
 
 	/* extract transition from high 64 bits. */
-	*indices = MM_SHUFFLE32(*indices, SHUFFLE32_SWAP64);
-	transition2 = MM_CVT64(*indices);
+	*indices = _mm_shuffle_epi32(*indices, SHUFFLE32_SWAP64);
+	transition2 = _mm_cvtsi128_si64(*indices);
 
 	transition1 = acl_match_check(transition1, slot, ctx,
 		parms, flows, resolve_priority_sse);
@@ -134,7 +143,7 @@ acl_process_matches(xmm_t *indices, int slot, const struct rte_acl_ctx *ctx,
 		parms, flows, resolve_priority_sse);
 
 	/* update indices with new transitions. */
-	*indices = MM_SET64(transition2, transition1);
+	*indices = _mm_set_epi64x(transition2, transition1);
 }
 
 /*
@@ -148,98 +157,24 @@ acl_match_check_x4(int slot, const struct rte_acl_ctx *ctx, struct parms *parms,
 	xmm_t temp;
 
 	/* put low 32 bits of each transition into one register */
-	temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1, (__m128)*indices2,
+	temp = (xmm_t)_mm_shuffle_ps((__m128)*indices1, (__m128)*indices2,
 		0x88);
 	/* test for match node */
-	temp = MM_AND(match_mask, temp);
+	temp = _mm_and_si128(match_mask, temp);
 
-	while (!MM_TESTZ(temp, temp)) {
+	while (!_mm_testz_si128(temp, temp)) {
 		acl_process_matches(indices1, slot, ctx, parms, flows);
 		acl_process_matches(indices2, slot + 2, ctx, parms, flows);
 
-		temp = (xmm_t)MM_SHUFFLEPS((__m128)*indices1,
+		temp = (xmm_t)_mm_shuffle_ps((__m128)*indices1,
 					(__m128)*indices2,
 					0x88);
-		temp = MM_AND(match_mask, temp);
+		temp = _mm_and_si128(match_mask, temp);
 	}
 }
 
 /*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes don't move.
- */
-static inline __attribute__((always_inline)) xmm_t
-calc_addr_sse(xmm_t index_mask, xmm_t next_input, xmm_t shuffle_input,
-	xmm_t ones_16, xmm_t tr_lo, xmm_t tr_hi)
-{
-	xmm_t addr, node_types;
-	xmm_t dfa_msk, dfa_ofs, quad_ofs;
-	xmm_t in, r, t;
-
-	const xmm_t range_base = _mm_set_epi32(0xffffff0c, 0xffffff08,
-		0xffffff04, 0xffffff00);
-
-	/*
-	 * Note that no transition is done for a match
-	 * node and therefore a stream freezes when
-	 * it reaches a match.
-	 */
-
-	t = MM_XOR(index_mask, index_mask);
-
-	/* shuffle input byte to all 4 positions of 32 bit value */
-	in = MM_SHUFFLE8(next_input, shuffle_input);
-
-	/* Calc node type and node addr */
-	node_types = MM_ANDNOT(index_mask, tr_lo);
-	addr = MM_AND(index_mask, tr_lo);
-
-	/*
-	 * Calc addr for DFAs - addr = dfa_index + input_byte
-	 */
-
-	/* mask for DFA type (0) nodes */
-	dfa_msk = MM_CMPEQ32(node_types, t);
-
-	r = _mm_srli_epi32(in, 30);
-	r = _mm_add_epi8(r, range_base);
-
-	t = _mm_srli_epi32(in, 24);
-	r = _mm_shuffle_epi8(tr_hi, r);
-
-	dfa_ofs = _mm_sub_epi32(t, r);
-
-	/*
-	 * Calculate the number of range boundaries that are less than the
-	 * input value. Range boundaries for each node are in signed 8 bit,
-	 * ordered from -128 to 127.
-	 * This is effectively a popcnt of bytes that are greater than the
-	 * input byte.
-	 */
-
-	/* check ranges */
-	t = MM_CMPGT8(in, tr_hi);
-
-	/* convert -1 to 1 (bytes greater than input byte) */
-	t = MM_SIGN8(t, t);
-
-	/* horizontal add pairs of bytes into words */
-	t = MM_MADD8(t, t);
-
-	/* horizontal add pairs of words into dwords */
-	quad_ofs = MM_MADD16(t, ones_16);
-
-	/* blend DFA and QUAD/SINGLE. */
-	t = _mm_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
-
-	/* add index into node position */
-	return MM_ADD32(addr, t);
-}
-
-/*
- * Process 4 transitions (in 2 SIMD registers) in parallel
+ * Process 4 transitions (in 2 XMM registers) in parallel
  */
 static inline __attribute__((always_inline)) xmm_t
 transition4(xmm_t next_input, const uint64_t *trans,
@@ -249,39 +184,36 @@ transition4(xmm_t next_input, const uint64_t *trans,
 	uint64_t trans0, trans2;
 
 	/* Shuffle low 32 into tr_lo and high 32 into tr_hi */
-	tr_lo = (xmm_t)_mm_shuffle_ps((__m128)*indices1, (__m128)*indices2,
-		0x88);
-	tr_hi = (xmm_t)_mm_shuffle_ps((__m128)*indices1, (__m128)*indices2,
-		0xdd);
+	ACL_TR_HILO(mm, __m128, *indices1, *indices2, tr_lo, tr_hi);
 
 	 /* Calculate the address (array index) for all 4 transitions. */
-
-	addr = calc_addr_sse(xmm_index_mask.x, next_input, xmm_shuffle_input.x,
-		xmm_ones_16.x, tr_lo, tr_hi);
+	ACL_TR_CALC_ADDR(mm, 128, addr, xmm_index_mask.x, next_input,
+		xmm_shuffle_input.x, xmm_ones_16.x, xmm_range_base.x,
+		tr_lo, tr_hi);
 
 	 /* Gather 64 bit transitions and pack back into 2 registers. */
 
-	trans0 = trans[MM_CVT32(addr)];
+	trans0 = trans[_mm_cvtsi128_si32(addr)];
 
 	/* get slot 2 */
 
 	/* {x0, x1, x2, x3} -> {x2, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT2);
-	trans2 = trans[MM_CVT32(addr)];
+	addr = _mm_shuffle_epi32(addr, SHUFFLE32_SLOT2);
+	trans2 = trans[_mm_cvtsi128_si32(addr)];
 
 	/* get slot 1 */
 
 	/* {x2, x1, x2, x3} -> {x1, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT1);
-	*indices1 = MM_SET64(trans[MM_CVT32(addr)], trans0);
+	addr = _mm_shuffle_epi32(addr, SHUFFLE32_SLOT1);
+	*indices1 = _mm_set_epi64x(trans[_mm_cvtsi128_si32(addr)], trans0);
 
 	/* get slot 3 */
 
 	/* {x1, x1, x2, x3} -> {x3, x1, x2, x3} */
-	addr = MM_SHUFFLE32(addr, SHUFFLE32_SLOT3);
-	*indices2 = MM_SET64(trans[MM_CVT32(addr)], trans2);
+	addr = _mm_shuffle_epi32(addr, SHUFFLE32_SLOT3);
+	*indices2 = _mm_set_epi64x(trans[_mm_cvtsi128_si32(addr)], trans2);
 
-	return MM_SRL32(next_input, CHAR_BIT);
+	return _mm_srli_epi32(next_input, CHAR_BIT);
 }
 
 /*
@@ -314,11 +246,11 @@ search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	 * indices4 contains index_array[6,7]
 	 */
 
-	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
+	indices1 = _mm_loadu_si128((xmm_t *) &index_array[0]);
+	indices2 = _mm_loadu_si128((xmm_t *) &index_array[2]);
 
-	indices3 = MM_LOADU((xmm_t *) &index_array[4]);
-	indices4 = MM_LOADU((xmm_t *) &index_array[6]);
+	indices3 = _mm_loadu_si128((xmm_t *) &index_array[4]);
+	indices4 = _mm_loadu_si128((xmm_t *) &index_array[6]);
 
 	 /* Check for any matches. */
 	acl_match_check_x4(0, ctx, parms, &flows,
@@ -332,14 +264,14 @@ search_sse_8(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		input0 = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 0));
 		input1 = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 4));
 
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 1), 1);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 5), 1);
+		input0 = _mm_insert_epi32(input0, GET_NEXT_4BYTES(parms, 1), 1);
+		input1 = _mm_insert_epi32(input1, GET_NEXT_4BYTES(parms, 5), 1);
 
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 2), 2);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 6), 2);
+		input0 = _mm_insert_epi32(input0, GET_NEXT_4BYTES(parms, 2), 2);
+		input1 = _mm_insert_epi32(input1, GET_NEXT_4BYTES(parms, 6), 2);
 
-		input0 = MM_INSERT32(input0, GET_NEXT_4BYTES(parms, 3), 3);
-		input1 = MM_INSERT32(input1, GET_NEXT_4BYTES(parms, 7), 3);
+		input0 = _mm_insert_epi32(input0, GET_NEXT_4BYTES(parms, 3), 3);
+		input1 = _mm_insert_epi32(input1, GET_NEXT_4BYTES(parms, 7), 3);
 
 		 /* Process the 4 bytes of input on each stream. */
 
@@ -395,8 +327,8 @@ search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		index_array[n] = acl_start_next_trie(&flows, parms, n, ctx);
 	}
 
-	indices1 = MM_LOADU((xmm_t *) &index_array[0]);
-	indices2 = MM_LOADU((xmm_t *) &index_array[2]);
+	indices1 = _mm_loadu_si128((xmm_t *) &index_array[0]);
+	indices2 = _mm_loadu_si128((xmm_t *) &index_array[2]);
 
 	/* Check for any matches. */
 	acl_match_check_x4(0, ctx, parms, &flows,
@@ -406,9 +338,9 @@ search_sse_4(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 		/* Gather 4 bytes of input data for each stream. */
 		input = _mm_cvtsi32_si128(GET_NEXT_4BYTES(parms, 0));
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 1), 1);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 2), 2);
-		input = MM_INSERT32(input, GET_NEXT_4BYTES(parms, 3), 3);
+		input = _mm_insert_epi32(input, GET_NEXT_4BYTES(parms, 1), 1);
+		input = _mm_insert_epi32(input, GET_NEXT_4BYTES(parms, 2), 2);
+		input = _mm_insert_epi32(input, GET_NEXT_4BYTES(parms, 3), 3);
 
 		/* Process the 4 bytes of input on each stream. */
 		input = transition4(input, flows.trans, &indices1, &indices2);
diff --git a/lib/librte_acl/acl_vect.h b/lib/librte_acl/acl_vect.h
index d813600..6cc1999 100644
--- a/lib/librte_acl/acl_vect.h
+++ b/lib/librte_acl/acl_vect.h
@@ -44,86 +44,70 @@
 extern "C" {
 #endif
 
-#define	MM_ADD16(a, b)		_mm_add_epi16(a, b)
-#define	MM_ADD32(a, b)		_mm_add_epi32(a, b)
-#define	MM_ALIGNR8(a, b, c)	_mm_alignr_epi8(a, b, c)
-#define	MM_AND(a, b)		_mm_and_si128(a, b)
-#define MM_ANDNOT(a, b)		_mm_andnot_si128(a, b)
-#define MM_BLENDV8(a, b, c)	_mm_blendv_epi8(a, b, c)
-#define MM_CMPEQ16(a, b)	_mm_cmpeq_epi16(a, b)
-#define MM_CMPEQ32(a, b)	_mm_cmpeq_epi32(a, b)
-#define	MM_CMPEQ8(a, b)		_mm_cmpeq_epi8(a, b)
-#define MM_CMPGT32(a, b)	_mm_cmpgt_epi32(a, b)
-#define MM_CMPGT8(a, b)		_mm_cmpgt_epi8(a, b)
-#define MM_CVT(a)		_mm_cvtsi32_si128(a)
-#define	MM_CVT32(a)		_mm_cvtsi128_si32(a)
-#define MM_CVTU32(a)		_mm_cvtsi32_si128(a)
-#define	MM_INSERT16(a, c, b)	_mm_insert_epi16(a, c, b)
-#define	MM_INSERT32(a, c, b)	_mm_insert_epi32(a, c, b)
-#define	MM_LOAD(a)		_mm_load_si128(a)
-#define	MM_LOADH_PI(a, b)	_mm_loadh_pi(a, b)
-#define	MM_LOADU(a)		_mm_loadu_si128(a)
-#define	MM_MADD16(a, b)		_mm_madd_epi16(a, b)
-#define	MM_MADD8(a, b)		_mm_maddubs_epi16(a, b)
-#define	MM_MOVEMASK8(a)		_mm_movemask_epi8(a)
-#define MM_OR(a, b)		_mm_or_si128(a, b)
-#define	MM_SET1_16(a)		_mm_set1_epi16(a)
-#define	MM_SET1_32(a)		_mm_set1_epi32(a)
-#define	MM_SET1_64(a)		_mm_set1_epi64(a)
-#define	MM_SET1_8(a)		_mm_set1_epi8(a)
-#define	MM_SET32(a, b, c, d)	_mm_set_epi32(a, b, c, d)
-#define	MM_SHUFFLE32(a, b)	_mm_shuffle_epi32(a, b)
-#define	MM_SHUFFLE8(a, b)	_mm_shuffle_epi8(a, b)
-#define	MM_SHUFFLEPS(a, b, c)	_mm_shuffle_ps(a, b, c)
-#define	MM_SIGN8(a, b)		_mm_sign_epi8(a, b)
-#define	MM_SLL64(a, b)		_mm_sll_epi64(a, b)
-#define	MM_SRL128(a, b)		_mm_srli_si128(a, b)
-#define MM_SRL16(a, b)		_mm_srli_epi16(a, b)
-#define	MM_SRL32(a, b)		_mm_srli_epi32(a, b)
-#define	MM_STORE(a, b)		_mm_store_si128(a, b)
-#define	MM_STOREU(a, b)		_mm_storeu_si128(a, b)
-#define	MM_TESTZ(a, b)		_mm_testz_si128(a, b)
-#define	MM_XOR(a, b)		_mm_xor_si128(a, b)
-
-#define	MM_SET16(a, b, c, d, e, f, g, h)	\
-	_mm_set_epi16(a, b, c, d, e, f, g, h)
-
-#define	MM_SET8(c0, c1, c2, c3, c4, c5, c6, c7,	\
-		c8, c9, cA, cB, cC, cD, cE, cF)	\
-	_mm_set_epi8(c0, c1, c2, c3, c4, c5, c6, c7,	\
-		c8, c9, cA, cB, cC, cD, cE, cF)
-
-#ifdef RTE_ARCH_X86_64
-
-#define	MM_CVT64(a)		_mm_cvtsi128_si64(a)
-
-#else
-
-#define	MM_CVT64(a)	({ \
-	rte_xmm_t m;       \
-	m.m = (a);         \
-	(m.u64[0]);        \
-})
-
-#endif /*RTE_ARCH_X86_64 */
 
 /*
- * Prior to version 12.1 icc doesn't support _mm_set_epi64x.
+ * Takes 2 SIMD registers containing N transitions each (tr0, tr1).
+ * Shuffles them into a different representation:
+ * lo - contains low 32 bits of given N transitions.
+ * hi - contains high 32 bits of given N transitions.
  */
-#if (defined(__ICC) && __ICC < 1210)
+#define	ACL_TR_HILO(P, TC, tr0, tr1, lo, hi)                        do { \
+	lo = (typeof(lo))_##P##_shuffle_ps((TC)(tr0), (TC)(tr1), 0x88);  \
+	hi = (typeof(hi))_##P##_shuffle_ps((TC)(tr0), (TC)(tr1), 0xdd);  \
+} while (0)
 
-#define	MM_SET64(a, b)	({ \
-	rte_xmm_t m;       \
-	m.u64[0] = b;      \
-	m.u64[1] = a;      \
-	(m.m);             \
-})
 
-#else
-
-#define	MM_SET64(a, b)		_mm_set_epi64x(a, b)
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate the number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same way as quad range nodes.
+ */
+#define ACL_TR_CALC_ADDR(P, S,					\
+	addr, index_mask, next_input, shuffle_input,		\
+	ones_16, range_base, tr_lo, tr_hi)               do {	\
+								\
+	typeof(addr) in, node_type, r, t;			\
+	typeof(addr) dfa_msk, dfa_ofs, quad_ofs;		\
+								\
+	t = _##P##_xor_si##S(index_mask, index_mask);		\
+	in = _##P##_shuffle_epi8(next_input, shuffle_input);	\
+								\
+	/* Calc node type and node addr */			\
+	node_type = _##P##_andnot_si##S(index_mask, tr_lo);	\
+	addr = _##P##_and_si##S(index_mask, tr_lo);		\
+								\
+	/* mask for DFA type(0) nodes */			\
+	dfa_msk = _##P##_cmpeq_epi32(node_type, t);		\
+								\
+	/* DFA calculations. */					\
+	r = _##P##_srli_epi32(in, 30);				\
+	r = _##P##_add_epi8(r, range_base);			\
+	t = _##P##_srli_epi32(in, 24);				\
+	r = _##P##_shuffle_epi8(tr_hi, r);			\
+								\
+	dfa_ofs = _##P##_sub_epi32(t, r);			\
+								\
+	/* QUAD/SINGLE calculations. */				\
+	t = _##P##_cmpgt_epi8(in, tr_hi);			\
+	t = _##P##_sign_epi8(t, t);				\
+	t = _##P##_maddubs_epi16(t, t);				\
+	quad_ofs = _##P##_madd_epi16(t, ones_16);		\
+								\
+	/* blend DFA and QUAD/SINGLE. */			\
+	t = _##P##_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);	\
+								\
+	/* calculate address for next transitions. */		\
+	addr = _##P##_add_epi32(addr, t);			\
+} while (0)
 
-#endif /* (defined(__ICC) && __ICC < 1210) */
 
 #ifdef __cplusplus
 }
diff --git a/lib/librte_eal/common/include/rte_common_vect.h b/lib/librte_eal/common/include/rte_common_vect.h
index 617470b..54ec70f 100644
--- a/lib/librte_eal/common/include/rte_common_vect.h
+++ b/lib/librte_eal/common/include/rte_common_vect.h
@@ -109,6 +109,18 @@ typedef union rte_ymm {
 })
 #endif
 
+/*
+ * Prior to version 12.1 icc doesn't support _mm_set_epi64x.
+ */
+#if (defined(__ICC) && __ICC < 1210)
+#define _mm_set_epi64x(a, b)  ({ \
+	rte_xmm_t m;             \
+	m.u64[0] = b;            \
+	m.u64[1] = a;            \
+	(m.x);                   \
+})
+#endif /* (defined(__ICC) && __ICC < 1210) */
+
 #ifdef __cplusplus
 }
 #endif
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 16/18] libte_acl: introduce max_size into rte_acl_config.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (14 preceding siblings ...)
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 15/18] libte_acl: make calc_addr a define to deduplicate the code Konstantin Ananyev
@ 2015-01-20 18:41 ` Konstantin Ananyev
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 17/18] libte_acl: remove unused macros Konstantin Ananyev
                   ` (4 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:41 UTC (permalink / raw)
  To: dev

If at the build phase we don't do any trie splitting,
then the temporary build structures and the resulting RT structure might be
much bigger than they are now.
On the other hand, having just one trie instead of several can speed up
search quite significantly.
From my measurements on rule-sets with ~10K rules:
the RT table is up to 8 times bigger, while classify() is up to 80% faster
than the current implementation.
To make it possible for the user to decide about the performance/space
trade-off, a new parameter for the build config structure (max_size)
is introduced.
Setting it to a value greater than zero instructs rte_acl_build() to:
- make sure that the size of the RT table doesn't exceed the given value;
- attempt to minimise the number of tries in the table.
Setting it to zero maintains the current behaviour.
That introduces a minor change in the public API, but I think the possible
performance gain is too big to ignore.
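
A minimal sketch of how an application could opt in (assuming acx was
created with rte_acl_create() and the rules were already added; the 4MB
limit is purely illustrative):

	struct rte_acl_config cfg;

	memset(&cfg, 0, sizeof(cfg));
	/* fill in num_fields, defs and num_categories as before ... */

	/* Cap the RT table at 4MB and let rte_acl_build()
	 * minimise the number of tries within that limit. */
	cfg.max_size = 4 * 1024 * 1024;

	if (rte_acl_build(acx, &cfg) != 0)
		rte_exit(EXIT_FAILURE, "ACL build failed\n");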

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c       |  33 ++++++++----
 examples/l3fwd-acl/main.c |   3 +-
 lib/librte_acl/acl.h      |   2 +-
 lib/librte_acl/acl_bld.c  | 134 +++++++++++++++++++++++++++++-----------------
 lib/librte_acl/acl_gen.c  |  22 +++++---
 lib/librte_acl/rte_acl.c  |   1 +
 lib/librte_acl/rte_acl.h  |   2 +
 7 files changed, 131 insertions(+), 66 deletions(-)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index 52f43c6..5e8db04 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -85,6 +85,7 @@
 #define	OPT_SEARCH_ALG		"alg"
 #define	OPT_BLD_CATEGORIES	"bldcat"
 #define	OPT_RUN_CATEGORIES	"runcat"
+#define	OPT_MAX_SIZE		"maxsize"
 #define	OPT_ITER_NUM		"iter"
 #define	OPT_VERBOSE		"verbose"
 #define	OPT_IPV6		"ipv6"
@@ -126,6 +127,7 @@ static struct {
 	const char         *prgname;
 	const char         *rule_file;
 	const char         *trace_file;
+	size_t              max_size;
 	uint32_t            bld_categories;
 	uint32_t            run_categories;
 	uint32_t            nb_rules;
@@ -780,6 +782,8 @@ acx_init(void)
 	FILE *f;
 	struct rte_acl_config cfg;
 
+	memset(&cfg, 0, sizeof(cfg));
+
 	/* setup ACL build config. */
 	if (config.ipv6) {
 		cfg.num_fields = RTE_DIM(ipv6_defs);
@@ -789,6 +793,7 @@ acx_init(void)
 		memcpy(&cfg.defs, ipv4_defs, sizeof(ipv4_defs));
 	}
 	cfg.num_categories = config.bld_categories;
+	cfg.max_size = config.max_size;
 
 	/* setup ACL creation parameters. */
 	prm.rule_size = RTE_ACL_RULE_SZ(cfg.num_fields);
@@ -899,8 +904,8 @@ search_ip5tuples(__attribute__((unused)) void *arg)
 	return 0;
 }
 
-static uint32_t
-get_uint32_opt(const char *opt, const char *name, uint32_t min, uint32_t max)
+static unsigned long
+get_ulong_opt(const char *opt, const char *name, size_t min, size_t max)
 {
 	unsigned long val;
 	char *end;
@@ -964,6 +969,9 @@ print_usage(const char *prgname)
 			"=<number of categories to run with> "
 			"should be either 1 or multiple of %zu, "
 			"but not greater then %u]\n"
+		"[--" OPT_MAX_SIZE
+			"=<size limit (in bytes) for runtime ACL strucutures> "
+			"leave 0 for default behaviour]\n"
 		"[--" OPT_ITER_NUM "=<number of iterations to perform>]\n"
 		"[--" OPT_VERBOSE "=<verbose level>]\n"
 		"[--" OPT_SEARCH_ALG "=%s]\n"
@@ -984,6 +992,7 @@ dump_config(FILE *f)
 	fprintf(f, "%s:%u\n", OPT_TRACE_STEP, config.trace_step);
 	fprintf(f, "%s:%u\n", OPT_BLD_CATEGORIES, config.bld_categories);
 	fprintf(f, "%s:%u\n", OPT_RUN_CATEGORIES, config.run_categories);
+	fprintf(f, "%s:%zu\n", OPT_MAX_SIZE, config.max_size);
 	fprintf(f, "%s:%u\n", OPT_ITER_NUM, config.iter_num);
 	fprintf(f, "%s:%u\n", OPT_VERBOSE, config.verbose);
 	fprintf(f, "%s:%u(%s)\n", OPT_SEARCH_ALG, config.alg.alg,
@@ -1010,6 +1019,7 @@ get_input_opts(int argc, char **argv)
 		{OPT_TRACE_FILE, 1, 0, 0},
 		{OPT_TRACE_NUM, 1, 0, 0},
 		{OPT_RULE_NUM, 1, 0, 0},
+		{OPT_MAX_SIZE, 1, 0, 0},
 		{OPT_TRACE_STEP, 1, 0, 0},
 		{OPT_BLD_CATEGORIES, 1, 0, 0},
 		{OPT_RUN_CATEGORIES, 1, 0, 0},
@@ -1034,29 +1044,32 @@ get_input_opts(int argc, char **argv)
 		} else if (strcmp(lgopts[opt_idx].name, OPT_TRACE_FILE) == 0) {
 			config.trace_file = optarg;
 		} else if (strcmp(lgopts[opt_idx].name, OPT_RULE_NUM) == 0) {
-			config.nb_rules = get_uint32_opt(optarg,
+			config.nb_rules = get_ulong_opt(optarg,
 				lgopts[opt_idx].name, 1, RTE_ACL_MAX_INDEX + 1);
+		} else if (strcmp(lgopts[opt_idx].name, OPT_MAX_SIZE) == 0) {
+			config.max_size = get_ulong_opt(optarg,
+				lgopts[opt_idx].name, 0, SIZE_MAX);
 		} else if (strcmp(lgopts[opt_idx].name, OPT_TRACE_NUM) == 0) {
-			config.nb_traces = get_uint32_opt(optarg,
+			config.nb_traces = get_ulong_opt(optarg,
 				lgopts[opt_idx].name, 1, UINT32_MAX);
 		} else if (strcmp(lgopts[opt_idx].name, OPT_TRACE_STEP) == 0) {
-			config.trace_step = get_uint32_opt(optarg,
+			config.trace_step = get_ulong_opt(optarg,
 				lgopts[opt_idx].name, 1, TRACE_STEP_MAX);
 		} else if (strcmp(lgopts[opt_idx].name,
 				OPT_BLD_CATEGORIES) == 0) {
-			config.bld_categories = get_uint32_opt(optarg,
+			config.bld_categories = get_ulong_opt(optarg,
 				lgopts[opt_idx].name, 1,
 				RTE_ACL_MAX_CATEGORIES);
 		} else if (strcmp(lgopts[opt_idx].name,
 				OPT_RUN_CATEGORIES) == 0) {
-			config.run_categories = get_uint32_opt(optarg,
+			config.run_categories = get_ulong_opt(optarg,
 				lgopts[opt_idx].name, 1,
 				RTE_ACL_MAX_CATEGORIES);
 		} else if (strcmp(lgopts[opt_idx].name, OPT_ITER_NUM) == 0) {
-			config.iter_num = get_uint32_opt(optarg,
-				lgopts[opt_idx].name, 1, UINT16_MAX);
+			config.iter_num = get_ulong_opt(optarg,
+				lgopts[opt_idx].name, 1, INT32_MAX);
 		} else if (strcmp(lgopts[opt_idx].name, OPT_VERBOSE) == 0) {
-			config.verbose = get_uint32_opt(optarg,
+			config.verbose = get_ulong_opt(optarg,
 				lgopts[opt_idx].name, DUMP_NONE, DUMP_MAX);
 		} else if (strcmp(lgopts[opt_idx].name,
 				OPT_SEARCH_ALG) == 0) {
diff --git a/examples/l3fwd-acl/main.c b/examples/l3fwd-acl/main.c
index 022ccab..f1f7601 100644
--- a/examples/l3fwd-acl/main.c
+++ b/examples/l3fwd-acl/main.c
@@ -1178,8 +1178,9 @@ setup_acl(struct rte_acl_rule *route_base,
 			rte_exit(EXIT_FAILURE, "add rules failed\n");
 
 	/* Perform builds */
-	acl_build_param.num_categories = DEFAULT_MAX_CATEGORIES;
+	memset(&acl_build_param, 0, sizeof(acl_build_param));
 
+	acl_build_param.num_categories = DEFAULT_MAX_CATEGORIES;
 	acl_build_param.num_fields = dim;
 	memcpy(&acl_build_param.defs, ipv6 ? ipv6_defs : ipv4_defs,
 		ipv6 ? sizeof(ipv6_defs) : sizeof(ipv4_defs));
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index d33d7ad..61b849a 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -180,7 +180,7 @@ struct rte_acl_ctx {
 
 int rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	struct rte_acl_bld_trie *node_bld_trie, uint32_t num_tries,
-	uint32_t num_categories, uint32_t data_index_sz);
+	uint32_t num_categories, uint32_t data_index_sz, size_t max_size);
 
 typedef int (*rte_acl_classify_t)
 (const struct rte_acl_ctx *, const uint8_t **, uint32_t *, uint32_t, uint32_t);
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index 1fd59ee..1fe79fb 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -41,10 +41,9 @@
 /* number of pointers per alloc */
 #define ACL_PTR_ALLOC	32
 
-/* variable for dividing rule sets */
-#define NODE_MAX	2500
-#define NODE_PERCENTAGE	(0.40)
-#define RULE_PERCENTAGE	(0.40)
+/* macros for rule-set dividing heuristics */
+#define NODE_MAX	0x4000
+#define NODE_MIN	0x800
 
 /* TALLY are statistics per field */
 enum {
@@ -97,6 +96,7 @@ struct acl_build_context {
 	const struct rte_acl_ctx *acx;
 	struct rte_acl_build_rule *build_rules;
 	struct rte_acl_config     cfg;
+	int32_t                   node_max;
 	uint32_t                  node;
 	uint32_t                  num_nodes;
 	uint32_t                  category_mask;
@@ -1447,7 +1447,7 @@ build_trie(struct acl_build_context *context, struct rte_acl_build_rule *head,
 			return NULL;
 
 		node_count = context->num_nodes - node_count;
-		if (node_count > NODE_MAX) {
+		if (node_count > context->node_max) {
 			*last = prev;
 			return trie;
 		}
@@ -1628,6 +1628,9 @@ rule_cmp_wildness(struct rte_acl_build_rule *r1, struct rte_acl_build_rule *r2)
 	return 0;
 }
 
+/*
+ * Sort the list of rules based on each rule's wildness.
+ */
 static struct rte_acl_build_rule *
 sort_rules(struct rte_acl_build_rule *head)
 {
@@ -1636,21 +1639,22 @@ sort_rules(struct rte_acl_build_rule *head)
 
 	new_head = NULL;
 	while (head != NULL) {
+
+		/* remove element from the head of the old list. */
 		r = head;
 		head = r->next;
 		r->next = NULL;
-		if (new_head == NULL) {
-			new_head = r;
-		} else {
-			for (p = &new_head;
-					(l = *p) != NULL &&
-					rule_cmp_wildness(l, r) >= 0;
-					p = &l->next)
-				;
-
-			r->next = *p;
-			*p = r;
-		}
+
+		/* walk through new sorted list to find a proper place. */
+		for (p = &new_head;
+				(l = *p) != NULL &&
+				rule_cmp_wildness(l, r) >= 0;
+				p = &l->next)
+			;
+
+		/* insert element into the new sorted list. */
+		r->next = *p;
+		*p = r;
 	}
 
 	return new_head;
@@ -1789,9 +1793,11 @@ acl_build_log(const struct acl_build_context *ctx)
 	uint32_t n;
 
 	RTE_LOG(DEBUG, ACL, "Build phase for ACL \"%s\":\n"
+		"node limit for tree split: %u\n"
 		"nodes created: %u\n"
 		"memory consumed: %zu\n",
 		ctx->acx->name,
+		ctx->node_max,
 		ctx->num_nodes,
 		ctx->pool.alloc);
 
@@ -1868,11 +1874,48 @@ acl_set_data_indexes(struct rte_acl_ctx *ctx)
 	}
 }
 
+/*
+ * Internal routine, performs the 'build' phase of trie generation:
+ * - sets up the build context.
+ * - analyzes the given set of rules.
+ * - builds the internal trie(s).
+ */
+static int
+acl_bld(struct acl_build_context *bcx, struct rte_acl_ctx *ctx,
+	const struct rte_acl_config *cfg, uint32_t node_max)
+{
+	int32_t rc;
+
+	/* setup build context. */
+	memset(bcx, 0, sizeof(*bcx));
+	bcx->acx = ctx;
+	bcx->pool.alignment = ACL_POOL_ALIGN;
+	bcx->pool.min_alloc = ACL_POOL_ALLOC_MIN;
+	bcx->cfg = *cfg;
+	bcx->category_mask = LEN2MASK(bcx->cfg.num_categories);
+	bcx->node_max = node_max;
+
+	/* Create a build rules copy. */
+	rc = acl_build_rules(bcx);
+	if (rc != 0)
+		return rc;
+
+	/* No rules to build for that context+config */
+	if (bcx->build_rules == NULL) {
+		rc = -EINVAL;
+	} else {
+		/* build internal trie representation. */
+		rc = acl_build_tries(bcx, bcx->build_rules);
+	}
+	return rc;
+}
 
 int
 rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 {
-	int rc;
+	int32_t rc;
+	uint32_t n;
+	size_t max_size;
 	struct acl_build_context bcx;
 
 	if (ctx == NULL || cfg == NULL || cfg->num_categories == 0 ||
@@ -1881,44 +1924,39 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 
 	acl_build_reset(ctx);
 
-	memset(&bcx, 0, sizeof(bcx));
-	bcx.acx = ctx;
-	bcx.pool.alignment = ACL_POOL_ALIGN;
-	bcx.pool.min_alloc = ACL_POOL_ALLOC_MIN;
-	bcx.cfg = *cfg;
-	bcx.category_mask = LEN2MASK(bcx.cfg.num_categories);
-
-
-	/* Create a build rules copy. */
-	rc = acl_build_rules(&bcx);
-	if (rc != 0)
-		return rc;
+	if (cfg->max_size == 0) {
+		n = NODE_MIN;
+		max_size = SIZE_MAX;
+	} else {
+		n = NODE_MAX;
+		max_size = cfg->max_size;
+	}
 
-	/* No rules to build for that context+config */
-	if (bcx.build_rules == NULL) {
-		rc = -EINVAL;
+	for (rc = -ERANGE; n >= NODE_MIN && rc == -ERANGE; n /= 2) {
 
-	/* build internal trie representation. */
-	} else if ((rc = acl_build_tries(&bcx, bcx.build_rules)) == 0) {
+		/* perform build phase. */
+		rc = acl_bld(&bcx, ctx, cfg, n);
 
-		/* allocate and fill run-time  structures. */
-		rc = rte_acl_gen(ctx, bcx.tries, bcx.bld_tries,
+		if (rc == 0) {
+			/* allocate and fill run-time structures. */
+			rc = rte_acl_gen(ctx, bcx.tries, bcx.bld_tries,
 				bcx.num_tries, bcx.cfg.num_categories,
 				RTE_ACL_MAX_FIELDS * RTE_DIM(bcx.tries) *
-				sizeof(ctx->data_indexes[0]));
-		if (rc == 0) {
+				sizeof(ctx->data_indexes[0]), max_size);
+			if (rc == 0) {
+				/* set data indexes. */
+				acl_set_data_indexes(ctx);
 
-			/* set data indexes. */
-			acl_set_data_indexes(ctx);
-
-			/* copy in build config. */
-			ctx->config = *cfg;
+				/* copy in build config. */
+				ctx->config = *cfg;
+			}
 		}
-	}
 
-	acl_build_log(&bcx);
+		acl_build_log(&bcx);
+
+		/* cleanup after build. */
+		tb_free_pool(&bcx.pool);
+	}
 
-	/* cleanup after build. */
-	tb_free_pool(&bcx.pool);
 	return rc;
 }
diff --git a/lib/librte_acl/acl_gen.c b/lib/librte_acl/acl_gen.c
index d3def66..ea557ab 100644
--- a/lib/librte_acl/acl_gen.c
+++ b/lib/librte_acl/acl_gen.c
@@ -32,7 +32,6 @@
  */
 
 #include <rte_acl.h>
-#include "acl_vect.h"
 #include "acl.h"
 
 #define	QRANGE_MIN	((uint8_t)INT8_MIN)
@@ -63,7 +62,8 @@ struct rte_acl_indices {
 static void
 acl_gen_log_stats(const struct rte_acl_ctx *ctx,
 	const struct acl_node_counters *counts,
-	const struct rte_acl_indices *indices)
+	const struct rte_acl_indices *indices,
+	size_t max_size)
 {
 	RTE_LOG(DEBUG, ACL, "Gen phase for ACL \"%s\":\n"
 		"runtime memory footprint on socket %d:\n"
@@ -71,7 +71,8 @@ acl_gen_log_stats(const struct rte_acl_ctx *ctx,
 		"quad nodes/vectors/bytes used: %d/%d/%zu\n"
 		"DFA nodes/group64/bytes used: %d/%d/%zu\n"
 		"match nodes/bytes used: %d/%zu\n"
-		"total: %zu bytes\n",
+		"total: %zu bytes\n"
+		"max limit: %zu bytes\n",
 		ctx->name, ctx->socket_id,
 		counts->single, counts->single * sizeof(uint64_t),
 		counts->quad, counts->quad_vectors,
@@ -80,7 +81,8 @@ acl_gen_log_stats(const struct rte_acl_ctx *ctx,
 		indices->dfa_index * sizeof(uint64_t),
 		counts->match,
 		counts->match * sizeof(struct rte_acl_match_results),
-		ctx->mem_sz);
+		ctx->mem_sz,
+		max_size);
 }
 
 static uint64_t
@@ -474,7 +476,7 @@ acl_calc_counts_indices(struct acl_node_counters *counts,
 int
 rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	struct rte_acl_bld_trie *node_bld_trie, uint32_t num_tries,
-	uint32_t num_categories, uint32_t data_index_sz)
+	uint32_t num_categories, uint32_t data_index_sz, size_t max_size)
 {
 	void *mem;
 	size_t total_size;
@@ -496,6 +498,14 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 		(counts.match + 1) * sizeof(struct rte_acl_match_results) +
 		XMM_SIZE;
 
+	if (total_size > max_size) {
+		RTE_LOG(DEBUG, ACL,
+			"Gen phase for ACL ctx \"%s\" exceeds max_size limit, "
+			"bytes required: %zu, allowed: %zu\n",
+			ctx->name, total_size, max_size);
+		return -ERANGE;
+	}
+
 	mem = rte_zmalloc_socket(ctx->name, total_size, RTE_CACHE_LINE_SIZE,
 			ctx->socket_id);
 	if (mem == NULL) {
@@ -546,6 +556,6 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	ctx->trans_table = node_array;
 	memcpy(ctx->trie, trie, sizeof(ctx->trie));
 
-	acl_gen_log_stats(ctx, &counts, &indices);
+	acl_gen_log_stats(ctx, &counts, &indices, max_size);
 	return 0;
 }
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index a9cd349..7d10301 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -543,6 +543,7 @@ rte_acl_ipv4vlan_build(struct rte_acl_ctx *ctx,
 	if (ctx == NULL || layout == NULL)
 		return -EINVAL;
 
+	memset(&cfg, 0, sizeof(cfg));
 	acl_ipv4vlan_config(&cfg, layout, num_categories);
 	return rte_acl_build(ctx, &cfg);
 }
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index 652a234..30aea03 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -94,6 +94,8 @@ struct rte_acl_config {
 	uint32_t num_fields;     /**< Number of field definitions. */
 	struct rte_acl_field_def defs[RTE_ACL_MAX_FIELDS];
 	/**< array of field definitions. */
+	size_t max_size;
+	/**< max memory limit for internal run-time structures (0 for no limit). */
 };
 
 /**
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread
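
A minimal usage sketch of the new max_size field follows. This is not part of
the patch itself: the field definitions ("defs") passed in and the 4MB cap are
placeholder assumptions for illustration; rules are assumed to have been added
to the context with rte_acl_add_rules() beforehand.

#include <string.h>
#include <rte_acl.h>

static int
build_with_size_limit(struct rte_acl_ctx *ctx,
	const struct rte_acl_field_def *defs, uint32_t num_fields,
	size_t defs_size)
{
	struct rte_acl_config cfg;

	/* zero the whole config so new fields such as max_size
	 * start out well defined. */
	memset(&cfg, 0, sizeof(cfg));

	cfg.num_categories = 1;
	cfg.num_fields = num_fields;
	memcpy(&cfg.defs, defs, defs_size);

	/* cap run-time structures at 4MB (example value);
	 * leaving max_size at 0 keeps the old, unlimited behaviour. */
	cfg.max_size = 4 * 1024 * 1024;

	/* with max_size set, rte_acl_build() retries with progressively
	 * smaller tries (node_max halved from NODE_MAX down to NODE_MIN)
	 * and returns -ERANGE if even the most compact layout does not
	 * fit within max_size. */
	return rte_acl_build(ctx, &cfg);
}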

* [dpdk-dev] [PATCH v3 17/18] libte_acl: remove unused macros.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (15 preceding siblings ...)
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 16/18] libte_acl: introduce max_size into rte_acl_config Konstantin Ananyev
@ 2015-01-20 18:41 ` Konstantin Ananyev
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 18/18] libte_acl: add some comments about ACL internal layout Konstantin Ananyev
                   ` (3 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:41 UTC (permalink / raw)
  To: dev

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h     | 1 -
 lib/librte_acl/acl_run.h | 1 -
 2 files changed, 2 deletions(-)

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 61b849a..217bab3 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -62,7 +62,6 @@ struct rte_acl_bitset {
 
 #define	RTE_ACL_NODE_DFA	(0 << RTE_ACL_TYPE_SHIFT)
 #define	RTE_ACL_NODE_SINGLE	(1U << RTE_ACL_TYPE_SHIFT)
-#define	RTE_ACL_NODE_QEXACT	(2U << RTE_ACL_TYPE_SHIFT)
 #define	RTE_ACL_NODE_QRANGE	(3U << RTE_ACL_TYPE_SHIFT)
 #define	RTE_ACL_NODE_MATCH	(4U << RTE_ACL_TYPE_SHIFT)
 #define	RTE_ACL_NODE_TYPE	(7U << RTE_ACL_TYPE_SHIFT)
diff --git a/lib/librte_acl/acl_run.h b/lib/librte_acl/acl_run.h
index 850bc81..b2fc42c 100644
--- a/lib/librte_acl/acl_run.h
+++ b/lib/librte_acl/acl_run.h
@@ -40,7 +40,6 @@
 #define MAX_SEARCHES_AVX16	16
 #define MAX_SEARCHES_SSE8	8
 #define MAX_SEARCHES_SSE4	4
-#define MAX_SEARCHES_SSE2	2
 #define MAX_SEARCHES_SCALAR	2
 
 #define GET_NEXT_4BYTES(prm, idx)	\
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

* [dpdk-dev] [PATCH v3 18/18] libte_acl: add some comments about ACL internal layout.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (16 preceding siblings ...)
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 17/18] libte_acl: remove unused macros Konstantin Ananyev
@ 2015-01-20 18:41 ` Konstantin Ananyev
  2015-01-22 18:54 ` [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Neil Horman
                   ` (2 subsequent siblings)
  20 siblings, 0 replies; 27+ messages in thread
From: Konstantin Ananyev @ 2015-01-20 18:41 UTC (permalink / raw)
  To: dev

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 217bab3..4dadab5 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -68,6 +68,44 @@ struct rte_acl_bitset {
 #define	RTE_ACL_NODE_UNDEFINED	UINT32_MAX
 
 /*
+ * The ACL RT structure is a set of multibit tries (with stride == 8)
+ * represented by an array of transitions. The next position is calculated
+ * based on the current position and the input byte.
+ * Each transition is a 64-bit value with the following format:
+ * | node_type_specific : 32 | node_type : 3 | node_addr : 29 |
+ * For all node types except RTE_ACL_NODE_MATCH, node_addr is an index
+ * to the start of the node in the transitions array.
+ * A few different node types are used:
+ * RTE_ACL_NODE_MATCH:
+ * node_addr value is an index into an array that contains the return value
+ * and its priority for each category.
+ * Upper 32 bits of the transition value are not used for that node type.
+ * RTE_ACL_NODE_QRANGE:
+ * that node consists of up to 5 transitions.
+ * Upper 32 bits are interpreted as 4 signed character values which
+ * are ordered from smallest (INT8_MIN) to largest (INT8_MAX).
+ * These values define 5 ranges:
+ * INT8_MIN <= range[0]  <= ((int8_t *)&transition)[4]
+ * ((int8_t *)&transition)[4] < range[1] <= ((int8_t *)&transition)[5]
+ * ((int8_t *)&transition)[5] < range[2] <= ((int8_t *)&transition)[6]
+ * ((int8_t *)&transition)[6] < range[3] <= ((int8_t *)&transition)[7]
+ * ((int8_t *)&transition)[7] < range[4] <= INT8_MAX
+ * So for an input byte value within range[i], the i-th transition within
+ * that node will be used.
+ * RTE_ACL_NODE_SINGLE:
+ * always transitions to the same node regardless of the input value.
+ * RTE_ACL_NODE_DFA:
+ * that node consists of up to 256 transitions.
+ * In an attempt to conserve space all transitions are divided into 4
+ * consecutive groups, 64 transitions per group:
+ * group64[i] contains transitions[i * 64, .. i * 64 + 63].
+ * Upper 32 bits are interpreted as 4 unsigned character values, one per
+ * group, containing the index of the start of that group within the node.
+ * So to calculate the transition index within the node for a given input
+ * byte value: input_byte - ((uint8_t *)&transition)[4 + input_byte / 64].
+ */
+
+/*
  * Structure of a node is a set of ptrs and each ptr has a bit map
  * of values associated with this transition.
  */
-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread
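
To make the transition layout described above concrete, here is a small
decode sketch. The helper names and mask macros below are illustrative only,
written to mirror the comment (including its little-endian byte casts); the
library's own equivalents are the RTE_ACL_NODE_* macros in acl.h, and this
code is not part of the librte_acl API.

#include <stdint.h>

#define NODE_TYPE_SHIFT	29				/* node_addr : 29 */
#define NODE_TYPE_MASK	(7U << NODE_TYPE_SHIFT)		/* node_type : 3 */
#define NODE_ADDR_MASK	((1U << NODE_TYPE_SHIFT) - 1)

/* node_type bits from the lower 32 bits of a transition
 * (compare against RTE_ACL_NODE_DFA/SINGLE/QRANGE/MATCH). */
static inline uint32_t
node_type(uint64_t transition)
{
	return (uint32_t)transition & NODE_TYPE_MASK;
}

/* node_addr: start of the node within the transitions array,
 * or, for MATCH nodes, an index into the match results array. */
static inline uint32_t
node_addr(uint64_t transition)
{
	return (uint32_t)transition & NODE_ADDR_MASK;
}

/* For a DFA node: transition index within the node for an input byte.
 * Bytes 4..7 of the transition hold the per-group64 base offsets. */
static inline uint32_t
dfa_index(uint64_t transition, uint8_t input_byte)
{
	const uint8_t *base = (const uint8_t *)&transition;

	return input_byte - base[4 + input_byte / 64];
}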

* Re: [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (17 preceding siblings ...)
  2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 18/18] libte_acl: add some comments about ACL internal layout Konstantin Ananyev
@ 2015-01-22 18:54 ` Neil Horman
  2015-01-22 22:10   ` Ananyev, Konstantin
  2015-01-27 14:03 ` Neil Horman
  2015-01-30  3:12 ` Fu, JingguoX
  20 siblings, 1 reply; 27+ messages in thread
From: Neil Horman @ 2015-01-22 18:54 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev

On Tue, Jan 20, 2015 at 06:40:49PM +0000, Konstantin Ananyev wrote:
> v3 changes:
> Applied review comments from Thomas:
> - fix spelling errors reported by codespell.
> - split last patch into two:
>     first to remove unused macros,
>     second to add some comments about ACL internal layout.
> 
> v2 changes:
> - When build with the compilers that don't support AVX2 instructions,
> make rte_acl_classify_avx2() do nothing and return an error.
> - Remove unneeded 'ifdef __AVX2__' in acl_run_avx2.*.
> - Reorder order of patches in the set, to keep RTE_LIBRTE_ACL_STANDALONE=y
> always buildable.
> 
> This patch series contain several fixes and enhancements for ACL library.
> See complete list below.
> Two main changes that are externally visible:
> - Introduce new classify method:  RTE_ACL_CLASSIFY_AVX2.
> It uses AVX2 instructions and 256 bit wide data types
> to perform internal trie traversal.
> That helps to increase classify() throughput.
> This method is selected as default one on CPUs that supports AVX2.
> - Introduce new field in the build config structure: max_size.
> It specifies maximum size that internal RT structure for given context
> can reach.
> The purpose of that is to allow user to decide about space/performance trade-off
> (faster classify() vs less space for RT internal structures)
> for each given set of rules.
> 
> Konstantin Ananyev (18):
>   fix fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y
>   app/test: few small fixes fot test_acl.c
>   librte_acl: make data_indexes long enough to survive idle transitions.
>   librte_acl: remove build phase heuristsic with negative performance
>     effect.
>   librte_acl: fix a bug at build phase that can cause matches beeing
>     overwirtten.
>   librte_acl: introduce DFA nodes compression (group64) for identical
>     entries.
>   librte_acl: build/gen phase - simplify the way match nodes are
>     allocated.
>   librte_acl: make scalar RT code to be more similar to vector one.
>   librte_acl: a bit of RT code deduplication.
>   EAL: introduce rte_ymm and relatives in rte_common_vect.h.
>   librte_acl: add AVX2 as new rte_acl_classify() method
>   test-acl: add ability to manually select RT method.
>   librte_acl: Remove search_sse_2 and relatives.
>   libter_acl: move lo/hi dwords shuffle out from calc_addr
>   libte_acl: make calc_addr a define to deduplicate the code.
>   libte_acl: introduce max_size into rte_acl_config.
>   libte_acl: remove unused macros.
>   libte_acl: add some comments about ACL internal layout.
> 
>  app/test-acl/main.c                             | 126 +++--
>  app/test/test_acl.c                             |   8 +-
>  examples/l3fwd-acl/main.c                       |   3 +-
>  examples/l3fwd/main.c                           |   2 +-
>  lib/librte_acl/Makefile                         |  18 +
>  lib/librte_acl/acl.h                            |  58 ++-
>  lib/librte_acl/acl_bld.c                        | 392 +++++++---------
>  lib/librte_acl/acl_gen.c                        | 268 +++++++----
>  lib/librte_acl/acl_run.h                        |   7 +-
>  lib/librte_acl/acl_run_avx2.c                   |  54 +++
>  lib/librte_acl/acl_run_avx2.h                   | 284 ++++++++++++
>  lib/librte_acl/acl_run_scalar.c                 |  65 ++-
>  lib/librte_acl/acl_run_sse.c                    | 585 +-----------------------
>  lib/librte_acl/acl_run_sse.h                    | 357 +++++++++++++++
>  lib/librte_acl/acl_vect.h                       | 132 +++---
>  lib/librte_acl/rte_acl.c                        |  47 +-
>  lib/librte_acl/rte_acl.h                        |   4 +
>  lib/librte_acl/rte_acl_osdep_alone.h            |  47 +-
>  lib/librte_eal/common/include/rte_common_vect.h |  39 +-
>  lib/librte_lpm/rte_lpm.h                        |   2 +-
>  20 files changed, 1444 insertions(+), 1054 deletions(-)
>  create mode 100644 lib/librte_acl/acl_run_avx2.c
>  create mode 100644 lib/librte_acl/acl_run_avx2.h
>  create mode 100644 lib/librte_acl/acl_run_sse.h
> 
> -- 
> 1.8.5.3
> 
> 
I'm sorry I've not looked at this yet Konstantin, I'm trying to get to it soon
Neil

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements.
  2015-01-22 18:54 ` [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Neil Horman
@ 2015-01-22 22:10   ` Ananyev, Konstantin
  0 siblings, 0 replies; 27+ messages in thread
From: Ananyev, Konstantin @ 2015-01-22 22:10 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev



> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Thursday, January 22, 2015 6:55 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements.
> 
> On Tue, Jan 20, 2015 at 06:40:49PM +0000, Konstantin Ananyev wrote:
> > v3 changes:
> > Applied review comments from Thomas:
> > - fix spelling errors reported by codespell.
> > - split last patch into two:
> >     first to remove unused macros,
> >     second to add some comments about ACL internal layout.
> >
> > v2 changes:
> > - When build with the compilers that don't support AVX2 instructions,
> > make rte_acl_classify_avx2() do nothing and return an error.
> > - Remove unneeded 'ifdef __AVX2__' in acl_run_avx2.*.
> > - Reorder order of patches in the set, to keep RTE_LIBRTE_ACL_STANDALONE=y
> > always buildable.
> >
> > This patch series contain several fixes and enhancements for ACL library.
> > See complete list below.
> > Two main changes that are externally visible:
> > - Introduce new classify method:  RTE_ACL_CLASSIFY_AVX2.
> > It uses AVX2 instructions and 256 bit wide data types
> > to perform internal trie traversal.
> > That helps to increase classify() throughput.
> > This method is selected as default one on CPUs that supports AVX2.
> > - Introduce new field in the build config structure: max_size.
> > It specifies maximum size that internal RT structure for given context
> > can reach.
> > The purpose of that is to allow user to decide about space/performance trade-off
> > (faster classify() vs less space for RT internal structures)
> > for each given set of rules.
> >
> > Konstantin Ananyev (18):
> >   fix fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y
> >   app/test: few small fixes fot test_acl.c
> >   librte_acl: make data_indexes long enough to survive idle transitions.
> >   librte_acl: remove build phase heuristsic with negative performance
> >     effect.
> >   librte_acl: fix a bug at build phase that can cause matches beeing
> >     overwirtten.
> >   librte_acl: introduce DFA nodes compression (group64) for identical
> >     entries.
> >   librte_acl: build/gen phase - simplify the way match nodes are
> >     allocated.
> >   librte_acl: make scalar RT code to be more similar to vector one.
> >   librte_acl: a bit of RT code deduplication.
> >   EAL: introduce rte_ymm and relatives in rte_common_vect.h.
> >   librte_acl: add AVX2 as new rte_acl_classify() method
> >   test-acl: add ability to manually select RT method.
> >   librte_acl: Remove search_sse_2 and relatives.
> >   libter_acl: move lo/hi dwords shuffle out from calc_addr
> >   libte_acl: make calc_addr a define to deduplicate the code.
> >   libte_acl: introduce max_size into rte_acl_config.
> >   libte_acl: remove unused macros.
> >   libte_acl: add some comments about ACL internal layout.
> >
> >  app/test-acl/main.c                             | 126 +++--
> >  app/test/test_acl.c                             |   8 +-
> >  examples/l3fwd-acl/main.c                       |   3 +-
> >  examples/l3fwd/main.c                           |   2 +-
> >  lib/librte_acl/Makefile                         |  18 +
> >  lib/librte_acl/acl.h                            |  58 ++-
> >  lib/librte_acl/acl_bld.c                        | 392 +++++++---------
> >  lib/librte_acl/acl_gen.c                        | 268 +++++++----
> >  lib/librte_acl/acl_run.h                        |   7 +-
> >  lib/librte_acl/acl_run_avx2.c                   |  54 +++
> >  lib/librte_acl/acl_run_avx2.h                   | 284 ++++++++++++
> >  lib/librte_acl/acl_run_scalar.c                 |  65 ++-
> >  lib/librte_acl/acl_run_sse.c                    | 585 +-----------------------
> >  lib/librte_acl/acl_run_sse.h                    | 357 +++++++++++++++
> >  lib/librte_acl/acl_vect.h                       | 132 +++---
> >  lib/librte_acl/rte_acl.c                        |  47 +-
> >  lib/librte_acl/rte_acl.h                        |   4 +
> >  lib/librte_acl/rte_acl_osdep_alone.h            |  47 +-
> >  lib/librte_eal/common/include/rte_common_vect.h |  39 +-
> >  lib/librte_lpm/rte_lpm.h                        |   2 +-
> >  20 files changed, 1444 insertions(+), 1054 deletions(-)
> >  create mode 100644 lib/librte_acl/acl_run_avx2.c
> >  create mode 100644 lib/librte_acl/acl_run_avx2.h
> >  create mode 100644 lib/librte_acl/acl_run_sse.h
> >
> > --
> > 1.8.5.3
> >
> >
> I'm sorry I've not looked at this yet Konstantin, I'm trying to get to it soon
> Neil

No worries, and thanks for your reviews :)
Konstantin

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches beeing overwirtten.
  2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches beeing overwirtten Konstantin Ananyev
@ 2015-01-25 17:34   ` Neil Horman
  2015-01-25 22:40     ` Ananyev, Konstantin
  0 siblings, 1 reply; 27+ messages in thread
From: Neil Horman @ 2015-01-25 17:34 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev

On Tue, Jan 20, 2015 at 06:40:54PM +0000, Konstantin Ananyev wrote:
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_acl/acl_bld.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
> index 8bf4a54..22f7934 100644
> --- a/lib/librte_acl/acl_bld.c
> +++ b/lib/librte_acl/acl_bld.c
> @@ -1907,7 +1907,7 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
>  				bcx.num_tries, bcx.cfg.num_categories,
>  				RTE_ACL_MAX_FIELDS * RTE_DIM(bcx.tries) *
>  				sizeof(ctx->data_indexes[0]),
> -				bcx.num_build_rules);
> +				bcx.num_build_rules + 1);
>  		if (rc == 0) {
>  
>  			/* set data indexes. */
> -- 
> 1.8.5.3
> 
> 
Shouldn't you add to num_build_rules inside rte_acl_gen?  That way other future
users of the function don't have to remember to do so.
Neil

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches beeing overwirtten.
  2015-01-25 17:34   ` Neil Horman
@ 2015-01-25 22:40     ` Ananyev, Konstantin
  2015-01-26 12:08       ` Neil Horman
  0 siblings, 1 reply; 27+ messages in thread
From: Ananyev, Konstantin @ 2015-01-25 22:40 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

Hi Neil,

> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Sunday, January 25, 2015 5:35 PM
> To: Ananyev, Konstantin
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches beeing overwirtten.
> 
> On Tue, Jan 20, 2015 at 06:40:54PM +0000, Konstantin Ananyev wrote:
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > ---
> >  lib/librte_acl/acl_bld.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
> > index 8bf4a54..22f7934 100644
> > --- a/lib/librte_acl/acl_bld.c
> > +++ b/lib/librte_acl/acl_bld.c
> > @@ -1907,7 +1907,7 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
> >  				bcx.num_tries, bcx.cfg.num_categories,
> >  				RTE_ACL_MAX_FIELDS * RTE_DIM(bcx.tries) *
> >  				sizeof(ctx->data_indexes[0]),
> > -				bcx.num_build_rules);
> > +				bcx.num_build_rules + 1);
> >  		if (rc == 0) {
> >
> >  			/* set data indexes. */
> > --
> > 1.8.5.3
> >
> >
> Shouldn't you add to num_build_rules inside rte_acl_gen?  That way other future
> users of the function don't have to remember to do so.

In that patch, I just fix the bug to stop generating invalid tries for some corner cases.
In a later patch in that set, I did something similar to what you are suggesting here -
make rte_acl_gen() allocate indexes for all match nodes too (as it already does for all other nodes).
See [PATCH v3 07/18] librte_acl: build/gen phase - simplify the way match nodes are allocated.

Konstantin

> Neil

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches beeing overwirtten.
  2015-01-25 22:40     ` Ananyev, Konstantin
@ 2015-01-26 12:08       ` Neil Horman
  0 siblings, 0 replies; 27+ messages in thread
From: Neil Horman @ 2015-01-26 12:08 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Sun, Jan 25, 2015 at 10:40:23PM +0000, Ananyev, Konstantin wrote:
> Hi Neil,
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Sunday, January 25, 2015 5:35 PM
> > To: Ananyev, Konstantin
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches beeing overwirtten.
> > 
> > On Tue, Jan 20, 2015 at 06:40:54PM +0000, Konstantin Ananyev wrote:
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > ---
> > >  lib/librte_acl/acl_bld.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
> > > index 8bf4a54..22f7934 100644
> > > --- a/lib/librte_acl/acl_bld.c
> > > +++ b/lib/librte_acl/acl_bld.c
> > > @@ -1907,7 +1907,7 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
> > >  				bcx.num_tries, bcx.cfg.num_categories,
> > >  				RTE_ACL_MAX_FIELDS * RTE_DIM(bcx.tries) *
> > >  				sizeof(ctx->data_indexes[0]),
> > > -				bcx.num_build_rules);
> > > +				bcx.num_build_rules + 1);
> > >  		if (rc == 0) {
> > >
> > >  			/* set data indexes. */
> > > --
> > > 1.8.5.3
> > >
> > >
> > Shouldn't you add to num_build_rules inside rte_acl_gen?  That way other future
> > users of the function don't have to remember to do so.
> 
> In that patch, I just fix the bug to stop generating invalid tries for some corner cases.
> In a later patch in that set, I did something similar to what you are suggesting here -
> make rte_acl_gen() allocate indexes for all match nodes too (as it already does for all other nodes).
> See [PATCH v3 07/18] librte_acl: build/gen phase - simplify the way match nodes are allocated.
> 
Ok, thank you
Neil

> Konstantin
> 
> > Neil
> 
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (18 preceding siblings ...)
  2015-01-22 18:54 ` [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Neil Horman
@ 2015-01-27 14:03 ` Neil Horman
  2015-01-28 16:14   ` Thomas Monjalon
  2015-01-30  3:12 ` Fu, JingguoX
  20 siblings, 1 reply; 27+ messages in thread
From: Neil Horman @ 2015-01-27 14:03 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev

On Tue, Jan 20, 2015 at 06:40:49PM +0000, Konstantin Ananyev wrote:
> v3 changes:
> Applied review comments from Thomas:
> - fix spelling errors reported by codespell.
> - split last patch into two:
>     first to remove unused macros,
>     second to add some comments about ACL internal layout.
> 
> v2 changes:
> - When build with the compilers that don't support AVX2 instructions,
> make rte_acl_classify_avx2() do nothing and return an error.
> - Remove unneeded 'ifdef __AVX2__' in acl_run_avx2.*.
> - Reorder order of patches in the set, to keep RTE_LIBRTE_ACL_STANDALONE=y
> always buildable.
> 
> This patch series contain several fixes and enhancements for ACL library.
> See complete list below.
> Two main changes that are externally visible:
> - Introduce new classify method:  RTE_ACL_CLASSIFY_AVX2.
> It uses AVX2 instructions and 256 bit wide data types
> to perform internal trie traversal.
> That helps to increase classify() throughput.
> This method is selected as default one on CPUs that supports AVX2.
> - Introduce new field in the build config structure: max_size.
> It specifies maximum size that internal RT structure for given context
> can reach.
> The purpose of that is to allow user to decide about space/performance trade-off
> (faster classify() vs less space for RT internal structures)
> for each given set of rules.
> 
> Konstantin Ananyev (18):
>   fix fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y
>   app/test: few small fixes fot test_acl.c
>   librte_acl: make data_indexes long enough to survive idle transitions.
>   librte_acl: remove build phase heuristsic with negative performance
>     effect.
>   librte_acl: fix a bug at build phase that can cause matches beeing
>     overwirtten.
>   librte_acl: introduce DFA nodes compression (group64) for identical
>     entries.
>   librte_acl: build/gen phase - simplify the way match nodes are
>     allocated.
>   librte_acl: make scalar RT code to be more similar to vector one.
>   librte_acl: a bit of RT code deduplication.
>   EAL: introduce rte_ymm and relatives in rte_common_vect.h.
>   librte_acl: add AVX2 as new rte_acl_classify() method
>   test-acl: add ability to manually select RT method.
>   librte_acl: Remove search_sse_2 and relatives.
>   libter_acl: move lo/hi dwords shuffle out from calc_addr
>   libte_acl: make calc_addr a define to deduplicate the code.
>   libte_acl: introduce max_size into rte_acl_config.
>   libte_acl: remove unused macros.
>   libte_acl: add some comments about ACL internal layout.
> 
>  app/test-acl/main.c                             | 126 +++--
>  app/test/test_acl.c                             |   8 +-
>  examples/l3fwd-acl/main.c                       |   3 +-
>  examples/l3fwd/main.c                           |   2 +-
>  lib/librte_acl/Makefile                         |  18 +
>  lib/librte_acl/acl.h                            |  58 ++-
>  lib/librte_acl/acl_bld.c                        | 392 +++++++---------
>  lib/librte_acl/acl_gen.c                        | 268 +++++++----
>  lib/librte_acl/acl_run.h                        |   7 +-
>  lib/librte_acl/acl_run_avx2.c                   |  54 +++
>  lib/librte_acl/acl_run_avx2.h                   | 284 ++++++++++++
>  lib/librte_acl/acl_run_scalar.c                 |  65 ++-
>  lib/librte_acl/acl_run_sse.c                    | 585 +-----------------------
>  lib/librte_acl/acl_run_sse.h                    | 357 +++++++++++++++
>  lib/librte_acl/acl_vect.h                       | 132 +++---
>  lib/librte_acl/rte_acl.c                        |  47 +-
>  lib/librte_acl/rte_acl.h                        |   4 +
>  lib/librte_acl/rte_acl_osdep_alone.h            |  47 +-
>  lib/librte_eal/common/include/rte_common_vect.h |  39 +-
>  lib/librte_lpm/rte_lpm.h                        |   2 +-
>  20 files changed, 1444 insertions(+), 1054 deletions(-)
>  create mode 100644 lib/librte_acl/acl_run_avx2.c
>  create mode 100644 lib/librte_acl/acl_run_avx2.h
>  create mode 100644 lib/librte_acl/acl_run_sse.h
> 
> -- 
> 1.8.5.3
> 
> 
For the series
Acked-by: Neil Horman <nhorman@tuxdriver.com>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements.
  2015-01-27 14:03 ` Neil Horman
@ 2015-01-28 16:14   ` Thomas Monjalon
  0 siblings, 0 replies; 27+ messages in thread
From: Thomas Monjalon @ 2015-01-28 16:14 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev

> > v3 changes:
> > Applied review comments from Thomas:
> > - fix spelling errors reported by codespell.
> > - split last patch into two:
> >     first to remove unused macros,
> >     second to add some comments about ACL internal layout.
> > 
> > v2 changes:
> > - When build with the compilers that don't support AVX2 instructions,
> > make rte_acl_classify_avx2() do nothing and return an error.
> > - Remove unneeded 'ifdef __AVX2__' in acl_run_avx2.*.
> > - Reorder order of patches in the set, to keep RTE_LIBRTE_ACL_STANDALONE=y
> > always buildable.
> > 
> > This patch series contain several fixes and enhancements for ACL library.
> > See complete list below.
> > Two main changes that are externally visible:
> > - Introduce new classify method:  RTE_ACL_CLASSIFY_AVX2.
> > It uses AVX2 instructions and 256 bit wide data types
> > to perform internal trie traversal.
> > That helps to increase classify() throughput.
> > This method is selected as default one on CPUs that supports AVX2.
> > - Introduce new field in the build config structure: max_size.
> > It specifies maximum size that internal RT structure for given context
> > can reach.
> > The purpose of that is to allow user to decide about space/performance trade-off
> > (faster classify() vs less space for RT internal structures)
> > for each given set of rules.
> > 
> > Konstantin Ananyev (18):
> >   fix fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y
> >   app/test: few small fixes fot test_acl.c
> >   librte_acl: make data_indexes long enough to survive idle transitions.
> >   librte_acl: remove build phase heuristsic with negative performance
> >     effect.
> >   librte_acl: fix a bug at build phase that can cause matches beeing
> >     overwirtten.
> >   librte_acl: introduce DFA nodes compression (group64) for identical
> >     entries.
> >   librte_acl: build/gen phase - simplify the way match nodes are
> >     allocated.
> >   librte_acl: make scalar RT code to be more similar to vector one.
> >   librte_acl: a bit of RT code deduplication.
> >   EAL: introduce rte_ymm and relatives in rte_common_vect.h.
> >   librte_acl: add AVX2 as new rte_acl_classify() method
> >   test-acl: add ability to manually select RT method.
> >   librte_acl: Remove search_sse_2 and relatives.
> >   libter_acl: move lo/hi dwords shuffle out from calc_addr
> >   libte_acl: make calc_addr a define to deduplicate the code.
> >   libte_acl: introduce max_size into rte_acl_config.
> >   libte_acl: remove unused macros.
> >   libte_acl: add some comments about ACL internal layout.
> > 
> For the series
> Acked-by: Neil Horman <nhorman@tuxdriver.com>

Applied

Thanks for the big work
-- 
Thomas

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements.
  2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
                   ` (19 preceding siblings ...)
  2015-01-27 14:03 ` Neil Horman
@ 2015-01-30  3:12 ` Fu, JingguoX
  20 siblings, 0 replies; 27+ messages in thread
From: Fu, JingguoX @ 2015-01-30  3:12 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev

Tested-by: Jingguo Fu <jingguox.fu@intel.com>

- Tested Commit: 17f520d2cff8d69962824f28810f36e949a7184d
- OS: Ubuntu14.04 3.13.0-24-generic
- GCC: gcc version 4.8.2
- CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
- NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)
- Default x86_64-native-linuxapp-gcc configuration
- Total 5 cases, 5 passed, 0 failed


- Case: l3fwdacl_acl_rule
  Description:
        l3fwd Access Control match ACL rule test
  Command / instruction:
        Add ACL rules:
            echo '' > /root/rule_ipv4.db
            echo 'R0.0.0.0/0 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv4.db
            echo '' > /root/rule_ipv6.db
            echo 'R0:0:0:0:0:0:0:0/0 0:0:0:0:0:0:0:0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv6.db
            echo '' > /root/rule_ipv4.db
            echo @200.10.0.1/32 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00  >> /root/rule_ipv4.db
            echo R0.0.0.0/0 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00 1 >> /root/rule_ipv4.db
        Start l3fwd-acl with rule_ipv4 and rule_ipv6 config
            # ./examples/l3fwd-acl/build/l3fwd-acl -c 0x3c1e03c1e -n 4 -- -p 0x3 --config="(0,0,2),(1,0,3)" --rule_ipv4="/root/rule_ipv4.db" --rule_ipv6="/root/rule_ipv6.db"
        Send packets by Scapy according to ACL rule
  Expected result:
        Application can filter packets by ACL rules
  Test Result: PASSED

- Case: l3fwdacl_exact_route
  Description:
        l3fwd Access Control match Exact route rule test
  Command / instruction:
        Add ACL rules:
            echo '' > /root/rule_ipv4.db
            echo 'R0.0.0.0/0 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv4.db
            echo '' > /root/rule_ipv6.db
            echo 'R0:0:0:0:0:0:0:0/0 0:0:0:0:0:0:0:0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv6.db
            echo '' > /root/rule_ipv4.db
            echo R200.10.0.1/32 100.10.0.1/32 11 : 11 101 : 101 0x06/0xff 0 >> /root/rule_ipv4.db
            echo R200.20.0.1/32 100.20.0.1/32 12 : 12 102 : 102 0x06/0xff 1 >> /root/rule_ipv4.db
        Start l3fwd-acl with rule_ipv4 and rule_ipv6 config
            # ./examples/l3fwd-acl/build/l3fwd-acl -c 0x3c1e03c1e -n 4 -- -p 0x3 --config="(0,0,2),(1,0,3)" --rule_ipv4="/root/rule_ipv4.db" --rule_ipv6="/root/rule_ipv6.db"
        Send packets by Scapy according to route rule
  Expected result:
        ACL rule can filter packets
  Test Result: PASSED
  
- Case: l3fwdacl_invalid
  Description:
        l3fwd Access Control handle Invalid rule test
  Command / instruction:
        Add ACL rules:
            echo '' > /root/rule_ipv4.db
            echo 'R0.0.0.0/0 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv4.db
            echo '' > /root/rule_ipv6.db
            echo 'R0:0:0:0:0:0:0:0/0 0:0:0:0:0:0:0:0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv6.db
            echo '' > /root/rule_ipv4.db
            echo R0.0.0.0/0 0.0.0.0/0 12 : 11 0 : 65535 0x00/0x00 0 >> /root/rule_ipv4.db
            echo R0.0.0.0/0 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00 1 >> /root/rule_ipv4.db
        Start l3fwd-acl with rule_ipv4 and rule_ipv6 config
            # ./examples/l3fwd-acl/build/l3fwd-acl -c 0x3c1e03c1e -n 4 -- -p 0x3 --config="(0,0,2),(1,0,3)" --rule_ipv4="/root/rule_ipv4.db" --rule_ipv6="/root/rule_ipv6.db"
        Send packets by Scapy according to invalid rule
  Expected result:
        ACL rule can filter packets
  Test Result: PASSED

- Case: l3fwdacl_lpm_route
  Description:
        l3fwd Access Control match Lpm route rule test
  Command / instruction:
        Add ACL rules:
            echo '' > /root/rule_ipv4.db
            echo 'R0.0.0.0/0 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv4.db
            echo '' > /root/rule_ipv6.db
            echo 'R0:0:0:0:0:0:0:0/0 0:0:0:0:0:0:0:0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv6.db
            echo '' > /root/rule_ipv4.db
            echo R0.0.0.0/0 1.1.1.0/24 0 : 65535 0 : 65535 0x00/0x00 0 >> /root/rule_ipv4.db
            echo R0.0.0.0/0 2.1.1.0/24 0 : 65535 0 : 65535 0x00/0x00 1 >> /root/rule_ipv4.db
        Start l3fwd-acl with rule_ipv4 and rule_ipv6 config
            # ./examples/l3fwd-acl/build/l3fwd-acl -c 0x3c1e03c1e -n 4 -- -p 0x3 --config="(0,0,2),(1,0,3)" --rule_ipv4="/root/rule_ipv4.db" --rule_ipv6="/root/rule_ipv6.db"
        Send packets by Scapy according to lpm route rule
  Expected result:
        ACL rule can filter packets
  Test Result: PASSED

- Case: l3fwdacl_scalar
  Description:
        l3fwd Access Control match with Scalar function test
  Command / instruction:
        Add ACL rules:
            echo '' > /root/rule_ipv4.db
            echo 'R0.0.0.0/0 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv4.db
            echo '' > /root/rule_ipv6.db
            echo 'R0:0:0:0:0:0:0:0/0 0:0:0:0:0:0:0:0/0 0 : 65535 0 : 65535 0x00/0x00 1' >> /root/rule_ipv6.db
            echo '' > /root/rule_ipv4.db
            echo @200.10.0.1/32 100.10.0.1/32 11 : 11 101 : 101 0x06/0xff  >> /root/rule_ipv4.db
            echo R0.0.0.0/0 0.0.0.0/0 0 : 65535 0 : 65535 0x00/0x00 1 >> /root/rule_ipv4.db
        Start l3fwd-acl with rule_ipv4 and rule_ipv6 config
            # ./examples/l3fwd-acl/build/l3fwd-acl -c 0x3c1e03c1e -n 4 -- -p 0x3 --config="(0,0,2),(1,0,3)" --rule_ipv4="/root/rule_ipv4.db" --rule_ipv6="/root/rule_ipv6.db"
        Send packets by Scapy according to ACL rule
  Expected result:
        ACL rule can filter packets
  Test Result: PASSED


-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Konstantin Ananyev
Sent: Wednesday, January 21, 2015 02:41
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements.

v3 changes:
Applied review comments from Thomas:
- fix spelling errors reported by codespell.
- split last patch into two:
    first to remove unused macros,
    second to add some comments about ACL internal layout.

v2 changes:
- When build with the compilers that don't support AVX2 instructions,
make rte_acl_classify_avx2() do nothing and return an error.
- Remove unneeded 'ifdef __AVX2__' in acl_run_avx2.*.
- Reorder order of patches in the set, to keep RTE_LIBRTE_ACL_STANDALONE=y
always buildable.

This patch series contain several fixes and enhancements for ACL library.
See complete list below.
Two main changes that are externally visible:
- Introduce new classify method:  RTE_ACL_CLASSIFY_AVX2.
It uses AVX2 instructions and 256 bit wide data types
to perform internal trie traversal.
That helps to increase classify() throughput.
This method is selected as default one on CPUs that supports AVX2.
- Introduce new field in the build config structure: max_size.
It specifies maximum size that internal RT structure for given context
can reach.
The purpose of that is to allow user to decide about space/performance trade-off
(faster classify() vs less space for RT internal structures)
for each given set of rules.

Konstantin Ananyev (18):
  fix fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y
  app/test: few small fixes fot test_acl.c
  librte_acl: make data_indexes long enough to survive idle transitions.
  librte_acl: remove build phase heuristsic with negative performance
    effect.
  librte_acl: fix a bug at build phase that can cause matches beeing
    overwirtten.
  librte_acl: introduce DFA nodes compression (group64) for identical
    entries.
  librte_acl: build/gen phase - simplify the way match nodes are
    allocated.
  librte_acl: make scalar RT code to be more similar to vector one.
  librte_acl: a bit of RT code deduplication.
  EAL: introduce rte_ymm and relatives in rte_common_vect.h.
  librte_acl: add AVX2 as new rte_acl_classify() method
  test-acl: add ability to manually select RT method.
  librte_acl: Remove search_sse_2 and relatives.
  libter_acl: move lo/hi dwords shuffle out from calc_addr
  libte_acl: make calc_addr a define to deduplicate the code.
  libte_acl: introduce max_size into rte_acl_config.
  libte_acl: remove unused macros.
  libte_acl: add some comments about ACL internal layout.

 app/test-acl/main.c                             | 126 +++--
 app/test/test_acl.c                             |   8 +-
 examples/l3fwd-acl/main.c                       |   3 +-
 examples/l3fwd/main.c                           |   2 +-
 lib/librte_acl/Makefile                         |  18 +
 lib/librte_acl/acl.h                            |  58 ++-
 lib/librte_acl/acl_bld.c                        | 392 +++++++---------
 lib/librte_acl/acl_gen.c                        | 268 +++++++----
 lib/librte_acl/acl_run.h                        |   7 +-
 lib/librte_acl/acl_run_avx2.c                   |  54 +++
 lib/librte_acl/acl_run_avx2.h                   | 284 ++++++++++++
 lib/librte_acl/acl_run_scalar.c                 |  65 ++-
 lib/librte_acl/acl_run_sse.c                    | 585 +-----------------------
 lib/librte_acl/acl_run_sse.h                    | 357 +++++++++++++++
 lib/librte_acl/acl_vect.h                       | 132 +++---
 lib/librte_acl/rte_acl.c                        |  47 +-
 lib/librte_acl/rte_acl.h                        |   4 +
 lib/librte_acl/rte_acl_osdep_alone.h            |  47 +-
 lib/librte_eal/common/include/rte_common_vect.h |  39 +-
 lib/librte_lpm/rte_lpm.h                        |   2 +-
 20 files changed, 1444 insertions(+), 1054 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx2.c
 create mode 100644 lib/librte_acl/acl_run_avx2.h
 create mode 100644 lib/librte_acl/acl_run_sse.h

-- 
1.8.5.3

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2015-01-30  3:12 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-20 18:40 [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Konstantin Ananyev
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 01/18] fix fix compilation issues with RTE_LIBRTE_ACL_STANDALONE=y Konstantin Ananyev
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 02/18] app/test: few small fixes fot test_acl.c Konstantin Ananyev
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 03/18] librte_acl: make data_indexes long enough to survive idle transitions Konstantin Ananyev
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 04/18] librte_acl: remove build phase heuristsic with negative performance effect Konstantin Ananyev
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 05/18] librte_acl: fix a bug at build phase that can cause matches beeing overwirtten Konstantin Ananyev
2015-01-25 17:34   ` Neil Horman
2015-01-25 22:40     ` Ananyev, Konstantin
2015-01-26 12:08       ` Neil Horman
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 06/18] librte_acl: introduce DFA nodes compression (group64) for identical entries Konstantin Ananyev
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 07/18] librte_acl: build/gen phase - simplify the way match nodes are allocated Konstantin Ananyev
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 08/18] librte_acl: make scalar RT code to be more similar to vector one Konstantin Ananyev
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 09/18] librte_acl: a bit of RT code deduplication Konstantin Ananyev
2015-01-20 18:40 ` [dpdk-dev] [PATCH v3 10/18] EAL: introduce rte_ymm and relatives in rte_common_vect.h Konstantin Ananyev
2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 11/18] librte_acl: add AVX2 as new rte_acl_classify() method Konstantin Ananyev
2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 12/18] test-acl: add ability to manually select RT method Konstantin Ananyev
2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 13/18] librte_acl: Remove search_sse_2 and relatives Konstantin Ananyev
2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 14/18] libter_acl: move lo/hi dwords shuffle out from calc_addr Konstantin Ananyev
2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 15/18] libte_acl: make calc_addr a define to deduplicate the code Konstantin Ananyev
2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 16/18] libte_acl: introduce max_size into rte_acl_config Konstantin Ananyev
2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 17/18] libte_acl: remove unused macros Konstantin Ananyev
2015-01-20 18:41 ` [dpdk-dev] [PATCH v3 18/18] libte_acl: add some comments about ACL internal layout Konstantin Ananyev
2015-01-22 18:54 ` [dpdk-dev] [PATCH v3 00/18] ACL: New AVX2 classify method and several other enhancements Neil Horman
2015-01-22 22:10   ` Ananyev, Konstantin
2015-01-27 14:03 ` Neil Horman
2015-01-28 16:14   ` Thomas Monjalon
2015-01-30  3:12 ` Fu, JingguoX

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).