DPDK patches and discussions
* [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method
@ 2020-08-07 16:28 Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
                   ` (7 more replies)
  0 siblings, 8 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

This patch series introduces an AVX512-specific classify
implementation for the ACL library.
It contains two code paths:
the first uses mostly 256-bit instructions/registers and can
process up to 16 flows in parallel;
the second uses 512-bit instructions/registers in the majority of
places and can process up to 32 flows in parallel.
The code-path selection is done internally, based on the input
burst size, and is completely opaque to the user.
On my SKX box, test-acl shows a ~20-65% improvement
(depending on rule-set and input burst size)
when switching from the AVX2 to the AVX512 classify algorithm.

Note that this change introduces a formal ABI incompatibility
with previous versions of the ACL library.
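
For reference, a minimal usage sketch (names such as BURST_SIZE and
'num' are placeholders; the classify API itself is unchanged, only a
new algorithm enum value is added):

	struct rte_acl_ctx *ctx;          /* already built ACL context */
	const uint8_t *data[BURST_SIZE];  /* input pointers */
	uint32_t results[BURST_SIZE];
	uint32_t num;                     /* burst size */

	/* explicitly request the AVX512 method (optional) */
	rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX512);

	/* the 256-bit vs 512-bit code path is then chosen internally,
	 * based on 'num' */
	rte_acl_classify(ctx, data, results, num, 1);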

TODO list:
- Deduplicate 8/16 code paths
- Update default algorithm selection
- Update docs

This patch series depends on:
https://patches.dpdk.org/patch/70429/
being applied first.

Konstantin Ananyev (7):
  acl: fix x86 build when compiler doesn't support AVX2
  app/acl: few small improvements
  acl: remove of unused enum value
  acl: add infrastructure to support AVX512 classify
  app/acl: add AVX512 classify support
  acl: introduce AVX512 classify implementation
  acl: enhance AVX512 classify implementation

 app/test-acl/main.c                |  19 +-
 config/x86/meson.build             |   3 +-
 lib/librte_acl/Makefile            |  26 ++
 lib/librte_acl/acl.h               |   4 +
 lib/librte_acl/acl_run_avx512.c    | 140 +++++++
 lib/librte_acl/acl_run_avx512x16.h | 635 +++++++++++++++++++++++++++++
 lib/librte_acl/acl_run_avx512x8.h  | 614 ++++++++++++++++++++++++++++
 lib/librte_acl/meson.build         |  39 ++
 lib/librte_acl/rte_acl.c           |  19 +-
 lib/librte_acl/rte_acl.h           |   2 +-
 10 files changed, 1493 insertions(+), 8 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 2/7] app/acl: few small improvements Konstantin Ananyev
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Right now the dummy version of rte_acl_classify_avx2() is defined
only when both x86 and AVX2 support are missing, though it should be
defined whenever AVX2 support is missing, regardless of architecture.

Fixes: e53ce4e41379 ("acl: remove use of weak functions")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 777ec4d34..715b02359 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,7 +16,6 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
-#ifndef RTE_ARCH_X86
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -33,6 +32,7 @@ rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
 }
 #endif
 
+#ifndef RTE_ARCH_X86
 int
 rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
 	__rte_unused const uint8_t **data,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH 20.11 2/7] app/acl: few small improvements
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 3/7] acl: remove of unused enum value Konstantin Ananyev
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

- enhance output to print extra stats
- use rte_rdtsc_precise() for cycle measurements
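
rte_rdtsc_precise() issues a memory barrier before reading the TSC,
which makes it better suited for this kind of measurement than plain
rte_rdtsc(). A minimal sketch of the measurement pattern used below
(variable names are illustrative only):

	uint64_t start, cycles;
	long double sec;

	start = rte_rdtsc_precise();
	/* ... run the classify iterations ... */
	cycles = rte_rdtsc_precise() - start;

	/* convert cycles to seconds using the timer frequency */
	sec = (long double)cycles / rte_get_timer_hz();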

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index 0a5dfb621..d9b65517c 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -862,9 +862,10 @@ search_ip5tuples(__rte_unused void *arg)
 {
 	uint64_t pkt, start, tm;
 	uint32_t i, lcore;
+	long double st;
 
 	lcore = rte_lcore_id();
-	start = rte_rdtsc();
+	start = rte_rdtsc_precise();
 	pkt = 0;
 
 	for (i = 0; i != config.iter_num; i++) {
@@ -872,12 +873,16 @@ search_ip5tuples(__rte_unused void *arg)
 			config.trace_step, config.alg.name);
 	}
 
-	tm = rte_rdtsc() - start;
+	tm = rte_rdtsc_precise() - start;
+
+	st = (long double)tm / rte_get_timer_hz();
 	dump_verbose(DUMP_NONE, stdout,
 		"%s  @lcore %u: %" PRIu32 " iterations, %" PRIu64 " pkts, %"
-		PRIu32 " categories, %" PRIu64 " cycles, %#Lf cycles/pkt\n",
-		__func__, lcore, i, pkt, config.run_categories,
-		tm, (pkt == 0) ? 0 : (long double)tm / pkt);
+		PRIu32 " categories, %" PRIu64 " cycles (%.2Lf sec), "
+		"%.2Lf cycles/pkt, %.2Lf pkt/sec\n",
+		__func__, lcore, i, pkt,
+		config.run_categories, tm, st,
+		(pkt == 0) ? 0 : (long double)tm / pkt, pkt / st);
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH 20.11 3/7] acl: remove of unused enum value
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 2/7] app/acl: few small improvements Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 4/7] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Remove the unused enum value RTE_ACL_CLASSIFY_NUM.
This value is not used inside DPDK, while its presence prevents
adding new classify algorithms without causing an ABI breakage.

Note that this change introduces a formal ABI incompatibility
with previous versions of the ACL library.
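
Any code that used RTE_ACL_CLASSIFY_NUM to size or iterate over an
algorithm table can instead keep a local table and use RTE_DIM(); a
hypothetical sketch (ctx is an existing ACL context):

	static const enum rte_acl_classify_alg algs[] = {
		RTE_ACL_CLASSIFY_SCALAR,
		RTE_ACL_CLASSIFY_SSE,
		RTE_ACL_CLASSIFY_AVX2,
	};
	uint32_t i;

	/* pick the first algorithm the library accepts for this ctx */
	for (i = 0; i != RTE_DIM(algs); i++) {
		if (rte_acl_set_ctx_classify(ctx, algs[i]) == 0)
			break;
	}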

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index aa22e70c6..b814423a6 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,7 +241,6 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
-	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
 };
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH 20.11 4/7] acl: add infrastructure to support AVX512 classify
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (2 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 3/7] acl: remove of unused enum value Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 5/7] app/acl: add AVX512 classify support Konstantin Ananyev
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add the necessary changes to support the new AVX512-specific ACL
classify algorithm:
 - changes in meson.build and Makefile to check that the build tools
   (compiler, assembler, etc.) properly support AVX512;
 - a dummy rte_acl_classify_avx512() for targets where the AVX512
   implementation cannot be supported.
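
Since on such targets the dummy simply returns -ENOTSUP, an
application that wants to opt in to the new method can first check
CPU support at run time; a minimal sketch using the existing
rte_cpu_get_flag_enabled() API (the flag set mirrors the build-time
checks and is illustrative only):

	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) &&
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) &&
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW))
		rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX512);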

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 config/x86/meson.build          |  3 ++-
 lib/librte_acl/Makefile         | 26 ++++++++++++++++++++++
 lib/librte_acl/acl.h            |  4 ++++
 lib/librte_acl/acl_run_avx512.c | 17 ++++++++++++++
 lib/librte_acl/meson.build      | 39 +++++++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c        | 17 ++++++++++++++
 lib/librte_acl/rte_acl.h        |  1 +
 7 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c

diff --git a/config/x86/meson.build b/config/x86/meson.build
index 6ec020ef6..c5626e914 100644
--- a/config/x86/meson.build
+++ b/config/x86/meson.build
@@ -23,7 +23,8 @@ foreach f:base_flags
 endforeach
 
 optional_flags = ['AES', 'PCLMUL',
-		'AVX', 'AVX2', 'AVX512F',
+		'AVX', 'AVX2',
+		'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW',
 		'RDRND', 'RDSEED']
 foreach f:optional_flags
 	if cc.get_define('__@0@__'.format(f), args: machine_args) == '1'
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index f4332b044..8bd469c2b 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -58,6 +58,32 @@ ifeq ($(CC_AVX2_SUPPORT), 1)
 	CFLAGS_rte_acl.o += -DCC_AVX2_SUPPORT
 endif
 
+# compile AVX512 version if:
+# we are building 64-bit binary AND binutils can generate proper code
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+
+	BINUTIL_OK=$(shell AS=as; \
+		$(RTE_SDK)/buildtools/binutils-avx512-check.sh && \
+		echo 1)
+	ifeq ($(BINUTIL_OK), 1)
+
+		# If the compiler supports AVX512 instructions,
+		# then add support for AVX512 classify method.
+
+		CC_AVX512_FLAGS=$(shell $(CC) \
+		-mavx512f -mavx512vl -mavx512cd -mavx512bw \
+		-dM -E - </dev/null 2>&1 | grep AVX512 | wc -l)
+		ifeq ($(CC_AVX512_FLAGS), 4)
+			CFLAGS_acl_run_avx512.o += -mavx512f
+			CFLAGS_acl_run_avx512.o += -mavx512vl
+			CFLAGS_acl_run_avx512.o += -mavx512cd
+			CFLAGS_acl_run_avx512.o += -mavx512bw
+			SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx512.c
+			CFLAGS_rte_acl.o += -DCC_AVX512_SUPPORT
+		endif
+	endif
+endif
+
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include += rte_acl.h
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 39d45a0c2..2022cf253 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -201,6 +201,10 @@ int
 rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
+int
+rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
 int
 rte_acl_classify_neon(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
new file mode 100644
index 000000000..67274989d
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "acl_run_sse.h"
+
+int
+rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (num >= MAX_SEARCHES_SSE8)
+		return search_sse_8(ctx, data, results, num, categories);
+	if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+
+	return rte_acl_classify_scalar(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/meson.build b/lib/librte_acl/meson.build
index d1e2c184c..b2fd61cad 100644
--- a/lib/librte_acl/meson.build
+++ b/lib/librte_acl/meson.build
@@ -27,6 +27,45 @@ if dpdk_conf.has('RTE_ARCH_X86')
 		cflags += '-DCC_AVX2_SUPPORT'
 	endif
 
+	# compile AVX512 version if:
+	# we are building 64-bit binary AND binutils can generate proper code
+
+	if dpdk_conf.has('RTE_ARCH_X86_64') and binutils_ok.returncode() == 0
+
+		# compile AVX512 version if either:
+		# a. we have AVX512 supported in minimum instruction set
+		#    baseline
+		# b. it's not minimum instruction set, but supported by
+		#    compiler
+		#
+		# in former case, just add avx512 C file to files list
+		# in latter case, compile c file to static lib, using correct
+		# compiler flags, and then have the .o file from static lib
+		# linked into main lib.
+
+		if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512VL') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512CD') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512BW')
+
+			sources += files('acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+
+		elif cc.has_multi_arguments('-mavx512f', '-mavx512vl',
+					'-mavx512cd', '-mavx512bw')
+
+			avx512_tmplib = static_library('avx512_tmp',
+				'acl_run_avx512.c',
+				dependencies: static_rte_eal,
+				c_args: cflags +
+					['-mavx512f', '-mavx512vl',
+					 '-mavx512cd', '-mavx512bw'])
+			objs += avx512_tmplib.extract_objects(
+					'acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+		endif
+	endif
+
 elif dpdk_conf.has('RTE_ARCH_ARM') or dpdk_conf.has('RTE_ARCH_ARM64')
 	cflags += '-flax-vector-conversions'
 	sources += files('acl_run_neon.c')
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 715b02359..71b4afb08 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,6 +16,22 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
+#ifndef CC_AVX512_SUPPORT
+/*
+ * If the compiler doesn't support AVX512 instructions,
+ * then the dummy one would be used instead for AVX512 classify method.
+ */
+int
+rte_acl_classify_avx512(__rte_unused const struct rte_acl_ctx *ctx,
+	__rte_unused const uint8_t **data,
+	__rte_unused uint32_t *results,
+	__rte_unused uint32_t num,
+	__rte_unused uint32_t categories)
+{
+	return -ENOTSUP;
+}
+#endif
+
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -77,6 +93,7 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
 	[RTE_ACL_CLASSIFY_NEON] = rte_acl_classify_neon,
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
+	[RTE_ACL_CLASSIFY_AVX512] = rte_acl_classify_avx512,
 };
 
 /* by default, use always available scalar code path. */
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index b814423a6..6f39042fc 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,6 +241,7 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
+	RTE_ACL_CLASSIFY_AVX512 = 6,    /**< requires AVX512 support. */
 };
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH 20.11 5/7] app/acl: add AVX512 classify support
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (3 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 4/7] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 6/7] acl: introduce AVX512 classify implementation Konstantin Ananyev
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add ability to use AVX512 classify method.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d9b65517c..19b714335 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -81,6 +81,10 @@ static const struct acl_alg acl_alg[] = {
 		.name = "altivec",
 		.alg = RTE_ACL_CLASSIFY_ALTIVEC,
 	},
+	{
+		.name = "avx512",
+		.alg = RTE_ACL_CLASSIFY_AVX512,
+	},
 };
 
 static struct {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH 20.11 6/7] acl: introduce AVX512 classify implementation
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (4 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 5/7] app/acl: add AVX512 classify support Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 7/7] acl: enhance " Konstantin Ananyev
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
  7 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add search_avx512x8x2(), which uses mostly 256-bit wide
registers/instructions and is able to process up to 16 flows in
parallel.
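
To make the vector code easier to follow, below is a scalar sketch of
what the single-category resolver (resolve_sc_avx512x8x2() and its
helpers) computes. It assumes the existing struct
rte_acl_match_results layout from acl.h and a match[] array holding
one block of nb_pkt match indexes per trie; ties go to the later
trie, matching the masked compare in the vector code:

	static void
	resolve_single_cat_scalar(uint32_t result[],
		const struct rte_acl_match_results pr[],
		const uint32_t match[], uint32_t nb_pkt, uint32_t nb_trie)
	{
		uint32_t i, k, m;

		for (k = 0; k != nb_pkt; k++) {
			uint32_t res = pr[match[k]].results[0];
			int32_t pri = pr[match[k]].priority[0];

			/* across all tries, keep the match with the
			 * highest priority */
			for (i = 1; i != nb_trie; i++) {
				m = match[i * nb_pkt + k];
				if (pr[m].priority[0] >= pri) {
					pri = pr[m].priority[0];
					res = pr[m].results[0];
				}
			}
			result[k] = res;
		}
	}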

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_run_avx512.c   | 120 ++++++
 lib/librte_acl/acl_run_avx512x8.h | 614 ++++++++++++++++++++++++++++++
 2 files changed, 734 insertions(+)
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 67274989d..8ee996679 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -4,10 +4,130 @@
 
 #include "acl_run_sse.h"
 
+/* sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results) */
+static const uint32_t match_log = 5;
+
+struct acl_flow_avx512 {
+	uint32_t num_packets;       /* number of packets processed */
+	uint32_t total_packets;     /* max number of packets to process */
+	uint32_t root_index;        /* current root index */
+	const uint64_t *trans;      /* transition table */
+	const uint32_t *data_index; /* input data indexes */
+	const uint8_t **idata;      /* input data */
+	uint32_t *matches;          /* match indexes */
+};
+
+static inline void
+acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
+	uint32_t trie, const uint8_t *data[], uint32_t *matches,
+	uint32_t total_packets)
+{
+	flow->num_packets = 0;
+	flow->total_packets = total_packets;
+	flow->root_index = ctx->trie[trie].root_index;
+	flow->trans = ctx->trans_table;
+	flow->data_index = ctx->trie[trie].data_index;
+	flow->idata = data;
+	flow->matches = matches;
+}
+
+/*
+ * Resolve matches for multiple categories (LE 8, use 128b instructions/regs)
+ */
+static inline void
+resolve_mcle8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, j, k, mi, mn;
+	__mmask8 msk;
+	xmm_t cp, cr, np, nr;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		for (j = 0; j != nb_cat; j += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			cr = _mm_loadu_si128((const xmm_t *)(res + mi + j));
+			cp = _mm_loadu_si128((const xmm_t *)(pri + mi + j));
+
+			for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+				mn = j + (pm[k] << match_log);
+
+				nr = _mm_loadu_si128((const xmm_t *)(res + mn));
+				np = _mm_loadu_si128((const xmm_t *)(pri + mn));
+
+				msk = _mm_cmpgt_epi32_mask(cp, np);
+				cr = _mm_mask_mov_epi32(nr, msk, cr);
+				cp = _mm_mask_mov_epi32(np, msk, cp);
+			}
+
+			_mm_storeu_si128((xmm_t *)(result + j), cr);
+		}
+	}
+}
+
+/*
+ * Resolve matches for multiple categories (GT 8, use 512b instructions/regs)
+ */
+static inline void
+resolve_mcgt8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, k, mi;
+	__mmask16 cm, sm;
+	__m512i cp, cr, np, nr;
+
+	const uint32_t match_log = 5;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	cm = (1 << nb_cat) - 1;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		cr = _mm512_maskz_loadu_epi32(cm, res + mi);
+		cp = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mi = pm[k] << match_log;
+
+			nr = _mm512_maskz_loadu_epi32(cm, res + mi);
+			np = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+			sm = _mm512_cmpgt_epi32_mask(cp, np);
+			cr = _mm512_mask_mov_epi32(nr, sm, cr);
+			cp = _mm512_mask_mov_epi32(np, sm, cp);
+		}
+
+		_mm512_mask_storeu_epi32(result, cm, cr);
+	}
+}
+
+#include "acl_run_avx512x8.h"
+
 int
 rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	if (num >= MAX_SEARCHES_AVX16)
+		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
 		return search_sse_8(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE4)
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
new file mode 100644
index 000000000..63b1d872f
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -0,0 +1,614 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define NUM_AVX512X8X2	(2 * CHAR_BIT)
+#define MSK_AVX512X8X2	(NUM_AVX512X8X2 - 1)
+
+static const rte_ymm_t ymm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_ymm_t ymm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_ymm_t ymm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+	},
+};
+
+static const rte_ymm_t ymm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const rte_ymm_t ymm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const rte_ymm_t ymm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const rte_ymm_t ymm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+	},
+};
+
+static const rte_ymm_t ymm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same way as quad range nodes.
+ */
+static __rte_always_inline ymm_t
+calc_addr8(ymm_t index_mask, ymm_t next_input, ymm_t shuffle_input,
+	ymm_t four_32, ymm_t range_base, ymm_t tr_lo, ymm_t tr_hi)
+{
+	ymm_t addr, in, node_type, r, t;
+	ymm_t dfa_msk, dfa_ofs, quad_ofs;
+
+	t = _mm256_xor_si256(index_mask, index_mask);
+	in = _mm256_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm256_andnot_si256(index_mask, tr_lo);
+	addr = _mm256_and_si256(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm256_cmpeq_epi32(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm256_srli_epi32(in, 30);
+	r = _mm256_add_epi8(r, range_base);
+	t = _mm256_srli_epi32(in, 24);
+	r = _mm256_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm256_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	t = _mm256_cmpgt_epi8(in, tr_hi);
+	t = _mm256_lzcnt_epi32(t);
+	t = _mm256_srli_epi32(t, 3);
+	quad_ofs = _mm256_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm256_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
+
+	/* calculate address for next transitions. */
+	addr = _mm256_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 8 transitions in parallel.
+ * tr_lo contains low 32 bits for 8 transitions.
+ * tr_hi contains high 32 bits for 8 transitions.
+ * next_input contains up to 4 input bytes for 8 flows.
+ */
+static __rte_always_inline ymm_t
+transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 8 transitions. */
+	addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
+		ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 8 transactions at once. */
+	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
+
+	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 8 transactions at once. */
+	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 8 flows in parallel.
+ * next_input should contain one input byte for up to 8 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static __rte_always_inline void
+first_trans8(const struct acl_flow_avx512 *flow, ymm_t next_input,
+	__mmask8 msk, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm256_set1_epi32(UINT8_MAX);
+	root = _mm256_set1_epi32(flow->root_index);
+
+	addr = _mm256_and_si256(next_input, addr);
+	addr = _mm256_add_epi32(root, addr);
+
+	/* load lower 32 bits of 8 transactions at once. */
+	*tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 8 transactions at once. */
+	*tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to 8 flows in parallel.
+ * pdata - 8 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these 8 flows.
+ */
+static inline ymm_t
+get_next_4bytes_avx512x8(const struct acl_flow_avx512 *flow, __m512i pdata,
+	__mmask8 mask, ymm_t *di)
+{
+	const int32_t *div;
+	ymm_t one, zero;
+	ymm_t inp, t;
+	__m512i p;
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm256_set1_epi32(1);
+	zero = _mm256_xor_si256(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm256_mmask_i32gather_epi32(zero, mask, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm256_mask_add_epi32(*di, mask, *di, one);
+
+	p = _mm512_cvtepu32_epi64(t);
+	p = _mm512_add_epi64(p, pdata);
+
+	/* load input bytes */
+	inp = _mm512_mask_i64gather_epi32(zero, mask, p, NULL, sizeof(uint8_t));
+	return inp;
+}
+
+/*
+ * Start up to 8 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i *pdata, ymm_t *idx, ymm_t *di)
+{
+	uint32_t nm;
+	ymm_t ni;
+	__m512i nd;
+
+	/* load input data pointers for new flows */
+	nm = (1 << num) - 1;
+	nd = _mm512_maskz_loadu_epi64(nm, flow->idata + flow->num_packets);
+
+	/* calculate match indexes of new flows */
+	ni = _mm256_set1_epi32(flow->num_packets);
+	ni = _mm256_add_epi32(ni, ymm_idx_add.y);
+
+	/* merge new and existing flows data */
+	*pdata = _mm512_mask_expand_epi64(*pdata, msk, nd);
+	*idx = _mm256_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Update flow and result masks based on the number of unprocessed flows.
+ */
+static inline uint32_t
+update_flow_mask8(const struct acl_flow_avx512 *flow, __mmask8 *fmsk,
+	__mmask8 *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
+/*
+ * Process found matches for up to 8 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexes for given flows
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static inline uint32_t
+match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
+	__mmask8 *rmsk,	__m512i *pdata, ymm_t *di, ymm_t *idx,
+	ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	uint32_t n;
+	ymm_t res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y);
+
+	/* mask  matched transitions to nop */
+	tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y);
+	tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y);
+
+	/* save found match indexes */
+	_mm256_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask8(flow, fmsk, rmsk);
+	start_flow8(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+
+static inline void
+match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, __mmask8 fm[2],
+	__m512i pdata[2], ymm_t di[2], ymm_t idx[2], ymm_t inp[2],
+	ymm_t tr_lo[2], ymm_t tr_hi[2])
+{
+	uint32_t n[2];
+	__mmask8 rm[2];
+
+	/* check for matches */
+	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
+	rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[1],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_4bytes_avx512x8(flow, pdata[0], rm[0],
+				&di[0]);
+			first_trans8(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]);
+
+			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
+				ymm_match_mask.y);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_4bytes_avx512x8(flow, pdata[1], rm[1],
+				&di[1]);
+			first_trans8(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]);
+
+			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
+				ymm_match_mask.y);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 16 flows in parallel.
+ * Use two sets of metadata, each serves 8 flows max.
+ * So in fact we perform search for 2x8 flows.
+ */
+static inline void
+search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
+{
+	__mmask8 fm[2];
+	__m512i pdata[2];
+	ymm_t di[2], idx[2], inp[2], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[1], &idx[1], &di[1]);
+
+	inp[0] = get_next_4bytes_avx512x8(flow, pdata[0], UINT8_MAX, &di[0]);
+	inp[1] = get_next_4bytes_avx512x8(flow, pdata[1], UINT8_MAX, &di[1]);
+
+	first_trans8(flow, inp[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans8(flow, inp[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT8_MAX;
+	fm[1] = UINT8_MAX;
+
+	/* match check */
+	match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		inp[0] = get_next_4bytes_avx512x8(flow, pdata[0], fm[0],
+			&di[0]);
+		inp[1] = get_next_4bytes_avx512x8(flow, pdata[1], fm[1],
+			&di[1]);
+
+		/* main 4B loop */
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline ymm_t
+resolve_match_idx_avx512x8(ymm_t mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm256_slli_epi32(mi, match_log);
+}
+
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline ymm_t
+resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask8 m;
+	ymm_t cp, cr, np, nr, mch;
+
+	const ymm_t zero = _mm256_set1_epi32(0);
+
+	mch = _mm256_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x8(mch);
+
+	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm256_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x8(mch);
+
+		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm256_cmpgt_epi32_mask(cp, np);
+		cr = _mm256_mask_mov_epi32(nr, m, cr);
+		cp = _mm256_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 8) matches for single category
+ */
+static inline void
+resolve_sc_avx512x8(uint32_t result[], const int32_t res[], const int32_t pri[],
+	const uint32_t match[], uint32_t nb_pkt, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	__mmask8 msk;
+	ymm_t cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
+	_mm256_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x8x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t i, j, k, n;
+	const uint32_t *pm;
+	const int32_t *res, *pri;
+	__mmask8 m[2];
+	ymm_t cp[2], cr[2], np[2], nr[2], mch[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
+
+		j = k + CHAR_BIT;
+
+		/* load match indexes for first trie */
+		mch[0] = _mm256_loadu_si256((const ymm_t *)(match + k));
+		mch[1] = _mm256_loadu_si256((const ymm_t *)(match + j));
+
+		mch[0] = resolve_match_idx_avx512x8(mch[0]);
+		mch[1] = resolve_match_idx_avx512x8(mch[1]);
+
+		/* load matches and their priorities for first trie */
+
+		cr[0] = _mm256_i32gather_epi32(res, mch[0], sizeof(res[0]));
+		cr[1] = _mm256_i32gather_epi32(res, mch[1], sizeof(res[0]));
+
+		cp[0] = _mm256_i32gather_epi32(pri, mch[0], sizeof(pri[0]));
+		cp[1] = _mm256_i32gather_epi32(pri, mch[1], sizeof(pri[0]));
+
+		/* select match with highest priority */
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mch[0] = _mm256_loadu_si256((const ymm_t *)(pm + k));
+			mch[1] = _mm256_loadu_si256((const ymm_t *)(pm + j));
+
+			mch[0] = resolve_match_idx_avx512x8(mch[0]);
+			mch[1] = resolve_match_idx_avx512x8(mch[1]);
+
+			nr[0] = _mm256_i32gather_epi32(res, mch[0],
+				sizeof(res[0]));
+			nr[1] = _mm256_i32gather_epi32(res, mch[1],
+				sizeof(res[0]));
+
+			np[0] = _mm256_i32gather_epi32(pri, mch[0],
+				sizeof(pri[0]));
+			np[1] = _mm256_i32gather_epi32(pri, mch[1],
+				sizeof(pri[0]));
+
+			m[0] = _mm256_cmpgt_epi32_mask(cp[0], np[0]);
+			m[1] = _mm256_cmpgt_epi32_mask(cp[1], np[1]);
+
+			cr[0] = _mm256_mask_mov_epi32(nr[0], m[0], cr[0]);
+			cr[1] = _mm256_mask_mov_epi32(nr[1], m[1], cr[1]);
+
+			cp[0] = _mm256_mask_mov_epi32(np[0], m[0], cp[0]);
+			cp[1] = _mm256_mask_mov_epi32(np[1], m[1], cp[1]);
+		}
+
+		_mm256_storeu_si256((ymm_t *)(result + k), cr[0]);
+		_mm256_storeu_si256((ymm_t *)(result + j), cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > CHAR_BIT) {
+			resolve_sc_avx512x8(result + k, res, pri, match + k,
+				CHAR_BIT, nb_trie, nb_pkt);
+			k += CHAR_BIT;
+			n -= CHAR_BIT;
+		}
+		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x8x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH 20.11 7/7] acl: enhance AVX512 classify implementation
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (5 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 6/7] acl: introduce AVX512 classify implementation Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
  7 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add search_avx512x16x2(), which uses mostly 512-bit wide
registers/instructions and is able to process up to 32 flows in
parallel.
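
As with the 8-wide variant, the per-flow input gather
(get_next_4bytes_avx512x16(), which splits the 16 data pointers
across two 512-bit registers) is the least obvious step. A scalar
per-flow equivalent, as a sketch with illustrative names ('data' is
one flow's input buffer, 'data_index' the trie's field-offset table,
'di' the running index into it):

	static inline uint32_t
	next_4bytes_scalar(const uint8_t *data, const uint32_t *data_index,
		uint32_t *di)
	{
		uint32_t inp;
		uint32_t off;

		/* offset of the next 4 input bytes for this flow */
		off = data_index[*di];
		(*di)++;

		memcpy(&inp, data + off, sizeof(inp)); /* unaligned 4B load */
		return inp;
	}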

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

This patch depends on:
https://patches.dpdk.org/patch/70429/
being applied first.

 lib/librte_acl/acl_run_avx512.c    |   3 +
 lib/librte_acl/acl_run_avx512x16.h | 635 +++++++++++++++++++++++++++++
 2 files changed, 638 insertions(+)
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h

diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 8ee996679..332e359fb 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -121,11 +121,14 @@ resolve_mcgt8_avx512x1(uint32_t result[],
 }
 
 #include "acl_run_avx512x8.h"
+#include "acl_run_avx512x16.h"
 
 int
 rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	if (num >= 2 * MAX_SEARCHES_AVX16)
+		return search_avx512x16x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_AVX16)
 		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
new file mode 100644
index 000000000..53216bda3
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -0,0 +1,635 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define	MASK16_BIT	(sizeof(__mmask16) * CHAR_BIT)
+
+#define NUM_AVX512X16X2	(2 * MASK16_BIT)
+#define MSK_AVX512X16X2	(NUM_AVX512X16X2 - 1)
+
+static const __rte_x86_zmm_t zmm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+		8, 9, 10, 11,
+		12, 13, 14, 15,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same way as quad range nodes.
+ */
+static __rte_always_inline __m512i
+calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input,
+	__m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi)
+{
+	__mmask64 qm;
+	__mmask16 dfa_msk;
+	__m512i addr, in, node_type, r, t;
+	__m512i dfa_ofs, quad_ofs;
+
+	t = _mm512_xor_si512(index_mask, index_mask);
+	in = _mm512_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm512_andnot_si512(index_mask, tr_lo);
+	addr = _mm512_and_si512(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm512_srli_epi32(in, 30);
+	r = _mm512_add_epi8(r, range_base);
+	t = _mm512_srli_epi32(in, 24);
+	r = _mm512_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm512_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _mm512_cmpgt_epi8_mask(in, tr_hi);
+	t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
+	t = _mm512_lzcnt_epi32(t);
+	t = _mm512_srli_epi32(t, 3);
+	quad_ofs = _mm512_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _mm512_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 16 transitions in parallel.
+ * tr_lo contains low 32 bits for 16 transitions.
+ * tr_hi contains high 32 bits for 16 transitions.
+ * next_input contains up to 4 input bytes for 16 flows.
+ */
+static __rte_always_inline __m512i
+transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo,
+	__m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 16 transitions. */
+	addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z,
+		zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 16 transactions at once. */
+	*tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0]));
+
+	next_input = _mm512_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 16 transactions at once. */
+	*tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0]));
+
+	return next_input;
+}
+
+static __rte_always_inline void
+first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input,
+	__mmask16 msk, __m512i *tr_lo, __m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm512_set1_epi32(UINT8_MAX);
+	root = _mm512_set1_epi32(flow->root_index);
+
+	addr = _mm512_and_si512(next_input, addr);
+	addr = _mm512_add_epi32(root, addr);
+
+	/* load lower 32 bits of 16 transactions at once. */
+	*tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 16 transactions at once. */
+	*tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+static inline __m512i
+get_next_4bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2],
+	uint32_t msk, __m512i *di)
+{
+	const int32_t *div;
+	__m512i one, zero, t, p[2];
+	ymm_t inp[2];
+
+	static const __rte_x86_zmm_t zmm_pminp = {
+		.u32 = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
+		},
+	};
+
+	const __mmask16 pmidx_msk = 0x5555;
+
+	static const __rte_x86_zmm_t zmm_pmidx[2] = {
+		[0] = {
+			.u32 = {
+				0, 0, 1, 0, 2, 0, 3, 0,
+				4, 0, 5, 0, 6, 0, 7, 0,
+			},
+		},
+		[1] = {
+			.u32 = {
+				8, 0, 9, 0, 10, 0, 11, 0,
+				12, 0, 13, 0, 14, 0, 15, 0,
+			},
+		},
+	};
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm512_set1_epi32(1);
+	zero = _mm512_xor_si512(one, one);
+
+	t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
+
+	*di = _mm512_mask_add_epi32(*di, msk, *di, one);
+
+	p[0] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[0].z, t);
+	p[1] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[1].z, t);
+
+	p[0] = _mm512_add_epi64(p[0], pdata[0]);
+	p[1] = _mm512_add_epi64(p[1], pdata[1]);
+
+	inp[0] = _mm512_mask_i64gather_epi32(_mm512_castsi512_si256(zero),
+		(msk & UINT8_MAX), p[0], NULL, sizeof(uint8_t));
+	inp[1] = _mm512_mask_i64gather_epi32(_mm512_castsi512_si256(zero),
+		(msk >> CHAR_BIT), p[1], NULL, sizeof(uint8_t));
+
+	return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]),
+			zmm_pminp.z, _mm512_castsi256_si512(inp[1]));
+}
+
+static inline void
+start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i pdata[2], __m512i *idx, __m512i *di)
+{
+	uint32_t n, nm[2];
+	__m512i ni, nd[2];
+
+	n = __builtin_popcount(msk & UINT8_MAX);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	nd[0] = _mm512_maskz_loadu_epi64(nm[0],
+		flow->idata + flow->num_packets);
+	nd[1] = _mm512_maskz_loadu_epi64(nm[1],
+		flow->idata + flow->num_packets + n);
+
+	ni = _mm512_set1_epi32(flow->num_packets);
+	ni = _mm512_add_epi32(ni, zmm_idx_add.z);
+
+	pdata[0] = _mm512_mask_expand_epi64(pdata[0], (msk & UINT8_MAX), nd[0]);
+	pdata[1] = _mm512_mask_expand_epi64(pdata[1], (msk >> CHAR_BIT), nd[1]);
+
+	*idx = _mm512_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+static inline uint32_t
+update_flow_mask16(const struct acl_flow_avx512 *flow, __mmask16 *fmsk,
+	__mmask16 *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
+static inline uint32_t
+match_process_avx512x16(struct acl_flow_avx512 *flow, __mmask16 *fmsk,
+	__mmask16 *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx,
+	__m512i *tr_lo, __m512i *tr_hi)
+{
+	uint32_t n;
+	__m512i res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z);
+
+	/* mask  matched transitions to nop */
+	tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z);
+	tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z);
+
+	/* save found match indexes */
+	_mm512_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask16(flow, fmsk, rmsk);
+	start_flow16(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+static inline void
+match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, __mmask16 fm[2],
+	__m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2],
+	__m512i tr_lo[2], __m512i tr_hi[2])
+{
+	uint32_t n[2];
+	__mmask16 rm[2];
+
+	/* check for matches */
+	rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z);
+	rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z);
+
+	while ((rm[0] | rm[1]) != 0) {
+
+		n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		if (n[0] != 0) {
+			inp[0] = get_next_4bytes_avx512x16(flow, &pdata[0],
+				rm[0], &di[0]);
+			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
+				&tr_hi[0]);
+			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
+				zmm_match_mask.z);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_4bytes_avx512x16(flow, &pdata[2],
+				rm[1], &di[1]);
+			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
+				&tr_hi[1]);
+			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
+				zmm_match_mask.z);
+		}
+	}
+}
+
+static inline void
+search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
+{
+	__mmask16 fm[2];
+	__m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
+
+	in[0] = get_next_4bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0]);
+	in[1] = get_next_4bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1]);
+
+	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT16_MAX;
+	fm[1] = UINT16_MAX;
+
+	/* match check */
+	match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = get_next_4bytes_avx512x16(flow, &pdata[0], fm[0],
+			&di[0]);
+		in[1] = get_next_4bytes_avx512x16(flow, &pdata[2], fm[1],
+			&di[1]);
+
+		/* main 4B loop */
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+static inline __m512i
+resolve_match_idx_avx512x16(__m512i mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm512_slli_epi32(mi, match_log);
+}
+
+static inline __m512i
+resolve_pri_avx512x16(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask16 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask16 m;
+	__m512i cp, cr, np, nr, mch;
+
+	const __m512i zero = _mm512_set1_epi32(0);
+
+	mch = _mm512_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x16(mch);
+
+	cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm512_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x16(mch);
+
+		nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm512_cmpgt_epi32_mask(cp, np);
+		cr = _mm512_mask_mov_epi32(nr, m, cr);
+		cp = _mm512_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 16) matches for single category
+ */
+static inline void
+resolve_sc_avx512x16(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	__mmask16 msk;
+	__m512i cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip);
+	_mm512_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x16x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t i, j, k, n;
+	const uint32_t *pm;
+	const int32_t *res, *pri;
+	__mmask16 m[2];
+	__m512i cp[2], cr[2], np[2], nr[2], mch[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) {
+
+		j = k + MASK16_BIT;
+
+		/* load match indexes for first trie */
+		mch[0] = _mm512_loadu_si512(match + k);
+		mch[1] = _mm512_loadu_si512(match + j);
+
+		mch[0] = resolve_match_idx_avx512x16(mch[0]);
+		mch[1] = resolve_match_idx_avx512x16(mch[1]);
+
+		/* load matches and their priorities for first trie */
+
+		cr[0] = _mm512_i32gather_epi32(mch[0], res, sizeof(res[0]));
+		cr[1] = _mm512_i32gather_epi32(mch[1], res, sizeof(res[0]));
+
+		cp[0] = _mm512_i32gather_epi32(mch[0], pri, sizeof(pri[0]));
+		cp[1] = _mm512_i32gather_epi32(mch[1], pri, sizeof(pri[0]));
+
+		/* select match with highest priority */
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mch[0] = _mm512_loadu_si512(pm + k);
+			mch[1] = _mm512_loadu_si512(pm + j);
+
+			mch[0] = resolve_match_idx_avx512x16(mch[0]);
+			mch[1] = resolve_match_idx_avx512x16(mch[1]);
+
+			nr[0] = _mm512_i32gather_epi32(mch[0], res,
+				sizeof(res[0]));
+			nr[1] = _mm512_i32gather_epi32(mch[1], res,
+				sizeof(res[0]));
+
+			np[0] = _mm512_i32gather_epi32(mch[0], pri,
+				sizeof(pri[0]));
+			np[1] = _mm512_i32gather_epi32(mch[1], pri,
+				sizeof(pri[0]));
+
+			m[0] = _mm512_cmpgt_epi32_mask(cp[0], np[0]);
+			m[1] = _mm512_cmpgt_epi32_mask(cp[1], np[1]);
+
+			cr[0] = _mm512_mask_mov_epi32(nr[0], m[0], cr[0]);
+			cr[1] = _mm512_mask_mov_epi32(nr[1], m[1], cr[1]);
+
+			cp[0] = _mm512_mask_mov_epi32(np[0], m[0], cp[0]);
+			cp[1] = _mm512_mask_mov_epi32(np[1], m[1], cp[1]);
+		}
+
+		_mm512_storeu_si512(result + k, cr[0]);
+		_mm512_storeu_si512(result + j, cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > MASK16_BIT) {
+			resolve_sc_avx512x16(result + k, res, pri, match + k,
+				MASK16_BIT, nb_trie, nb_pkt);
+			k += MASK16_BIT;
+			n -= MASK16_BIT;
+		}
+		resolve_sc_avx512x16(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x16x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x16x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (6 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 7/7] acl: enhance " Konstantin Ananyev
@ 2020-09-15 16:50 ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
                     ` (12 more replies)
  7 siblings, 13 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

This patch series introduces support for an AVX512-specific classify
implementation in the ACL library.
It contains two code paths:
the first uses mostly 256-bit instructions/registers and can
process up to 16 flows in parallel;
the second uses 512-bit instructions/registers in the majority of
places and can process up to 32 flows in parallel.
This runtime code-path selection is done internally, based
on the input burst size, and is totally opaque to the user.
On my SKX box test-acl shows ~20-65% improvement
(depending on rule-set and input burst size)
when switching from the AVX2 to the AVX512 classify algorithm.
ICX and CLX testing showed a similar level of speedup: up to ~50-60%.
The current AVX512 classify implementation is supported only on x86_64.
Note that this series introduces a formal ABI incompatibility
with previous versions of the ACL library.
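For reference, a minimal usage sketch (not part of this series) of how an
application could opt in to the new method once the series is applied;
rte_acl_set_ctx_classify() and rte_acl_classify() are the existing public
API, RTE_ACL_CLASSIFY_AVX512 is the enum value added by these patches:

#include <rte_acl.h>

/* Classify one burst with the AVX512 method when available,
 * falling back to the library-selected default otherwise.
 * 'ctx' is assumed to be already created and built.
 */
static int
classify_burst_avx512(struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num)
{
	if (rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX512) != 0)
		rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_DEFAULT);

	return rte_acl_classify(ctx, data, results, num, 1);
}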

v1 -> v2:
  Deduplicated 8/16 code paths as much as possible
  Updated default algorithm selection
    Removed library constructor to make it easier to integrate with
    https://patches.dpdk.org/project/dpdk/list/?series=11831
  Updated docs

This patch series depends on:
https://patches.dpdk.org/patch/73922/mbox/
being applied first.

Konstantin Ananyev (12):
  acl: fix x86 build when compiler doesn't support AVX2
  doc: fix mixing classify methods in ACL guide
  acl: remove of unused enum value
  acl: remove library constructor
  app/acl: few small improvements
  test/acl: expand classify test coverage
  acl: add infrastructure to support AVX512 classify
  acl: introduce AVX512 classify implementation
  acl: enhance AVX512 classify implementation
  acl: for AVX512 classify use 4B load whenever possible
  test/acl: add AVX512 classify support
  app/acl: add AVX512 classify support

 app/test-acl/main.c                           |  19 +-
 app/test/test_acl.c                           | 104 ++--
 config/x86/meson.build                        |   3 +-
 .../prog_guide/packet_classif_access_ctrl.rst |  15 +
 doc/guides/rel_notes/deprecation.rst          |   4 -
 doc/guides/rel_notes/release_20_11.rst        |   9 +
 lib/librte_acl/acl.h                          |  12 +
 lib/librte_acl/acl_bld.c                      |  34 ++
 lib/librte_acl/acl_gen.c                      |   2 +-
 lib/librte_acl/acl_run_avx512.c               | 331 +++++++++++
 lib/librte_acl/acl_run_avx512x16.h            | 526 ++++++++++++++++++
 lib/librte_acl/acl_run_avx512x8.h             | 439 +++++++++++++++
 lib/librte_acl/meson.build                    |  39 ++
 lib/librte_acl/rte_acl.c                      | 198 +++++--
 lib/librte_acl/rte_acl.h                      |   3 +-
 15 files changed, 1638 insertions(+), 100 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 02/12] doc: fix mixing classify methods in ACL guide Konstantin Ananyev
                     ` (11 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Right now we define dummy version of rte_acl_classify_avx2()
when both X86 and AVX2 are not detected, though it should be
for non-AVX2 case only.

Fixes: e53ce4e41379 ("acl: remove use of weak functions")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 777ec4d34..715b02359 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,7 +16,6 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
-#ifndef RTE_ARCH_X86
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -33,6 +32,7 @@ rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
 }
 #endif
 
+#ifndef RTE_ARCH_X86
 int
 rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
 	__rte_unused const uint8_t **data,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 02/12] doc: fix mixing classify methods in ACL guide
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value Konstantin Ananyev
                     ` (10 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Add a brief description for the missing ACL classify algorithms:
RTE_ACL_CLASSIFY_NEON and RTE_ACL_CLASSIFY_ALTIVEC.

Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8")
Fixes: 1d73135f9f1c ("acl: add AltiVec for ppc64")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 doc/guides/prog_guide/packet_classif_access_ctrl.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index 0345512b9..daf03e6d7 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -373,6 +373,12 @@ There are several implementations of classify algorithm:
 
 *   **RTE_ACL_CLASSIFY_AVX2**: vector implementation, can process up to 16 flows in parallel. Requires AVX2 support.
 
+*   **RTE_ACL_CLASSIFY_NEON**: vector implementation, can process up to 8 flows
+    in parallel. Requires NEON support.
+
+*   **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8
+    flows in parallel. Requires ALTIVEC support.
+
 It is purely a runtime decision which method to choose, there is no build-time difference.
 All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation.
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 02/12] doc: fix mixing classify methods in ACL guide Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-27  3:27     ` Ruifeng Wang
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 04/12] acl: remove library constructor Konstantin Ananyev
                     ` (9 subsequent siblings)
  12 siblings, 1 reply; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Remove the unused enum value (RTE_ACL_CLASSIFY_NUM).
This enum value is not used inside DPDK, but it prevents
adding new classify algorithms without causing an ABI breakage.

Note that this change introduces a formal ABI incompatibility
with previous versions of the ACL library.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 doc/guides/rel_notes/deprecation.rst   | 4 ----
 doc/guides/rel_notes/release_20_11.rst | 4 ++++
 lib/librte_acl/rte_acl.h               | 1 -
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 52168f775..3279a01ef 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -288,10 +288,6 @@ Deprecation Notices
   - https://patches.dpdk.org/patch/71457/
   - https://patches.dpdk.org/patch/71456/
 
-* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value will be removed.
-  This enum value is not used inside DPDK, while it prevents to add new
-  classify algorithms without causing an ABI breakage.
-
 * sched: To allow more traffic classes, flexible mapping of pipe queues to
   traffic classes, and subport level configuration of pipes and queues
   changes will be made to macros, data structures and API functions defined
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index b729bdf20..a9a1b0305 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -97,6 +97,10 @@ API Changes
   and the function ``rte_rawdev_queue_conf_get()``
   from ``void`` to ``int`` allowing the return of error codes from drivers.
 
+* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value has been removed.
+  This enum value was not used inside DPDK, but it prevented adding new
+  classify algorithms without causing an ABI breakage.
+
 
 ABI Changes
 -----------
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index aa22e70c6..b814423a6 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,7 +241,6 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
-	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
 };
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 04/12] acl: remove library constructor
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (2 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 05/12] app/acl: few small improvements Konstantin Ananyev
                     ` (8 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Right now the ACL library determines the best possible (default) classify
method on a given platform with a special constructor function,
rte_acl_init().
This patch makes the following changes:
 - Move selection of the default classify method into a separate private
   function and call it for each ACL context creation (rte_acl_create()).
 - Remove the library constructor function.
 - Make rte_acl_set_ctx_classify() check that the requested algorithm
   is supported on the given platform.

The purpose of these changes is to improve and simplify the algorithm
selection process and to prepare the ACL library for integration with the
"add max SIMD bitwidth to EAL" patch-set:
https://patches.dpdk.org/project/dpdk/list/?series=11831
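
To illustrate the intended behaviour (a sketch, not part of the patch
itself): RTE_ACL_CLASSIFY_DEFAULT is now resolved to the best supported
method at set time, while an explicit request for an unsupported method
is rejected with -ENOTSUP:

#include <errno.h>
#include <rte_acl.h>

static void
select_classify_method(struct rte_acl_ctx *ctx)
{
	int ret;

	/* an explicit request is validated against build-time and
	 * run-time support for the given platform
	 */
	ret = rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX2);
	if (ret == -ENOTSUP)
		/* fall back: DEFAULT picks the best supported method */
		rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_DEFAULT);
}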

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 166 ++++++++++++++++++++++++++++++---------
 lib/librte_acl/rte_acl.h |   1 +
 2 files changed, 132 insertions(+), 35 deletions(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 715b02359..fbcf45fdc 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -79,57 +79,153 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
 };
 
-/* by default, use always available scalar code path. */
-static enum rte_acl_classify_alg rte_acl_default_classify =
-	RTE_ACL_CLASSIFY_SCALAR;
+/*
+ * Helper function for acl_check_alg.
+ * Check support for ARM specific classify methods.
+ */
+static int
+acl_check_alg_arm(enum rte_acl_classify_alg alg)
+{
+	if (alg == RTE_ACL_CLASSIFY_NEON) {
+#if defined(RTE_ARCH_ARM64)
+		return 0;
+#elif defined(RTE_ARCH_ARM)
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON))
+			return 0;
+		return -ENOTSUP;
+#else
+		return -ENOTSUP;
+#endif
+	}
+
+	return -EINVAL;
+}
 
-static void
-rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
+/*
+ * Helper function for acl_check_alg.
+ * Check support for PPC specific classify methods.
+ */
+static int
+acl_check_alg_ppc(enum rte_acl_classify_alg alg)
 {
-	rte_acl_default_classify = alg;
+	if (alg == RTE_ACL_CLASSIFY_ALTIVEC) {
+#if defined(RTE_ARCH_PPC_64)
+		return 0;
+#else
+		return -ENOTSUP;
+#endif
+	}
+
+	return -EINVAL;
 }
 
-extern int
-rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+/*
+ * Helper function for acl_check_alg.
+ * Check support for x86 specific classify methods.
+ */
+static int
+acl_check_alg_x86(enum rte_acl_classify_alg alg)
 {
-	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
-		return -EINVAL;
+	if (alg == RTE_ACL_CLASSIFY_AVX2) {
+#ifdef CC_AVX2_SUPPORT
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
 
-	ctx->alg = alg;
-	return 0;
+	if (alg == RTE_ACL_CLASSIFY_SSE) {
+#ifdef RTE_ARCH_X86
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
+
+	return -EINVAL;
 }
 
 /*
- * Select highest available classify method as default one.
- * Note that CLASSIFY_AVX2 should be set as a default only
- * if both conditions are met:
- * at build time compiler supports AVX2 and target cpu supports AVX2.
+ * Check if input alg is supported by given platform/binary.
+ * Note that both conditions should be met:
+ * - at build time compiler supports ISA used by given method
+ * - at run time target cpu supports necessary ISA.
  */
-RTE_INIT(rte_acl_init)
+static int
+acl_check_alg(enum rte_acl_classify_alg alg)
 {
-	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
+	switch (alg) {
+	case RTE_ACL_CLASSIFY_NEON:
+		return acl_check_alg_arm(alg);
+	case RTE_ACL_CLASSIFY_ALTIVEC:
+		return acl_check_alg_ppc(alg);
+	case RTE_ACL_CLASSIFY_AVX2:
+	case RTE_ACL_CLASSIFY_SSE:
+		return acl_check_alg_x86(alg);
+	/* scalar method is supported on all platforms */
+	case RTE_ACL_CLASSIFY_SCALAR:
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
 
-#if defined(RTE_ARCH_ARM64)
-	alg =  RTE_ACL_CLASSIFY_NEON;
-#elif defined(RTE_ARCH_ARM)
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON))
-		alg =  RTE_ACL_CLASSIFY_NEON;
+/*
+ * Get preferred alg for given platform.
+ */
+static enum rte_acl_classify_alg
+acl_get_best_alg(void)
+{
+	/*
+	 * array of supported methods for each platform.
+	 * Note that order is important - from most to less preferable.
+	 */
+	static const enum rte_acl_classify_alg alg[] = {
+#if defined(RTE_ARCH_ARM)
+		RTE_ACL_CLASSIFY_NEON,
 #elif defined(RTE_ARCH_PPC_64)
-	alg = RTE_ACL_CLASSIFY_ALTIVEC;
-#else
-#ifdef CC_AVX2_SUPPORT
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
-		alg = RTE_ACL_CLASSIFY_AVX2;
-	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
-#else
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		RTE_ACL_CLASSIFY_ALTIVEC,
+#elif defined(RTE_ARCH_X86)
+		RTE_ACL_CLASSIFY_AVX2,
+		RTE_ACL_CLASSIFY_SSE,
 #endif
-		alg = RTE_ACL_CLASSIFY_SSE;
+		RTE_ACL_CLASSIFY_SCALAR,
+	};
 
-#endif
-	rte_acl_set_default_classify(alg);
+	uint32_t i;
+
+	/* find best possible alg */
+	for (i = 0; i != RTE_DIM(alg) && acl_check_alg(alg[i]) != 0; i++)
+		;
+
+	/* we always have to find something suitable */
+	RTE_VERIFY(i != RTE_DIM(alg));
+	return alg[i];
+}
+
+extern int
+rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+{
+	int32_t rc;
+
+	/* formal parameters check */
+	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
+		return -EINVAL;
+
+	/* user asked us to select the *best* one */
+	if (alg == RTE_ACL_CLASSIFY_DEFAULT)
+		alg = acl_get_best_alg();
+
+	/* check that given alg is supported */
+	rc = acl_check_alg(alg);
+	if (rc != 0)
+		return rc;
+
+	ctx->alg = alg;
+	return 0;
 }
 
+
 int
 rte_acl_classify_alg(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories,
@@ -262,7 +358,7 @@ rte_acl_create(const struct rte_acl_param *param)
 		ctx->max_rules = param->max_rule_num;
 		ctx->rule_sz = param->rule_size;
 		ctx->socket_id = param->socket_id;
-		ctx->alg = rte_acl_default_classify;
+		ctx->alg = acl_get_best_alg();
 		strlcpy(ctx->name, param->name, sizeof(ctx->name));
 
 		te->data = (void *) ctx;
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index b814423a6..3999f15de 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -329,6 +329,7 @@ rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
  *   existing algorithm, and that it could be run on the given CPU.
  * @return
  *   - -EINVAL if the parameters are invalid.
+ *   - -ENOTSUP requested algorithm is not supported by given platform.
  *   - Zero if operation completed successfully.
  */
 extern int
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 05/12] app/acl: few small improvements
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (3 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 04/12] acl: remove library constructor Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 06/12] test/acl: expand classify test coverage Konstantin Ananyev
                     ` (7 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

- enhance output to print extra stats
- use rte_rdtsc_precise() for cycle measurements

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index 0a5dfb621..d9b65517c 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -862,9 +862,10 @@ search_ip5tuples(__rte_unused void *arg)
 {
 	uint64_t pkt, start, tm;
 	uint32_t i, lcore;
+	long double st;
 
 	lcore = rte_lcore_id();
-	start = rte_rdtsc();
+	start = rte_rdtsc_precise();
 	pkt = 0;
 
 	for (i = 0; i != config.iter_num; i++) {
@@ -872,12 +873,16 @@ search_ip5tuples(__rte_unused void *arg)
 			config.trace_step, config.alg.name);
 	}
 
-	tm = rte_rdtsc() - start;
+	tm = rte_rdtsc_precise() - start;
+
+	st = (long double)tm / rte_get_timer_hz();
 	dump_verbose(DUMP_NONE, stdout,
 		"%s  @lcore %u: %" PRIu32 " iterations, %" PRIu64 " pkts, %"
-		PRIu32 " categories, %" PRIu64 " cycles, %#Lf cycles/pkt\n",
-		__func__, lcore, i, pkt, config.run_categories,
-		tm, (pkt == 0) ? 0 : (long double)tm / pkt);
+		PRIu32 " categories, %" PRIu64 " cycles (%.2Lf sec), "
+		"%.2Lf cycles/pkt, %.2Lf pkt/sec\n",
+		__func__, lcore, i, pkt,
+		config.run_categories, tm, st,
+		(pkt == 0) ? 0 : (long double)tm / pkt, pkt / st);
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 06/12] test/acl: expand classify test coverage
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (4 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 05/12] app/acl: few small improvements Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
                     ` (6 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Make the classify test run for all supported methods.
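
The approach, as a small standalone sketch (the real test below also
verifies per-category results): walk the list of classify algorithms and
skip the ones the current platform cannot run:

#include <errno.h>
#include <rte_acl.h>
#include <rte_common.h>

/* run classify with every supported algorithm; unsupported ones are
 * skipped instead of failing the test.
 * 'results' must hold at least 'num' elements.
 */
static int
run_all_classify_algs(struct rte_acl_ctx *acx, const uint8_t **data,
	uint32_t *results, uint32_t num)
{
	static const enum rte_acl_classify_alg algs[] = {
		RTE_ACL_CLASSIFY_SCALAR,
		RTE_ACL_CLASSIFY_SSE,
		RTE_ACL_CLASSIFY_AVX2,
		RTE_ACL_CLASSIFY_NEON,
		RTE_ACL_CLASSIFY_ALTIVEC,
	};
	uint32_t i;
	int ret;

	for (i = 0; i != RTE_DIM(algs); i++) {
		ret = rte_acl_set_ctx_classify(acx, algs[i]);
		if (ret == -ENOTSUP)
			continue;
		ret = rte_acl_classify(acx, data, results, num, 1);
		if (ret != 0)
			return ret;
	}

	/* restore the default method for subsequent tests */
	return rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_DEFAULT);
}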

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test/test_acl.c | 103 ++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 52 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 316bf4d06..333b34757 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -266,22 +266,20 @@ rte_acl_ipv4vlan_build(struct rte_acl_ctx *ctx,
 }
 
 /*
- * Test scalar and SSE ACL lookup.
+ * Test ACL lookup (selected alg).
  */
 static int
-test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
-	size_t dim)
+test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
+	const uint8_t *data[], size_t dim, enum rte_acl_classify_alg alg)
 {
-	int ret, i;
-	uint32_t result, count;
+	int32_t ret;
+	uint32_t i, result, count;
 	uint32_t results[dim * RTE_ACL_MAX_CATEGORIES];
-	const uint8_t *data[dim];
-	/* swap all bytes in the data to network order */
-	bswap_test_data(test_data, dim, 1);
 
-	/* store pointers to test data */
-	for (i = 0; i < (int) dim; i++)
-		data[i] = (uint8_t *)&test_data[i];
+	/* set given classify alg, skip test if alg is not supported */
+	ret = rte_acl_set_ctx_classify(acx, alg);
+	if (ret == -ENOTSUP)
+		return 0;
 
 	/**
 	 * these will run quite a few times, it's necessary to test code paths
@@ -291,12 +289,13 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 		ret = rte_acl_classify(acx, data, results,
 				count, RTE_ACL_MAX_CATEGORIES);
 		if (ret != 0) {
-			printf("Line %i: SSE classify failed!\n", __LINE__);
-			goto err;
+			printf("Line %i: classify(alg=%d) failed!\n",
+				__LINE__, alg);
+			return ret;
 		}
 
 		/* check if we allow everything we should allow */
-		for (i = 0; i < (int) count; i++) {
+		for (i = 0; i < count; i++) {
 			result =
 				results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW];
 			if (result != test_data[i].allow) {
@@ -304,63 +303,63 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, test_data[i].allow,
 					result);
-				ret = -EINVAL;
-				goto err;
+				return -EINVAL;
 			}
 		}
 
 		/* check if we deny everything we should deny */
-		for (i = 0; i < (int) count; i++) {
+		for (i = 0; i < count; i++) {
 			result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY];
 			if (result != test_data[i].deny) {
 				printf("Line %i: Error in deny results at %i "
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, test_data[i].deny,
 					result);
-				ret = -EINVAL;
-				goto err;
+				return -EINVAL;
 			}
 		}
 	}
 
-	/* make a quick check for scalar */
-	ret = rte_acl_classify_alg(acx, data, results,
-			dim, RTE_ACL_MAX_CATEGORIES,
-			RTE_ACL_CLASSIFY_SCALAR);
-	if (ret != 0) {
-		printf("Line %i: scalar classify failed!\n", __LINE__);
-		goto err;
-	}
+	/* restore default classify alg */
+	return rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_DEFAULT);
+}
 
-	/* check if we allow everything we should allow */
-	for (i = 0; i < (int) dim; i++) {
-		result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW];
-		if (result != test_data[i].allow) {
-			printf("Line %i: Error in allow results at %i "
-					"(expected %"PRIu32" got %"PRIu32")!\n",
-					__LINE__, i, test_data[i].allow,
-					result);
-			ret = -EINVAL;
-			goto err;
-		}
-	}
+/*
+ * Test ACL lookup (all possible methods).
+ */
+static int
+test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
+	size_t dim)
+{
+	int32_t ret;
+	uint32_t i;
+	const uint8_t *data[dim];
 
-	/* check if we deny everything we should deny */
-	for (i = 0; i < (int) dim; i++) {
-		result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY];
-		if (result != test_data[i].deny) {
-			printf("Line %i: Error in deny results at %i "
-					"(expected %"PRIu32" got %"PRIu32")!\n",
-					__LINE__, i, test_data[i].deny,
-					result);
-			ret = -EINVAL;
-			goto err;
-		}
-	}
+	static const enum rte_acl_classify_alg alg[] = {
+		RTE_ACL_CLASSIFY_SCALAR,
+		RTE_ACL_CLASSIFY_SSE,
+		RTE_ACL_CLASSIFY_AVX2,
+		RTE_ACL_CLASSIFY_NEON,
+		RTE_ACL_CLASSIFY_ALTIVEC,
+	};
+
+	/* swap all bytes in the data to network order */
+	bswap_test_data(test_data, dim, 1);
+
+	/* store pointers to test data */
+	for (i = 0; i < dim; i++)
+		data[i] = (uint8_t *)&test_data[i];
 
 	ret = 0;
+	for (i = 0; i != RTE_DIM(alg); i++) {
+		ret = test_classify_alg(acx, test_data, data, dim, alg[i]);
+		if (ret < 0) {
+			printf("Line %i: %s() for alg=%d failed, errno=%d\n",
+				__LINE__, __func__, alg[i], -ret);
+			break;
+		}
+	}
 
-err:
 	/* swap data back to cpu order so that next time tests don't fail */
 	bswap_test_data(test_data, dim, 0);
 	return ret;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (5 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 06/12] test/acl: expand classify test coverage Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-16  9:11     ` Bruce Richardson
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 08/12] acl: introduce AVX512 classify implementation Konstantin Ananyev
                     ` (5 subsequent siblings)
  12 siblings, 1 reply; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add the necessary changes to support the new AVX512-specific ACL classify
algorithm:
 - changes in meson.build to check that the build tools
   (compiler, assembler, etc.) properly support AVX512.
 - run-time checks to make sure the target platform does support AVX512.
 - a dummy rte_acl_classify_avx512() for targets where the AVX512
   implementation cannot be properly supported.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 config/x86/meson.build          |  3 ++-
 lib/librte_acl/acl.h            |  4 ++++
 lib/librte_acl/acl_run_avx512.c | 17 ++++++++++++++
 lib/librte_acl/meson.build      | 39 +++++++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c        | 29 ++++++++++++++++++++++++
 lib/librte_acl/rte_acl.h        |  1 +
 6 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c

diff --git a/config/x86/meson.build b/config/x86/meson.build
index 6ec020ef6..c5626e914 100644
--- a/config/x86/meson.build
+++ b/config/x86/meson.build
@@ -23,7 +23,8 @@ foreach f:base_flags
 endforeach
 
 optional_flags = ['AES', 'PCLMUL',
-		'AVX', 'AVX2', 'AVX512F',
+		'AVX', 'AVX2',
+		'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW',
 		'RDRND', 'RDSEED']
 foreach f:optional_flags
 	if cc.get_define('__@0@__'.format(f), args: machine_args) == '1'
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 39d45a0c2..2022cf253 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -201,6 +201,10 @@ int
 rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
+int
+rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
 int
 rte_acl_classify_neon(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
new file mode 100644
index 000000000..67274989d
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "acl_run_sse.h"
+
+int
+rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (num >= MAX_SEARCHES_SSE8)
+		return search_sse_8(ctx, data, results, num, categories);
+	if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+
+	return rte_acl_classify_scalar(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/meson.build b/lib/librte_acl/meson.build
index d1e2c184c..b2fd61cad 100644
--- a/lib/librte_acl/meson.build
+++ b/lib/librte_acl/meson.build
@@ -27,6 +27,45 @@ if dpdk_conf.has('RTE_ARCH_X86')
 		cflags += '-DCC_AVX2_SUPPORT'
 	endif
 
+	# compile AVX512 version if:
+	# we are building 64-bit binary AND binutils can generate proper code
+
+	if dpdk_conf.has('RTE_ARCH_X86_64') and binutils_ok.returncode() == 0
+
+		# compile AVX512 version if either:
+		# a. we have AVX512 supported in minimum instruction set
+		#    baseline
+		# b. it's not minimum instruction set, but supported by
+		#    compiler
+		#
+		# in former case, just add avx512 C file to files list
+		# in latter case, compile c file to static lib, using correct
+		# compiler flags, and then have the .o file from static lib
+		# linked into main lib.
+
+		if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512VL') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512CD') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512BW')
+
+			sources += files('acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+
+		elif cc.has_multi_arguments('-mavx512f', '-mavx512vl',
+					'-mavx512cd', '-mavx512bw')
+
+			avx512_tmplib = static_library('avx512_tmp',
+				'acl_run_avx512.c',
+				dependencies: static_rte_eal,
+				c_args: cflags +
+					['-mavx512f', '-mavx512vl',
+					 '-mavx512cd', '-mavx512bw'])
+			objs += avx512_tmplib.extract_objects(
+					'acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+		endif
+	endif
+
 elif dpdk_conf.has('RTE_ARCH_ARM') or dpdk_conf.has('RTE_ARCH_ARM64')
 	cflags += '-flax-vector-conversions'
 	sources += files('acl_run_neon.c')
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index fbcf45fdc..fdcb7a798 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,6 +16,22 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
+#ifndef CC_AVX512_SUPPORT
+/*
+ * If the compiler doesn't support AVX512 instructions,
+ * then the dummy one would be used instead for AVX512 classify method.
+ */
+int
+rte_acl_classify_avx512(__rte_unused const struct rte_acl_ctx *ctx,
+	__rte_unused const uint8_t **data,
+	__rte_unused uint32_t *results,
+	__rte_unused uint32_t num,
+	__rte_unused uint32_t categories)
+{
+	return -ENOTSUP;
+}
+#endif
+
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -77,6 +93,7 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
 	[RTE_ACL_CLASSIFY_NEON] = rte_acl_classify_neon,
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
+	[RTE_ACL_CLASSIFY_AVX512] = rte_acl_classify_avx512,
 };
 
 /*
@@ -126,6 +143,17 @@ acl_check_alg_ppc(enum rte_acl_classify_alg alg)
 static int
 acl_check_alg_x86(enum rte_acl_classify_alg alg)
 {
+	if (alg == RTE_ACL_CLASSIFY_AVX512) {
+#ifdef CC_AVX512_SUPPORT
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
+
 	if (alg == RTE_ACL_CLASSIFY_AVX2) {
 #ifdef CC_AVX2_SUPPORT
 		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
@@ -159,6 +187,7 @@ acl_check_alg(enum rte_acl_classify_alg alg)
 		return acl_check_alg_arm(alg);
 	case RTE_ACL_CLASSIFY_ALTIVEC:
 		return acl_check_alg_ppc(alg);
+	case RTE_ACL_CLASSIFY_AVX512:
 	case RTE_ACL_CLASSIFY_AVX2:
 	case RTE_ACL_CLASSIFY_SSE:
 		return acl_check_alg_x86(alg);
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index 3999f15de..d243a1c84 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,6 +241,7 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
+	RTE_ACL_CLASSIFY_AVX512 = 6,    /**< requires AVX512 support. */
 };
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 08/12] acl: introduce AVX512 classify implementation
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (6 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 09/12] acl: enhance " Konstantin Ananyev
                     ` (4 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Introduce a classify implementation that uses AVX512-specific ISA.
The current approach uses a mix of 256-bit and 512-bit registers/instructions
and is able to process up to 16 flows in parallel.
Note that for now only a 64-bit version of rte_acl_classify_avx512()
is available.
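
As a usage note (a sketch, not part of the patch): a single burst can also
be pushed through the new code path with the existing rte_acl_classify_alg()
API; the caller is responsible for checking beforehand (e.g. via
rte_acl_set_ctx_classify()) that the platform supports it:

#include <rte_acl.h>

/* one-off classification with the AVX512 method, without changing the
 * algorithm stored in the context; assumes AVX512 support was verified
 * beforehand
 */
static int
classify_once_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	return rte_acl_classify_alg(ctx, data, results, num, categories,
		RTE_ACL_CLASSIFY_AVX512);
}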

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h              |   7 +
 lib/librte_acl/acl_gen.c          |   2 +-
 lib/librte_acl/acl_run_avx512.c   | 145 +++++++
 lib/librte_acl/acl_run_avx512x8.h | 620 ++++++++++++++++++++++++++++++
 4 files changed, 773 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 2022cf253..3f0719f33 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -76,6 +76,13 @@ struct rte_acl_bitset {
  * input_byte - ((uint8_t *)&transition)[4 + input_byte / 64].
  */
 
+/*
+ * Each ACL RT contains an idle nomatch node:
+ * a SINGLE node at predefined position (RTE_ACL_DFA_SIZE)
+ * that points to itself.
+ */
+#define RTE_ACL_IDLE_NODE	(RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE)
+
 /*
  * Structure of a node is a set of ptrs and each ptr has a bit map
  * of values associated with this transition.
diff --git a/lib/librte_acl/acl_gen.c b/lib/librte_acl/acl_gen.c
index f1b9d12f1..e759a2ca1 100644
--- a/lib/librte_acl/acl_gen.c
+++ b/lib/librte_acl/acl_gen.c
@@ -496,7 +496,7 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	 * highest index, that points to itself)
 	 */
 
-	node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE;
+	node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_IDLE_NODE;
 
 	for (n = 0; n < RTE_ACL_DFA_SIZE; n++)
 		node_array[n] = no_match;
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 67274989d..353a3c004 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -4,10 +4,155 @@
 
 #include "acl_run_sse.h"
 
+/*sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results)*/
+static const uint32_t match_log = 5;
+
+struct acl_flow_avx512 {
+	uint32_t num_packets;       /* number of packets processed */
+	uint32_t total_packets;     /* max number of packets to process */
+	uint32_t root_index;        /* current root index */
+	const uint64_t *trans;      /* transition table */
+	const uint32_t *data_index; /* input data indexes */
+	const uint8_t **idata;      /* input data */
+	uint32_t *matches;          /* match indexes */
+};
+
+static inline void
+acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
+	uint32_t trie, const uint8_t *data[], uint32_t *matches,
+	uint32_t total_packets)
+{
+	flow->num_packets = 0;
+	flow->total_packets = total_packets;
+	flow->root_index = ctx->trie[trie].root_index;
+	flow->trans = ctx->trans_table;
+	flow->data_index = ctx->trie[trie].data_index;
+	flow->idata = data;
+	flow->matches = matches;
+}
+
+/*
+ * Resolve matches for multiple categories (LE 8, use 128b instructions/regs)
+ */
+static inline void
+resolve_mcle8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, j, k, mi, mn;
+	__mmask8 msk;
+	xmm_t cp, cr, np, nr;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		for (j = 0; j != nb_cat; j += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			cr = _mm_loadu_si128((const xmm_t *)(res + mi + j));
+			cp = _mm_loadu_si128((const xmm_t *)(pri + mi + j));
+
+			for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+				mn = j + (pm[k] << match_log);
+
+				nr = _mm_loadu_si128((const xmm_t *)(res + mn));
+				np = _mm_loadu_si128((const xmm_t *)(pri + mn));
+
+				msk = _mm_cmpgt_epi32_mask(cp, np);
+				cr = _mm_mask_mov_epi32(nr, msk, cr);
+				cp = _mm_mask_mov_epi32(np, msk, cp);
+			}
+
+			_mm_storeu_si128((xmm_t *)(result + j), cr);
+		}
+	}
+}
+
+/*
+ * Resolve matches for multiple categories (GT 8, use 512b instructions/regs)
+ */
+static inline void
+resolve_mcgt8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, k, mi;
+	__mmask16 cm, sm;
+	__m512i cp, cr, np, nr;
+
+	const uint32_t match_log = 5;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	cm = (1 << nb_cat) - 1;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		cr = _mm512_maskz_loadu_epi32(cm, res + mi);
+		cp = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mi = pm[k] << match_log;
+
+			nr = _mm512_maskz_loadu_epi32(cm, res + mi);
+			np = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+			sm = _mm512_cmpgt_epi32_mask(cp, np);
+			cr = _mm512_mask_mov_epi32(nr, sm, cr);
+			cp = _mm512_mask_mov_epi32(np, sm, cp);
+		}
+
+		_mm512_mask_storeu_epi32(result, cm, cr);
+	}
+}
+
+static inline ymm_t
+_m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
+{
+	__m512i t;
+	rte_ymm_t v;
+	__rte_x86_zmm_t p;
+
+	static const uint32_t zero;
+
+	t = _mm512_set1_epi64((uintptr_t)&zero);
+	p.z = _mm512_mask_mov_epi64(t, mask, pdata);
+
+	v.u32[0] = *(uint8_t *)p.u64[0];
+	v.u32[1] = *(uint8_t *)p.u64[1];
+	v.u32[2] = *(uint8_t *)p.u64[2];
+	v.u32[3] = *(uint8_t *)p.u64[3];
+	v.u32[4] = *(uint8_t *)p.u64[4];
+	v.u32[5] = *(uint8_t *)p.u64[5];
+	v.u32[6] = *(uint8_t *)p.u64[6];
+	v.u32[7] = *(uint8_t *)p.u64[7];
+
+	return v.y;
+}
+
+
+#include "acl_run_avx512x8.h"
+
 int
 rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	if (num >= MAX_SEARCHES_AVX16)
+		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
 		return search_sse_8(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE4)
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
new file mode 100644
index 000000000..66fc26b26
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -0,0 +1,620 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define NUM_AVX512X8X2	(2 * CHAR_BIT)
+#define MSK_AVX512X8X2	(NUM_AVX512X8X2 - 1)
+
+static const rte_ymm_t ymm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_ymm_t ymm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_ymm_t ymm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+	},
+};
+
+static const rte_ymm_t ymm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const rte_ymm_t ymm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const rte_ymm_t ymm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const rte_ymm_t ymm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+	},
+};
+
+static const rte_ymm_t ymm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same ways as quad range nodes.
+ */
+static __rte_always_inline ymm_t
+calc_addr8(ymm_t index_mask, ymm_t next_input, ymm_t shuffle_input,
+	ymm_t four_32, ymm_t range_base, ymm_t tr_lo, ymm_t tr_hi)
+{
+	ymm_t addr, in, node_type, r, t;
+	ymm_t dfa_msk, dfa_ofs, quad_ofs;
+
+	t = _mm256_xor_si256(index_mask, index_mask);
+	in = _mm256_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm256_andnot_si256(index_mask, tr_lo);
+	addr = _mm256_and_si256(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm256_cmpeq_epi32(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm256_srli_epi32(in, 30);
+	r = _mm256_add_epi8(r, range_base);
+	t = _mm256_srli_epi32(in, 24);
+	r = _mm256_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm256_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	t = _mm256_cmpgt_epi8(in, tr_hi);
+	t = _mm256_lzcnt_epi32(t);
+	t = _mm256_srli_epi32(t, 3);
+	quad_ofs = _mm256_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm256_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
+
+	/* calculate address for next transitions. */
+	addr = _mm256_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 8 transitions in parallel.
+ * tr_lo contains low 32 bits for 8 transitions.
+ * tr_hi contains high 32 bits for 8 transitions.
+ * next_input contains up to 4 input bytes for 8 flows.
+ */
+static __rte_always_inline ymm_t
+transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 8 transitions. */
+	addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
+		ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 8 transactions at once. */
+	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
+
+	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 8 transactions at once. */
+	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 8 flows in parallel.
+ * next_input should contain one input byte for up to 8 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static __rte_always_inline void
+first_trans8(const struct acl_flow_avx512 *flow, ymm_t next_input,
+	__mmask8 msk, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm256_set1_epi32(UINT8_MAX);
+	root = _mm256_set1_epi32(flow->root_index);
+
+	addr = _mm256_and_si256(next_input, addr);
+	addr = _mm256_add_epi32(root, addr);
+
+	/* load lower 32 bits of 8 transactions at once. */
+	*tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 8 transactions at once. */
+	*tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to 8 flows in parallel.
+ * pdata - 8 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these 8 flows.
+ */
+static inline ymm_t
+get_next_bytes_avx512x8(const struct acl_flow_avx512 *flow, __m512i pdata,
+	__mmask8 mask, ymm_t *di, uint32_t bnum)
+{
+	const int32_t *div;
+	ymm_t one, zero;
+	ymm_t inp, t;
+	__m512i p;
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm256_set1_epi32(1);
+	zero = _mm256_xor_si256(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm256_mmask_i32gather_epi32(zero, mask, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm256_mask_add_epi32(*di, mask, *di, one);
+
+	p = _mm512_cvtepu32_epi64(t);
+	p = _mm512_add_epi64(p, pdata);
+
+	/* load input byte(s), either one or four */
+	if (bnum == sizeof(uint8_t))
+		inp = _m512_mask_gather_epi8x8(p, mask);
+	else
+		inp = _mm512_mask_i64gather_epi32(zero, mask, p, NULL,
+			sizeof(uint8_t));
+	return inp;
+}
+
+/*
+ * Start up to 8 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i *pdata, ymm_t *idx, ymm_t *di)
+{
+	uint32_t nm;
+	ymm_t ni;
+	__m512i nd;
+
+	/* load input data pointers for new flows */
+	nm = (1 << num) - 1;
+	nd = _mm512_maskz_loadu_epi64(nm, flow->idata + flow->num_packets);
+
+	/* calculate match indexes of new flows */
+	ni = _mm256_set1_epi32(flow->num_packets);
+	ni = _mm256_add_epi32(ni, ymm_idx_add.y);
+
+	/* merge new and existing flows data */
+	*pdata = _mm512_mask_expand_epi64(*pdata, msk, nd);
+	*idx = _mm256_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Update flow and result masks based on the number of unprocessed flows.
+ */
+static inline uint32_t
+update_flow_mask8(const struct acl_flow_avx512 *flow, __mmask8 *fmsk,
+	__mmask8 *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
+/*
+ * Process found matches for up to 8 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexed for given flows
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static inline uint32_t
+match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
+	__mmask8 *rmsk,	__m512i *pdata, ymm_t *di, ymm_t *idx,
+	ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	uint32_t n;
+	ymm_t res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y);
+
+	/* mask  matched transitions to nop */
+	tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y);
+	tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y);
+
+	/* save found match indexes */
+	_mm256_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask8(flow, fmsk, rmsk);
+	start_flow8(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+
+static inline void
+match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, __mmask8 fm[2],
+	__m512i pdata[2], ymm_t di[2], ymm_t idx[2], ymm_t inp[2],
+	ymm_t tr_lo[2], ymm_t tr_hi[2])
+{
+	uint32_t n[2];
+	__mmask8 rm[2];
+
+	/* check for matches */
+	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
+	rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[1],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_bytes_avx512x8(flow, pdata[0], rm[0],
+				&di[0], sizeof(uint8_t));
+			first_trans8(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]);
+
+			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
+				ymm_match_mask.y);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_bytes_avx512x8(flow, pdata[1], rm[1],
+				&di[1], sizeof(uint8_t));
+			first_trans8(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]);
+
+			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
+				ymm_match_mask.y);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 16 flows in parallel.
+ * Use two sets of metadata, each serves 8 flows max.
+ * So in fact we perform search for 2x8 flows.
+ */
+static inline void
+search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
+{
+	__mmask8 fm[2];
+	__m512i pdata[2];
+	ymm_t di[2], idx[2], inp[2], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[1], &idx[1], &di[1]);
+
+	inp[0] = get_next_bytes_avx512x8(flow, pdata[0], UINT8_MAX, &di[0],
+		sizeof(uint8_t));
+	inp[1] = get_next_bytes_avx512x8(flow, pdata[1], UINT8_MAX, &di[1],
+		sizeof(uint8_t));
+
+	first_trans8(flow, inp[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans8(flow, inp[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT8_MAX;
+	fm[1] = UINT8_MAX;
+
+	/* match check */
+	match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		inp[0] = get_next_bytes_avx512x8(flow, pdata[0], fm[0],
+			&di[0], sizeof(uint32_t));
+		inp[1] = get_next_bytes_avx512x8(flow, pdata[1], fm[1],
+			&di[1], sizeof(uint32_t));
+
+		/* main 4B loop */
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline ymm_t
+resolve_match_idx_avx512x8(ymm_t mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm256_slli_epi32(mi, match_log);
+}
+
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline ymm_t
+resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask8 m;
+	ymm_t cp, cr, np, nr, mch;
+
+	const ymm_t zero = _mm256_set1_epi32(0);
+
+	mch = _mm256_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x8(mch);
+
+	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm256_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x8(mch);
+
+		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm256_cmpgt_epi32_mask(cp, np);
+		cr = _mm256_mask_mov_epi32(nr, m, cr);
+		cp = _mm256_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 8) matches for single category
+ */
+static inline void
+resolve_sc_avx512x8(uint32_t result[], const int32_t res[], const int32_t pri[],
+	const uint32_t match[], uint32_t nb_pkt, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	__mmask8 msk;
+	ymm_t cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
+	_mm256_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x8x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t i, j, k, n;
+	const uint32_t *pm;
+	const int32_t *res, *pri;
+	__mmask8 m[2];
+	ymm_t cp[2], cr[2], np[2], nr[2], mch[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
+
+		j = k + CHAR_BIT;
+
+		/* load match indexes for first trie */
+		mch[0] = _mm256_loadu_si256((const ymm_t *)(match + k));
+		mch[1] = _mm256_loadu_si256((const ymm_t *)(match + j));
+
+		mch[0] = resolve_match_idx_avx512x8(mch[0]);
+		mch[1] = resolve_match_idx_avx512x8(mch[1]);
+
+		/* load matches and their priorities for first trie */
+
+		cr[0] = _mm256_i32gather_epi32(res, mch[0], sizeof(res[0]));
+		cr[1] = _mm256_i32gather_epi32(res, mch[1], sizeof(res[0]));
+
+		cp[0] = _mm256_i32gather_epi32(pri, mch[0], sizeof(pri[0]));
+		cp[1] = _mm256_i32gather_epi32(pri, mch[1], sizeof(pri[0]));
+
+		/* select match with highest priority */
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mch[0] = _mm256_loadu_si256((const ymm_t *)(pm + k));
+			mch[1] = _mm256_loadu_si256((const ymm_t *)(pm + j));
+
+			mch[0] = resolve_match_idx_avx512x8(mch[0]);
+			mch[1] = resolve_match_idx_avx512x8(mch[1]);
+
+			nr[0] = _mm256_i32gather_epi32(res, mch[0],
+				sizeof(res[0]));
+			nr[1] = _mm256_i32gather_epi32(res, mch[1],
+				sizeof(res[0]));
+
+			np[0] = _mm256_i32gather_epi32(pri, mch[0],
+				sizeof(pri[0]));
+			np[1] = _mm256_i32gather_epi32(pri, mch[1],
+				sizeof(pri[0]));
+
+			m[0] = _mm256_cmpgt_epi32_mask(cp[0], np[0]);
+			m[1] = _mm256_cmpgt_epi32_mask(cp[1], np[1]);
+
+			cr[0] = _mm256_mask_mov_epi32(nr[0], m[0], cr[0]);
+			cr[1] = _mm256_mask_mov_epi32(nr[1], m[1], cr[1]);
+
+			cp[0] = _mm256_mask_mov_epi32(np[0], m[0], cp[0]);
+			cp[1] = _mm256_mask_mov_epi32(np[1], m[1], cp[1]);
+		}
+
+		_mm256_storeu_si256((ymm_t *)(result + k), cr[0]);
+		_mm256_storeu_si256((ymm_t *)(result + j), cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > CHAR_BIT) {
+			resolve_sc_avx512x8(result + k, res, pri, match + k,
+				CHAR_BIT, nb_trie, nb_pkt);
+			k += CHAR_BIT;
+			n -= CHAR_BIT;
+		}
+		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x8x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 09/12] acl: enhance AVX512 classify implementation
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (7 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 08/12] acl: introduce AVX512 classify implementation Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 10/12] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
                     ` (3 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add search_avx512x16x2() which uses mostly 512-bit width
registers/instructions and is able to process up to 32 flows in
parallel. That allows further speedup of rte_acl_classify_avx512()
for bursts with 32+ requests.
Run-time code-path selection is done internally based
on input burst size and is totally opaque to the user.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

These patch depends on:
https://patches.dpdk.org/patch/73922/mbox/
to be applied first.

 .../prog_guide/packet_classif_access_ctrl.rst |   9 +
 doc/guides/rel_notes/release_20_11.rst        |   5 +
 lib/librte_acl/acl_run_avx512.c               | 162 ++++++
 lib/librte_acl/acl_run_avx512x16.h            | 526 ++++++++++++++++++
 lib/librte_acl/acl_run_avx512x8.h             | 195 +------
 5 files changed, 709 insertions(+), 188 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index daf03e6d7..f6c64fbd9 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -379,10 +379,19 @@ There are several implementations of classify algorithm:
 *   **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8
     flows in parallel. Requires ALTIVEC support.
 
+*   **RTE_ACL_CLASSIFY_AVX512**: vector implementation, can process up to 32
+    flows in parallel. Requires AVX512 support.
+
 It is purely a runtime decision which method to choose, there is no build-time difference.
 All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation.
 
+.. note::
+
+     Right now ``RTE_ACL_CLASSIFY_AVX512`` is not selected by default
+     (due to possible frequency level change), but it can be selected at
+     runtime by apps through the use of ACL API: ``rte_acl_set_ctx_classify``.
+
 Application Programming Interface (API) Usage
 ---------------------------------------------
 
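For illustration only (this sketch is not part of the patch): given the note
above, an application that wants the AVX512 classifier opts in per context and
keeps the library's default algorithm when the platform or build can't support
it. The helper name and the context variable "acx" below are hypothetical; the
API calls themselves (rte_acl_set_ctx_classify(), RTE_ACL_CLASSIFY_AVX512) are
the ones added/used by this series.

#include <errno.h>
#include <stdio.h>
#include <rte_acl.h>

/* Sketch: prefer the AVX512 classifier for this context when available,
 * otherwise keep whatever default algorithm the library selected at startup.
 */
static void
use_avx512_if_available(struct rte_acl_ctx *acx)
{
	int ret;

	ret = rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_AVX512);
	if (ret == -ENOTSUP)
		printf("AVX512 classify not supported, keeping default\n");
}

Classification itself is unchanged: the application still calls
rte_acl_classify() with whatever burst size it has, and the internal code-path
selection described in this patch stays opaque to the caller.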
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index a9a1b0305..acdd12ef9 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Add new AVX512 specific classify algorithm for ACL library.**
+
+  Added new ``RTE_ACL_CLASSIFY_AVX512`` vector implementation,
+  which can process up to 32 flows in parallel. Requires AVX512 support.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 353a3c004..60762b7d6 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -4,6 +4,11 @@
 
 #include "acl_run_sse.h"
 
+#define	MASK16_BIT	(sizeof(__mmask16) * CHAR_BIT)
+
+#define NUM_AVX512X16X2	(2 * MASK16_BIT)
+#define MSK_AVX512X16X2	(NUM_AVX512X16X2 - 1)
+
 /*sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results)*/
 static const uint32_t match_log = 5;
 
@@ -31,6 +36,36 @@ acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
 	flow->matches = matches;
 }
 
+/*
+ * Update flow and result masks based on the number of unprocessed flows.
+ */
+static inline uint32_t
+update_flow_mask(const struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
 /*
  * Resolve matches for multiple categories (LE 8, use 128b instuctions/regs)
  */
@@ -144,13 +179,140 @@ _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 	return v.y;
 }
 
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline __m512i
+resolve_match_idx_avx512x16(__m512i mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm512_slli_epi32(mi, match_log);
+}
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline __m512i
+resolve_pri_avx512x16(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask16 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask16 m;
+	__m512i cp, cr, np, nr, mch;
+
+	const __m512i zero = _mm512_set1_epi32(0);
+
+	/* get match indexes */
+	mch = _mm512_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x16(mch);
+
+	/* read result and priority values for first trie */
+	cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	/*
+	 * read result and priority values for next tries and select one
+	 * with highest priority.
+	 */
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm512_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x16(mch);
+
+		nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm512_cmpgt_epi32_mask(cp, np);
+		cr = _mm512_mask_mov_epi32(nr, m, cr);
+		cp = _mm512_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 16) matches for single category
+ */
+static inline void
+resolve_sc_avx512x16(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	__mmask16 msk;
+	__m512i cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip);
+	_mm512_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x16x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t j, k, n;
+	const int32_t *res, *pri;
+	__m512i cr[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) {
+
+		j = k + MASK16_BIT;
+
+		cr[0] = resolve_pri_avx512x16(res, pri, match + k, UINT16_MAX,
+				nb_trie, nb_pkt);
+		cr[1] = resolve_pri_avx512x16(res, pri, match + j, UINT16_MAX,
+				nb_trie, nb_pkt);
+
+		_mm512_storeu_si512(result + k, cr[0]);
+		_mm512_storeu_si512(result + j, cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > MASK16_BIT) {
+			resolve_sc_avx512x16(result + k, res, pri, match + k,
+				MASK16_BIT, nb_trie, nb_pkt);
+			k += MASK16_BIT;
+			n -= MASK16_BIT;
+		}
+		resolve_sc_avx512x16(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
 
 #include "acl_run_avx512x8.h"
+#include "acl_run_avx512x16.h"
 
 int
 rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	const uint32_t max_iter = MAX_SEARCHES_AVX16 * MAX_SEARCHES_AVX16;
+
+	/* split huge lookup (gt 256) into series of fixed size ones */
+	while (num > max_iter) {
+		search_avx512x16x2(ctx, data, results, max_iter, categories);
+		data += max_iter;
+		results += max_iter * categories;
+		num -= max_iter;
+	}
+
+	/* select classify method based on number of remaining requests */
+	if (num >= 2 * MAX_SEARCHES_AVX16)
+		return search_avx512x16x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_AVX16)
 		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
new file mode 100644
index 000000000..45b0b4db6
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -0,0 +1,526 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+static const __rte_x86_zmm_t zmm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+		8, 9, 10, 11,
+		12, 13, 14, 15,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same way as quad range nodes.
+ */
+static __rte_always_inline __m512i
+calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input,
+	__m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi)
+{
+	__mmask64 qm;
+	__mmask16 dfa_msk;
+	__m512i addr, in, node_type, r, t;
+	__m512i dfa_ofs, quad_ofs;
+
+	t = _mm512_xor_si512(index_mask, index_mask);
+	in = _mm512_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm512_andnot_si512(index_mask, tr_lo);
+	addr = _mm512_and_si512(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm512_srli_epi32(in, 30);
+	r = _mm512_add_epi8(r, range_base);
+	t = _mm512_srli_epi32(in, 24);
+	r = _mm512_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm512_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _mm512_cmpgt_epi8_mask(in, tr_hi);
+	t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
+	t = _mm512_lzcnt_epi32(t);
+	t = _mm512_srli_epi32(t, 3);
+	quad_ofs = _mm512_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _mm512_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 16 transitions in parallel.
+ * tr_lo contains low 32 bits for 16 transition.
+ * tr_hi contains high 32 bits for 16 transition.
+ * next_input contains up to 4 input bytes for 16 flows.
+ */
+static __rte_always_inline __m512i
+transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo,
+	__m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 16 transitions. */
+	addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z,
+		zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 16 transitions at once. */
+	*tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0]));
+
+	next_input = _mm512_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 16 transitions at once. */
+	*tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 16 flows in parallel.
+ * next_input should contain one input byte for up to 16 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static __rte_always_inline void
+first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input,
+	__mmask16 msk, __m512i *tr_lo, __m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm512_set1_epi32(UINT8_MAX);
+	root = _mm512_set1_epi32(flow->root_index);
+
+	addr = _mm512_and_si512(next_input, addr);
+	addr = _mm512_add_epi32(root, addr);
+
+	/* load lower 32 bits of 16 transitions at once. */
+	*tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 16 transitions at once. */
+	*tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to 16 flows in parallel.
+ * pdata - 8x2 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these 16 flows.
+ */
+static inline __m512i
+get_next_bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2],
+	uint32_t msk, __m512i *di, uint32_t bnum)
+{
+	const int32_t *div;
+	__m512i one, zero, t, p[2];
+	ymm_t inp[2];
+
+	static const __rte_x86_zmm_t zmm_pminp = {
+		.u32 = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
+		},
+	};
+
+	const __mmask16 pmidx_msk = 0x5555;
+
+	static const __rte_x86_zmm_t zmm_pmidx[2] = {
+		[0] = {
+			.u32 = {
+				0, 0, 1, 0, 2, 0, 3, 0,
+				4, 0, 5, 0, 6, 0, 7, 0,
+			},
+		},
+		[1] = {
+			.u32 = {
+				8, 0, 9, 0, 10, 0, 11, 0,
+				12, 0, 13, 0, 14, 0, 15, 0,
+			},
+		},
+	};
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm512_set1_epi32(1);
+	zero = _mm512_xor_si512(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm512_mask_add_epi32(*di, msk, *di, one);
+
+	/*
+	 * unsigned expand 32-bit indexes to 64-bit
+	 * (for later pointer arithmetic), i.e:
+	 * for (i = 0; i != 16; i++)
+	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
+	 */
+	p[0] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[0].z, t);
+	p[1] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[1].z, t);
+
+	p[0] = _mm512_add_epi64(p[0], pdata[0]);
+	p[1] = _mm512_add_epi64(p[1], pdata[1]);
+
+	/* load input byte(s), either one or four */
+	if (bnum == sizeof(uint8_t)) {
+		inp[0] = _m512_mask_gather_epi8x8(p[0], (msk & UINT8_MAX));
+		inp[1] = _m512_mask_gather_epi8x8(p[1], (msk >> CHAR_BIT));
+	} else {
+		inp[0] = _mm512_mask_i64gather_epi32(
+				_mm512_castsi512_si256(zero), (msk & UINT8_MAX),
+				p[0], NULL, sizeof(uint8_t));
+		inp[1] = _mm512_mask_i64gather_epi32(
+				_mm512_castsi512_si256(zero), (msk >> CHAR_BIT),
+				p[1], NULL, sizeof(uint8_t));
+	}
+
+	/* squeeze input into one 512-bit register */
+	return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]),
+			zmm_pminp.z, _mm512_castsi256_si512(inp[1]));
+}
+
+/*
+ * Start up to 16 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * idx - match indexes for given flows
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i pdata[2], __m512i *idx, __m512i *di)
+{
+	uint32_t n, nm[2];
+	__m512i ni, nd[2];
+
+	/* load input data pointers for new flows */
+	n = __builtin_popcount(msk & UINT8_MAX);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	nd[0] = _mm512_maskz_loadu_epi64(nm[0],
+		flow->idata + flow->num_packets);
+	nd[1] = _mm512_maskz_loadu_epi64(nm[1],
+		flow->idata + flow->num_packets + n);
+
+	/* calculate match indexes of new flows */
+	ni = _mm512_set1_epi32(flow->num_packets);
+	ni = _mm512_add_epi32(ni, zmm_idx_add.z);
+
+	/* merge new and existing flows data */
+	pdata[0] = _mm512_mask_expand_epi64(pdata[0], (msk & UINT8_MAX), nd[0]);
+	pdata[1] = _mm512_mask_expand_epi64(pdata[1], (msk >> CHAR_BIT), nd[1]);
+
+	/* update match and data indexes */
+	*idx = _mm512_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Process found matches for up to 16 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexes for given flows
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static inline uint32_t
+match_process_avx512x16(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx,
+	__m512i *tr_lo, __m512i *tr_hi)
+{
+	uint32_t n;
+	__m512i res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z);
+
+	/* mask  matched transitions to nop */
+	tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z);
+	tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z);
+
+	/* save found match indexes */
+	_mm512_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask(flow, fmsk, rmsk);
+	start_flow16(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+/*
+ * Test for matches for up to 32 (2x16) flows at once,
+ * if matches exist - process them and start new flows.
+ */
+static inline void
+match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
+	__m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2],
+	__m512i tr_lo[2], __m512i tr_hi[2])
+{
+	uint32_t n[2];
+	uint32_t rm[2];
+
+	/* check for matches */
+	rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z);
+	rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
+				rm[0], &di[0], sizeof(uint8_t));
+			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
+				&tr_hi[0]);
+			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
+				zmm_match_mask.z);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
+				rm[1], &di[1], sizeof(uint8_t));
+			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
+				&tr_hi[1]);
+			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
+				zmm_match_mask.z);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 32 flows in parallel.
+ * Use two sets of metadata, each serves 16 flows max.
+ * So in fact we perform search for 2x16 flows.
+ */
+static inline void
+search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
+{
+	uint32_t fm[2];
+	__m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
+
+	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
+		sizeof(uint8_t));
+	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
+		sizeof(uint8_t));
+
+	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT16_MAX;
+	fm[1] = UINT16_MAX;
+
+	/* match check */
+	match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = get_next_bytes_avx512x16(flow, &pdata[0], fm[0],
+			&di[0], sizeof(uint32_t));
+		in[1] = get_next_bytes_avx512x16(flow, &pdata[2], fm[1],
+			&di[1], sizeof(uint32_t));
+
+		/* main 4B loop */
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+static inline int
+search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x16x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x16x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
index 66fc26b26..82171e8e0 100644
--- a/lib/librte_acl/acl_run_avx512x8.h
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -260,36 +260,6 @@ start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
 	flow->num_packets += num;
 }
 
-/*
- * Update flow and result masks based on the number of unprocessed flows.
- */
-static inline uint32_t
-update_flow_mask8(const struct acl_flow_avx512 *flow, __mmask8 *fmsk,
-	__mmask8 *rmsk)
-{
-	uint32_t i, j, k, m, n;
-
-	fmsk[0] ^= rmsk[0];
-	m = rmsk[0];
-
-	k = __builtin_popcount(m);
-	n = flow->total_packets - flow->num_packets;
-
-	if (n < k) {
-		/* reduce mask */
-		for (i = k - n; i != 0; i--) {
-			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
-			m ^= 1 << j;
-		}
-	} else
-		n = k;
-
-	rmsk[0] = m;
-	fmsk[0] |= rmsk[0];
-
-	return n;
-}
-
 /*
  * Process found matches for up to 8 flows.
  * fmsk - mask of active flows
@@ -301,8 +271,8 @@ update_flow_mask8(const struct acl_flow_avx512 *flow, __mmask8 *fmsk,
  * tr_hi contains high 32 bits for up to 8 transitions.
  */
 static inline uint32_t
-match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
-	__mmask8 *rmsk,	__m512i *pdata, ymm_t *di, ymm_t *idx,
+match_process_avx512x8(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk,	__m512i *pdata, ymm_t *di, ymm_t *idx,
 	ymm_t *tr_lo, ymm_t *tr_hi)
 {
 	uint32_t n;
@@ -323,7 +293,7 @@ match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
 		idx[0], res, sizeof(flow->matches[0]));
 
 	/* update masks and start new flows for matches */
-	n = update_flow_mask8(flow, fmsk, rmsk);
+	n = update_flow_mask(flow, fmsk, rmsk);
 	start_flow8(flow, n, rmsk[0], pdata, idx, di);
 
 	return n;
@@ -331,12 +301,12 @@ match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
 
 
 static inline void
-match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, __mmask8 fm[2],
+match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 	__m512i pdata[2], ymm_t di[2], ymm_t idx[2], ymm_t inp[2],
 	ymm_t tr_lo[2], ymm_t tr_hi[2])
 {
 	uint32_t n[2];
-	__mmask8 rm[2];
+	uint32_t rm[2];
 
 	/* check for matches */
 	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
@@ -381,7 +351,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, __mmask8 fm[2],
 static inline void
 search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
 {
-	__mmask8 fm[2];
+	uint32_t fm[2];
 	__m512i pdata[2];
 	ymm_t di[2], idx[2], inp[2], tr_lo[2], tr_hi[2];
 
@@ -433,157 +403,6 @@ search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
 	}
 }
 
-/*
- * resolve match index to actual result/priority offset.
- */
-static inline ymm_t
-resolve_match_idx_avx512x8(ymm_t mi)
-{
-	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
-		1 << (match_log + 2));
-	return _mm256_slli_epi32(mi, match_log);
-}
-
-
-/*
- * Resolve multiple matches for the same flow based on priority.
- */
-static inline ymm_t
-resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
-	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
-	uint32_t nb_skip)
-{
-	uint32_t i;
-	const uint32_t *pm;
-	__mmask8 m;
-	ymm_t cp, cr, np, nr, mch;
-
-	const ymm_t zero = _mm256_set1_epi32(0);
-
-	mch = _mm256_maskz_loadu_epi32(msk, match);
-	mch = resolve_match_idx_avx512x8(mch);
-
-	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
-	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
-
-	for (i = 1, pm = match + nb_skip; i != nb_trie;
-			i++, pm += nb_skip) {
-
-		mch = _mm256_maskz_loadu_epi32(msk, pm);
-		mch = resolve_match_idx_avx512x8(mch);
-
-		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
-			sizeof(res[0]));
-		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
-			sizeof(pri[0]));
-
-		m = _mm256_cmpgt_epi32_mask(cp, np);
-		cr = _mm256_mask_mov_epi32(nr, m, cr);
-		cp = _mm256_mask_mov_epi32(np, m, cp);
-	}
-
-	return cr;
-}
-
-/*
- * Resolve num (<= 8) matches for single category
- */
-static inline void
-resolve_sc_avx512x8(uint32_t result[], const int32_t res[], const int32_t pri[],
-	const uint32_t match[], uint32_t nb_pkt, uint32_t nb_trie,
-	uint32_t nb_skip)
-{
-	__mmask8 msk;
-	ymm_t cr;
-
-	msk = (1 << nb_pkt) - 1;
-	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
-	_mm256_mask_storeu_epi32(result, msk, cr);
-}
-
-/*
- * Resolve matches for single category
- */
-static inline void
-resolve_sc_avx512x8x2(uint32_t result[],
-	const struct rte_acl_match_results pr[], const uint32_t match[],
-	uint32_t nb_pkt, uint32_t nb_trie)
-{
-	uint32_t i, j, k, n;
-	const uint32_t *pm;
-	const int32_t *res, *pri;
-	__mmask8 m[2];
-	ymm_t cp[2], cr[2], np[2], nr[2], mch[2];
-
-	res = (const int32_t *)pr->results;
-	pri = pr->priority;
-
-	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
-
-		j = k + CHAR_BIT;
-
-		/* load match indexes for first trie */
-		mch[0] = _mm256_loadu_si256((const ymm_t *)(match + k));
-		mch[1] = _mm256_loadu_si256((const ymm_t *)(match + j));
-
-		mch[0] = resolve_match_idx_avx512x8(mch[0]);
-		mch[1] = resolve_match_idx_avx512x8(mch[1]);
-
-		/* load matches and their priorities for first trie */
-
-		cr[0] = _mm256_i32gather_epi32(res, mch[0], sizeof(res[0]));
-		cr[1] = _mm256_i32gather_epi32(res, mch[1], sizeof(res[0]));
-
-		cp[0] = _mm256_i32gather_epi32(pri, mch[0], sizeof(pri[0]));
-		cp[1] = _mm256_i32gather_epi32(pri, mch[1], sizeof(pri[0]));
-
-		/* select match with highest priority */
-		for (i = 1, pm = match + nb_pkt; i != nb_trie;
-				i++, pm += nb_pkt) {
-
-			mch[0] = _mm256_loadu_si256((const ymm_t *)(pm + k));
-			mch[1] = _mm256_loadu_si256((const ymm_t *)(pm + j));
-
-			mch[0] = resolve_match_idx_avx512x8(mch[0]);
-			mch[1] = resolve_match_idx_avx512x8(mch[1]);
-
-			nr[0] = _mm256_i32gather_epi32(res, mch[0],
-				sizeof(res[0]));
-			nr[1] = _mm256_i32gather_epi32(res, mch[1],
-				sizeof(res[0]));
-
-			np[0] = _mm256_i32gather_epi32(pri, mch[0],
-				sizeof(pri[0]));
-			np[1] = _mm256_i32gather_epi32(pri, mch[1],
-				sizeof(pri[0]));
-
-			m[0] = _mm256_cmpgt_epi32_mask(cp[0], np[0]);
-			m[1] = _mm256_cmpgt_epi32_mask(cp[1], np[1]);
-
-			cr[0] = _mm256_mask_mov_epi32(nr[0], m[0], cr[0]);
-			cr[1] = _mm256_mask_mov_epi32(nr[1], m[1], cr[1]);
-
-			cp[0] = _mm256_mask_mov_epi32(np[0], m[0], cp[0]);
-			cp[1] = _mm256_mask_mov_epi32(np[1], m[1], cp[1]);
-		}
-
-		_mm256_storeu_si256((ymm_t *)(result + k), cr[0]);
-		_mm256_storeu_si256((ymm_t *)(result + j), cr[1]);
-	}
-
-	n = nb_pkt - k;
-	if (n != 0) {
-		if (n > CHAR_BIT) {
-			resolve_sc_avx512x8(result + k, res, pri, match + k,
-				CHAR_BIT, nb_trie, nb_pkt);
-			k += CHAR_BIT;
-			n -= CHAR_BIT;
-		}
-		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
-				nb_trie, nb_pkt);
-	}
-}
-
 static inline int
 search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t total_packets, uint32_t categories)
@@ -607,7 +426,7 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		(ctx->trans_table + ctx->match_index);
 
 	if (categories == 1)
-		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+		resolve_sc_avx512x16x2(results, pr, match, total_packets,
 			ctx->num_tries);
 	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
 		resolve_mcle8_avx512x1(results, pr, match, total_packets,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 10/12] acl: for AVX512 classify use 4B load whenever possible
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (8 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 09/12] acl: enhance " Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 11/12] test/acl: add AVX512 classify support Konstantin Ananyev
                     ` (2 subsequent siblings)
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

With the current ACL implementation the first field in the rule definition
always has to be one byte long. Though for optimising the classify
implementation it might be useful to be able to use 4B reads
(as we do for the rest of the fields).
So at build phase, check the user-provided field definitions to determine
whether it is safe to do 4B loads for the first ACL field.
Then at run-time this information can be used to choose the classify
behavior.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h               |  1 +
 lib/librte_acl/acl_bld.c           | 34 ++++++++++++++++++++++++++++++
 lib/librte_acl/acl_run_avx512.c    |  7 ++++++
 lib/librte_acl/acl_run_avx512x16.h |  8 +++----
 lib/librte_acl/acl_run_avx512x8.h  |  8 +++----
 lib/librte_acl/rte_acl.c           |  1 +
 6 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 3f0719f33..493dec2a2 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -169,6 +169,7 @@ struct rte_acl_ctx {
 	int32_t             socket_id;
 	/** Socket ID to allocate memory from. */
 	enum rte_acl_classify_alg alg;
+	uint32_t           first_load_sz;
 	void               *rules;
 	uint32_t            max_rules;
 	uint32_t            rule_sz;
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index d1f920b09..da10864cd 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -1581,6 +1581,37 @@ acl_check_bld_param(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 	return 0;
 }
 
+/*
+ * With the current ACL implementation the first field in the rule definition
+ * always has to be one byte long. Though for optimising the *classify*
+ * implementation it might be useful to be able to use 4B reads
+ * (as we do for the rest of the fields).
+ * This function checks the input config to determine whether it is safe to
+ * do 4B loads for the first ACL field. For that we need to make sure that
+ * the first field in our rule definition doesn't have the biggest offset,
+ * i.e. we still have other fields located after the first one.
+ * Conversely, if the first field has the largest offset, then it
+ * can occupy the very last byte in the input data buffer,
+ * and we have to do a single byte load for it.
+ */
+static uint32_t
+get_first_load_size(const struct rte_acl_config *cfg)
+{
+	uint32_t i, max_ofs, ofs;
+
+	ofs = 0;
+	max_ofs = 0;
+
+	for (i = 0; i != cfg->num_fields; i++) {
+		if (cfg->defs[i].field_index == 0)
+			ofs = cfg->defs[i].offset;
+		else if (max_ofs < cfg->defs[i].offset)
+			max_ofs = cfg->defs[i].offset;
+	}
+
+	return (ofs < max_ofs) ? sizeof(uint32_t) : sizeof(uint8_t);
+}
+
 int
 rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 {
@@ -1618,6 +1649,9 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 				/* set data indexes. */
 				acl_set_data_indexes(ctx);
 
+				/* determine whether we can always do a 4B load */
+				ctx->first_load_sz = get_first_load_size(cfg);
+
 				/* copy in build config. */
 				ctx->config = *cfg;
 			}
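To make the check above concrete, here is a purely illustrative field layout
(hypothetical, not taken from the patch). Field 0 is a one-byte value at
offset 0 while the remaining fields sit at larger offsets, so
get_first_load_size() would return sizeof(uint32_t) and the classifier can use
a 4B first load; if field 0 instead had the largest offset, it would fall back
to sizeof(uint8_t).

#include <rte_acl.h>

/* Hypothetical 5-tuple-like layout, for illustration only: the first field
 * (offset 0) can never occupy the last byte of the input buffer because
 * other fields are located after it.
 */
static const struct rte_acl_field_def sketch_defs[] = {
	{ .type = RTE_ACL_FIELD_TYPE_BITMASK, .size = sizeof(uint8_t),
	  .field_index = 0, .input_index = 0, .offset = 0, },
	{ .type = RTE_ACL_FIELD_TYPE_MASK, .size = sizeof(uint32_t),
	  .field_index = 1, .input_index = 1, .offset = 4, },
	{ .type = RTE_ACL_FIELD_TYPE_MASK, .size = sizeof(uint32_t),
	  .field_index = 2, .input_index = 2, .offset = 8, },
	{ .type = RTE_ACL_FIELD_TYPE_RANGE, .size = sizeof(uint16_t),
	  .field_index = 3, .input_index = 3, .offset = 12, },
	{ .type = RTE_ACL_FIELD_TYPE_RANGE, .size = sizeof(uint16_t),
	  .field_index = 4, .input_index = 3, .offset = 14, },
};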
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 60762b7d6..51bfa6a3b 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -16,6 +16,7 @@ struct acl_flow_avx512 {
 	uint32_t num_packets;       /* number of packets processed */
 	uint32_t total_packets;     /* max number of packets to process */
 	uint32_t root_index;        /* current root index */
+	uint32_t first_load_sz;     /* first load size for new packet */
 	const uint64_t *trans;      /* transition table */
 	const uint32_t *data_index; /* input data indexes */
 	const uint8_t **idata;      /* input data */
@@ -29,6 +30,7 @@ acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
 {
 	flow->num_packets = 0;
 	flow->total_packets = total_packets;
+	flow->first_load_sz = ctx->first_load_sz;
 	flow->root_index = ctx->trie[trie].root_index;
 	flow->trans = ctx->trans_table;
 	flow->data_index = ctx->trie[trie].data_index;
@@ -155,6 +157,11 @@ resolve_mcgt8_avx512x1(uint32_t result[],
 	}
 }
 
+/*
+ * Unfortunately the current AVX512 ISA doesn't provide the ability to
+ * gather load on a byte quantity. So we have to mimic it in SW,
+ * by doing 8x1B scalar loads.
+ */
 static inline ymm_t
 _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 {
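As an aside, the 8x1B software gather described in the comment above boils
down to the following scalar behaviour. This is only an illustration of the
idea (an assumption, not the DPDK implementation): for every set bit in the
mask, read one byte through the corresponding pointer.

#include <stdint.h>

/* Scalar equivalent of a masked 8x1B gather: out[i] gets the byte at p[i]
 * when bit i of mask is set, and 0 otherwise.
 */
static void
gather_8x1b_scalar(uint32_t out[8], const uint8_t *p[8], uint8_t mask)
{
	uint32_t i;

	for (i = 0; i != 8; i++)
		out[i] = (mask & (1u << i)) ? *p[i] : 0;
}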
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
index 45b0b4db6..df5f6135f 100644
--- a/lib/librte_acl/acl_run_avx512x16.h
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -413,7 +413,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[0] != 0) {
 			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
-				rm[0], &di[0], sizeof(uint8_t));
+				rm[0], &di[0], flow->first_load_sz);
 			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
 				&tr_hi[0]);
 			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
@@ -422,7 +422,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[1] != 0) {
 			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
-				rm[1], &di[1], sizeof(uint8_t));
+				rm[1], &di[1], flow->first_load_sz);
 			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
 				&tr_hi[1]);
 			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
@@ -447,9 +447,9 @@ search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
 	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
 
 	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
-		sizeof(uint8_t));
+			flow->first_load_sz);
 	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
-		sizeof(uint8_t));
+			flow->first_load_sz);
 
 	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
 	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
index 82171e8e0..777451973 100644
--- a/lib/librte_acl/acl_run_avx512x8.h
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -325,7 +325,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[0] != 0) {
 			inp[0] = get_next_bytes_avx512x8(flow, pdata[0], rm[0],
-				&di[0], sizeof(uint8_t));
+				&di[0], flow->first_load_sz);
 			first_trans8(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]);
 
 			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
@@ -334,7 +334,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[1] != 0) {
 			inp[1] = get_next_bytes_avx512x8(flow, pdata[1], rm[1],
-				&di[1], sizeof(uint8_t));
+				&di[1], flow->first_load_sz);
 			first_trans8(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]);
 
 			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
@@ -360,9 +360,9 @@ search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
 	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[1], &idx[1], &di[1]);
 
 	inp[0] = get_next_bytes_avx512x8(flow, pdata[0], UINT8_MAX, &di[0],
-		sizeof(uint8_t));
+			flow->first_load_sz);
 	inp[1] = get_next_bytes_avx512x8(flow, pdata[1], UINT8_MAX, &di[1],
-		sizeof(uint8_t));
+			flow->first_load_sz);
 
 	first_trans8(flow, inp[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
 	first_trans8(flow, inp[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index fdcb7a798..9f16d28ea 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -486,6 +486,7 @@ rte_acl_dump(const struct rte_acl_ctx *ctx)
 	printf("acl context <%s>@%p\n", ctx->name, ctx);
 	printf("  socket_id=%"PRId32"\n", ctx->socket_id);
 	printf("  alg=%"PRId32"\n", ctx->alg);
+	printf("  first_load_sz=%"PRIu32"\n", ctx->first_load_sz);
 	printf("  max_rules=%"PRIu32"\n", ctx->max_rules);
 	printf("  rule_size=%"PRIu32"\n", ctx->rule_sz);
 	printf("  num_rules=%"PRIu32"\n", ctx->num_rules);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 11/12] test/acl: add AVX512 classify support
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (9 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 10/12] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 12/12] app/acl: " Konstantin Ananyev
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add AVX512 classify to the test coverage.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test/test_acl.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 333b34757..11d69d2d5 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -278,8 +278,8 @@ test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 
 	/* set given classify alg, skip test if alg is not supported */
 	ret = rte_acl_set_ctx_classify(acx, alg);
-	if (ret == -ENOTSUP)
-		return 0;
+	if (ret != 0)
+		return (ret == -ENOTSUP) ? 0 : ret;
 
 	/**
 	 * these will run quite a few times, it's necessary to test code paths
@@ -341,6 +341,7 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 		RTE_ACL_CLASSIFY_AVX2,
 		RTE_ACL_CLASSIFY_NEON,
 		RTE_ACL_CLASSIFY_ALTIVEC,
+		RTE_ACL_CLASSIFY_AVX512,
 	};
 
 	/* swap all bytes in the data to network order */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v2 12/12] app/acl: add AVX512 classify support
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (10 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 11/12] test/acl: add AVX512 classify support Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
  12 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add ability to use AVX512 classify method.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d9b65517c..19b714335 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -81,6 +81,10 @@ static const struct acl_alg acl_alg[] = {
 		.name = "altivec",
 		.alg = RTE_ACL_CLASSIFY_ALTIVEC,
 	},
+	{
+		.name = "avx512",
+		.alg = RTE_ACL_CLASSIFY_AVX512,
+	},
 };
 
 static struct {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
@ 2020-09-16  9:11     ` Bruce Richardson
  2020-09-16  9:36       ` Medvedkin, Vladimir
  0 siblings, 1 reply; 70+ messages in thread
From: Bruce Richardson @ 2020-09-16  9:11 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev, jerinj, ruifeng.wang, vladimir.medvedkin

On Tue, Sep 15, 2020 at 05:50:20PM +0100, Konstantin Ananyev wrote:
> Add necessary changes to support new AVX512 specific ACL classify
> algorithm:
>  - changes in meson.build to check that build tools
>    (compiler, assembler, etc.) do properly support AVX512.
>  - run-time checks to make sure target platform does support AVX512.
>  - dummy rte_acl_classify_avx512() for targets where AVX512
>    implementation couldn't be properly supported.
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---

This all looks correct, though I wonder do you really need to check all
those AVX512 flags in each case? Since "F" is always present in any AVX512
implementation perhaps it can be checked, though if the other three always
need to be checked I can understand if you want to keep it there for
completeness. [Are all the other 3 used in your code?]

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-16  9:11     ` Bruce Richardson
@ 2020-09-16  9:36       ` Medvedkin, Vladimir
  2020-09-16  9:49         ` Bruce Richardson
  0 siblings, 1 reply; 70+ messages in thread
From: Medvedkin, Vladimir @ 2020-09-16  9:36 UTC (permalink / raw)
  To: Bruce Richardson, Konstantin Ananyev; +Cc: dev, jerinj, ruifeng.wang

Hi Bruce,

On 16/09/2020 10:11, Bruce Richardson wrote:
> On Tue, Sep 15, 2020 at 05:50:20PM +0100, Konstantin Ananyev wrote:
>> Add necessary changes to support new AVX512 specific ACL classify
>> algorithm:
>>   - changes in meson.build to check that build tools
>>     (compiler, assembler, etc.) do properly support AVX512.
>>   - run-time checks to make sure target platform does support AVX512.
>>   - dummy rte_acl_classify_avx512() for targets where AVX512
>>     implementation couldn't be properly supported.
>>
>> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>> ---
> 
> This all looks correct, though I wonder do you really need to check all
> those AVX512 flags in each case? Since "F" is always present in any AVX512
> implementation perhaps it can be checked, though if the other three always
> need to be checked I can understand if you want to keep it there for
> completeness. [Are all the other 3 used in your code?]
> 

As for me it is good to check all the flags supported by compiler. Some 
old (but still supported by dpdk) gcc can't compile the code in some 
circumstances. For example:

gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)   <-- pretty 
old but still supported, right?

gcc -march=native -dM -E - < /dev/null | grep "AVX512"
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1

Does not support __AVX512VL__

From acl_run_avx512x8.h, in first_trans8 there is
_mm256_mmask_i32gather_epi32 which requires this flag, so compilation
will fail.
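To illustrate the point, a rough sketch only -- the real checks live in the
patch's meson.build, the fallback in acl_run_avx512.c, and the exact flag set
is whatever the patch verifies, not necessarily the one below. The build has
to guard the real AVX512 code path behind every ISA subset it uses, so that a
toolchain missing one of them (like the gcc above, which lacks __AVX512VL__)
still builds:

/* Illustrative guard, not the actual DPDK sources. */
#if defined(__AVX512F__) && defined(__AVX512VL__) && \
	defined(__AVX512CD__) && defined(__AVX512BW__)
/* build the real AVX512 classify implementation */
#else
/* fall back to the dummy rte_acl_classify_avx512() */
#endif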

> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 

-- 
Regards,
Vladimir

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-16  9:36       ` Medvedkin, Vladimir
@ 2020-09-16  9:49         ` Bruce Richardson
  2020-09-16 10:06           ` Ananyev, Konstantin
  0 siblings, 1 reply; 70+ messages in thread
From: Bruce Richardson @ 2020-09-16  9:49 UTC (permalink / raw)
  To: Medvedkin, Vladimir; +Cc: Konstantin Ananyev, dev, jerinj, ruifeng.wang

On Wed, Sep 16, 2020 at 10:36:32AM +0100, Medvedkin, Vladimir wrote:
> Hi Bruce,
> 
> On 16/09/2020 10:11, Bruce Richardson wrote:
> > On Tue, Sep 15, 2020 at 05:50:20PM +0100, Konstantin Ananyev wrote:
> > > Add necessary changes to support new AVX512 specific ACL classify
> > > algorithm:
> > >   - changes in meson.build to check that build tools
> > >     (compiler, assembler, etc.) do properly support AVX512.
> > >   - run-time checks to make sure target platform does support AVX512.
> > >   - dummy rte_acl_classify_avx512() for targets where AVX512
> > >     implementation couldn't be properly supported.
> > > 
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > ---
> > 
> > This all looks correct, though I wonder do you really need to check all
> > those AVX512 flags in each case? Since "F" is always present in any AVX512
> > implementation perhaps it can be checked, though if the other three always
> > need to be checked I can understand if you want to keep it there for
> > completeness. [Are all the other 3 used in your code?]
> > 
> 
> As for me it is good to check all the flags supported by compiler. Some old
> (but still supported by dpdk) gcc can't compile the code in some
> circumstances. For example:
> 
> gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)   <-- pretty old
> but still supported, right?
> 
> gcc -march=native -dM -E - < /dev/null | grep "AVX512"
> #define __AVX512F__ 1
> #define __AVX512BW__ 1
> #define __AVX512CD__ 1
> #define __AVX512DQ__ 1
> 
> Does not support __AVX512VL__
> 
Interesting, seems like checking them all to be sure is the right approach
so.
My ack stands so, and ignore the comment.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-16  9:49         ` Bruce Richardson
@ 2020-09-16 10:06           ` Ananyev, Konstantin
  0 siblings, 0 replies; 70+ messages in thread
From: Ananyev, Konstantin @ 2020-09-16 10:06 UTC (permalink / raw)
  To: Richardson, Bruce, Medvedkin, Vladimir; +Cc: dev, jerinj, ruifeng.wang

 
> On Wed, Sep 16, 2020 at 10:36:32AM +0100, Medvedkin, Vladimir wrote:
> > Hi Bruce,
> >
> > On 16/09/2020 10:11, Bruce Richardson wrote:
> > > On Tue, Sep 15, 2020 at 05:50:20PM +0100, Konstantin Ananyev wrote:
> > > > Add necessary changes to support new AVX512 specific ACL classify
> > > > algorithm:
> > > >   - changes in meson.build to check that build tools
> > > >     (compiler, assembler, etc.) do properly support AVX512.
> > > >   - run-time checks to make sure target platform does support AVX512.
> > > >   - dummy rte_acl_classify_avx512() for targets where AVX512
> > > >     implementation couldn't be properly supported.
> > > >
> > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > > ---
> > >
> > > This all looks correct, though I wonder do you really need to check all
> > > those AVX512 flags in each case? Since "F" is always present in any AVX512
> > > implementation perhaps it can be checked, though if the other three always
> > > need to be checked I can understand if you want to keep it there for
> > > completeness. [Are all the other 3 used in your code?]

Yep, ACL uses all of them.
Thanks
Konstantin

> > >
> >
> > As for me it is good to check all the flags supported by compiler. Some old
> > (but still supported by dpdk) gcc can't compile the code in some
> > circumstances. For example:
> >
> > gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)   <-- pretty old
> > but still supported, right?
> >
> > gcc -march=native -dM -E - < /dev/null | grep "AVX512"
> > #define __AVX512F__ 1
> > #define __AVX512BW__ 1
> > #define __AVX512CD__ 1
> > #define __AVX512DQ__ 1
> >
> > Does not support __AVX512VL__
> >
> Interesting, seems like checking them all to be sure is the right approach
> so.
> My ack stands so, and ignore the comment.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value Konstantin Ananyev
@ 2020-09-27  3:27     ` Ruifeng Wang
  0 siblings, 0 replies; 70+ messages in thread
From: Ruifeng Wang @ 2020-09-27  3:27 UTC (permalink / raw)
  To: Konstantin Ananyev, dev; +Cc: jerinj, vladimir.medvedkin, nd

> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Sent: Wednesday, September 16, 2020 12:50 AM
> To: dev@dpdk.org
> Cc: jerinj@marvell.com; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> vladimir.medvedkin@intel.com; Konstantin Ananyev
> <konstantin.ananyev@intel.com>
> Subject: [PATCH v2 03/12] acl: remove of unused enum value
> 
> Removal of unused enum value (RTE_ACL_CLASSIFY_NUM).
> This enum value is not used inside DPDK, while it prevents to add new
> classify algorithms without causing an ABI breakage.
> 
> Note that this change introduce a formal ABI incompatibility with previous
> versions of ACL library.
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  doc/guides/rel_notes/deprecation.rst   | 4 ----
>  doc/guides/rel_notes/release_20_11.rst | 4 ++++
>  lib/librte_acl/rte_acl.h               | 1 -
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst
> b/doc/guides/rel_notes/deprecation.rst
> index 52168f775..3279a01ef 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -288,10 +288,6 @@ Deprecation Notices
>    - https://patches.dpdk.org/patch/71457/
>    - https://patches.dpdk.org/patch/71456/
> 
> -* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value will be removed.
> -  This enum value is not used inside DPDK, while it prevents to add new
> -  classify algorithms without causing an ABI breakage.
> -
>  * sched: To allow more traffic classes, flexible mapping of pipe queues to
>    traffic classes, and subport level configuration of pipes and queues
>    changes will be made to macros, data structures and API functions defined
> diff --git a/doc/guides/rel_notes/release_20_11.rst
> b/doc/guides/rel_notes/release_20_11.rst
> index b729bdf20..a9a1b0305 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -97,6 +97,10 @@ API Changes
>    and the function ``rte_rawdev_queue_conf_get()``
>    from ``void`` to ``int`` allowing the return of error codes from drivers.
> 
> +* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value has been removed.
> +  This enum value was not used inside DPDK, while it prevented to add
> +new
> +  classify algorithms without causing an ABI breakage.
> +
> 
>  ABI Changes
>  -----------
> diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h index
> aa22e70c6..b814423a6 100644
> --- a/lib/librte_acl/rte_acl.h
> +++ b/lib/librte_acl/rte_acl.h
> @@ -241,7 +241,6 @@ enum rte_acl_classify_alg {
>  	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
>  	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
>  	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
> -	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
>  };
> 
>  /**
> --
> 2.17.1

Looks good from ABI perspective.
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (11 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 12/12] app/acl: " Konstantin Ananyev
@ 2020-10-05 18:45   ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 01/14] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
                       ` (15 more replies)
  12 siblings, 16 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

These patch series introduce support of AVX512 specific classify
implementation for ACL library.
They add two new algorithms:
 - RTE_ACL_CLASSIFY_AVX512X16 - can process up to 16 flows in parallel.
   It uses 256-bit width instructions/registers only
   (to avoid frequency level change).
   On my SKX box test-acl shows ~15-30% improvement
   (depending on rule-set and input burst size)
   when switching from AVX2 to AVX512X16 classify algorithms.
 - RTE_ACL_CLASSIFY_AVX512X32 - can process up to 32 flows in parallel.
   It uses 512-bit width instructions/registers and provides higher
   performance than AVX512X16, but can cause a frequency level change.
   On my SKX box test-acl shows ~50-70% improvement
   (depending on rule-set and input burst size)
   when switching from AVX2 to AVX512X32 classify algorithms.
   ICX and CLX testing showed a similar level of speedup.

The current AVX512 classify implementation is only supported on x86_64.
Note that this series introduces a formal ABI incompatibility
with previous versions of the ACL library.
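
As a usage sketch (not part of the series itself): an application with an
already built ACL context can opt in to one of the new methods and fall
back to the platform default when the method is not available on the
current machine. The helper below is hypothetical; only the rte_acl_*
calls and enum values are actual ACL API:

#include <rte_acl.h>

static int
classify_with_avx512(struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	/* prefer the 512-bit wide method, fall back to the default one */
	if (rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX512X32) != 0)
		rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_DEFAULT);

	return rte_acl_classify(ctx, data, results, num, categories);
}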

v2 -> v3:
  Fix checkpatch warnings
  Split AVX512 algorithm into two and deduplicate common code
v1 -> v2:
  Deduplicated 8/16 code paths as much as possible
  Updated default algorithm selection
    Removed library constructor to make it easier integrate with
    https://patches.dpdk.org/project/dpdk/list/?series=11831
  Updated docs

These patch series depends on:
https://patches.dpdk.org/patch/79310/
to be applied first.

Konstantin Ananyev (14):
  acl: fix x86 build when compiler doesn't support AVX2
  doc: fix missing classify methods in ACL guide
  acl: remove of unused enum value
  acl: remove library constructor
  app/acl: few small improvements
  test/acl: expand classify test coverage
  acl: add infrastructure to support AVX512 classify
  acl: introduce 256-bit width AVX512 classify implementation
  acl: update default classify algorithm selection
  acl: introduce 512-bit width AVX512 classify implementation
  acl: for AVX512 classify use 4B load whenever possible
  acl: deduplicate AVX512 code paths
  test/acl: add AVX512 classify support
  app/acl: add AVX512 classify support

 app/test-acl/main.c                           |  23 +-
 app/test/test_acl.c                           | 105 ++--
 config/x86/meson.build                        |   3 +-
 .../prog_guide/packet_classif_access_ctrl.rst |  20 +
 doc/guides/rel_notes/deprecation.rst          |   4 -
 doc/guides/rel_notes/release_20_11.rst        |  12 +
 lib/librte_acl/acl.h                          |  16 +
 lib/librte_acl/acl_bld.c                      |  34 ++
 lib/librte_acl/acl_gen.c                      |   2 +-
 lib/librte_acl/acl_run_avx512.c               | 164 ++++++
 lib/librte_acl/acl_run_avx512_common.h        | 477 ++++++++++++++++++
 lib/librte_acl/acl_run_avx512x16.h            | 341 +++++++++++++
 lib/librte_acl/acl_run_avx512x8.h             | 253 ++++++++++
 lib/librte_acl/meson.build                    |  39 ++
 lib/librte_acl/rte_acl.c                      | 212 ++++++--
 lib/librte_acl/rte_acl.h                      |   4 +-
 16 files changed, 1609 insertions(+), 100 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c
 create mode 100644 lib/librte_acl/acl_run_avx512_common.h
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 01/14] acl: fix x86 build when compiler doesn't support AVX2
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 02/14] doc: fix missing classify methods in ACL guide Konstantin Ananyev
                       ` (14 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Right now we define dummy version of rte_acl_classify_avx2()
when both X86 and AVX2 are not detected, though it should be
for non-AVX2 case only.

Fixes: e53ce4e41379 ("acl: remove use of weak functions")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 777ec4d340..715b023592 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,7 +16,6 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
-#ifndef RTE_ARCH_X86
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -33,6 +32,7 @@ rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
 }
 #endif
 
+#ifndef RTE_ARCH_X86
 int
 rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
 	__rte_unused const uint8_t **data,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 02/14] doc: fix missing classify methods in ACL guide
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 01/14] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 03/14] acl: remove of unused enum value Konstantin Ananyev
                       ` (13 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Add brief description for missing ACL classify algorithms:
RTE_ACL_CLASSIFY_NEON and RTE_ACL_CLASSIFY_ALTIVEC.

Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8")
Fixes: 1d73135f9f1c ("acl: add AltiVec for ppc64")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 doc/guides/prog_guide/packet_classif_access_ctrl.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index 0345512b9e..daf03e6d7a 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -373,6 +373,12 @@ There are several implementations of classify algorithm:
 
 *   **RTE_ACL_CLASSIFY_AVX2**: vector implementation, can process up to 16 flows in parallel. Requires AVX2 support.
 
+*   **RTE_ACL_CLASSIFY_NEON**: vector implementation, can process up to 8 flows
+    in parallel. Requires NEON support.
+
+*   **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8
+    flows in parallel. Requires ALTIVEC support.
+
 It is purely a runtime decision which method to choose, there is no build-time difference.
 All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation.
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 03/14] acl: remove of unused enum value
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 01/14] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 02/14] doc: fix missing classify methods in ACL guide Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 04/14] acl: remove library constructor Konstantin Ananyev
                       ` (12 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Removal of unused enum value (RTE_ACL_CLASSIFY_NUM).
This enum value is not used inside DPDK, while it prevents
adding new classify algorithms without causing an ABI breakage.

Note that this change introduces a formal ABI incompatibility
with previous versions of the ACL library.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 doc/guides/rel_notes/deprecation.rst   | 4 ----
 doc/guides/rel_notes/release_20_11.rst | 4 ++++
 lib/librte_acl/rte_acl.h               | 1 -
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 8080a28896..938e967c8f 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -209,10 +209,6 @@ Deprecation Notices
   - https://patches.dpdk.org/patch/71457/
   - https://patches.dpdk.org/patch/71456/
 
-* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value will be removed.
-  This enum value is not used inside DPDK, while it prevents to add new
-  classify algorithms without causing an ABI breakage.
-
 * sched: To allow more traffic classes, flexible mapping of pipe queues to
   traffic classes, and subport level configuration of pipes and queues
   changes will be made to macros, data structures and API functions defined
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 6d8c24413d..e0de60c0c2 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -210,6 +210,10 @@ API Changes
 
 * bpf: ``RTE_BPF_XTYPE_NUM`` has been dropped from ``rte_bpf_xtype``.
 
+* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value has been removed.
+  This enum value was not used inside DPDK, while it prevented to add new
+  classify algorithms without causing an ABI breakage.
+
 
 ABI Changes
 -----------
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index aa22e70c6e..b814423a63 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,7 +241,6 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
-	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
 };
 
 /**
-- 
2.17.1
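
To spell out the ABI concern: any application object whose size or layout
depends on the enum terminator changes as soon as a new algorithm is
appended before it. A hypothetical example of such application code (not
from DPDK itself):

#include <stdint.h>
#include <rte_acl.h>

/*
 * sizeof(struct app_acl_stats) silently changes whenever a new classify
 * algorithm is inserted before RTE_ACL_CLASSIFY_NUM, which is what made
 * appending new methods an ABI break while the terminator existed.
 */
struct app_acl_stats {
	uint64_t lookups[RTE_ACL_CLASSIFY_NUM];
};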


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 04/14] acl: remove library constructor
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (2 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 03/14] acl: remove of unused enum value Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 05/14] app/acl: few small improvements Konstantin Ananyev
                       ` (11 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Right now the ACL library determines the best possible (default) classify
method on a given platform with a special constructor function, rte_acl_init().
This patch makes the following changes:
 - Move selection of default classify method into a separate private
   function and call it for each ACL context creation (rte_acl_create()).
 - Remove library constructor function
 - Make rte_acl_set_ctx_classify() check that the requested algorithm
   is supported on the given platform.

The purpose of these changes is to improve and simplify the algorithm
selection process and to prepare the ACL library for integration with the
"add max SIMD bitwidth to EAL"
(https://patches.dpdk.org/project/dpdk/list/?series=11831)
patch-set.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 166 ++++++++++++++++++++++++++++++---------
 lib/librte_acl/rte_acl.h |   1 +
 2 files changed, 132 insertions(+), 35 deletions(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 715b023592..863549a38b 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -79,57 +79,153 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
 };
 
-/* by default, use always available scalar code path. */
-static enum rte_acl_classify_alg rte_acl_default_classify =
-	RTE_ACL_CLASSIFY_SCALAR;
+/*
+ * Helper function for acl_check_alg.
+ * Check support for ARM specific classify methods.
+ */
+static int
+acl_check_alg_arm(enum rte_acl_classify_alg alg)
+{
+	if (alg == RTE_ACL_CLASSIFY_NEON) {
+#if defined(RTE_ARCH_ARM64)
+		return 0;
+#elif defined(RTE_ARCH_ARM)
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON))
+			return 0;
+		return -ENOTSUP;
+#else
+		return -ENOTSUP;
+#endif
+	}
+
+	return -EINVAL;
+}
 
-static void
-rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
+/*
+ * Helper function for acl_check_alg.
+ * Check support for PPC specific classify methods.
+ */
+static int
+acl_check_alg_ppc(enum rte_acl_classify_alg alg)
 {
-	rte_acl_default_classify = alg;
+	if (alg == RTE_ACL_CLASSIFY_ALTIVEC) {
+#if defined(RTE_ARCH_PPC_64)
+		return 0;
+#else
+		return -ENOTSUP;
+#endif
+	}
+
+	return -EINVAL;
 }
 
-extern int
-rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+/*
+ * Helper function for acl_check_alg.
+ * Check support for x86 specific classify methods.
+ */
+static int
+acl_check_alg_x86(enum rte_acl_classify_alg alg)
 {
-	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
-		return -EINVAL;
+	if (alg == RTE_ACL_CLASSIFY_AVX2) {
+#ifdef CC_AVX2_SUPPORT
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
 
-	ctx->alg = alg;
-	return 0;
+	if (alg == RTE_ACL_CLASSIFY_SSE) {
+#ifdef RTE_ARCH_X86
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
+
+	return -EINVAL;
 }
 
 /*
- * Select highest available classify method as default one.
- * Note that CLASSIFY_AVX2 should be set as a default only
- * if both conditions are met:
- * at build time compiler supports AVX2 and target cpu supports AVX2.
+ * Check if input alg is supported by given platform/binary.
+ * Note that both conditions should be met:
+ * - at build time compiler supports ISA used by given methods
+ * - at run time target cpu supports necessary ISA.
  */
-RTE_INIT(rte_acl_init)
+static int
+acl_check_alg(enum rte_acl_classify_alg alg)
 {
-	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
+	switch (alg) {
+	case RTE_ACL_CLASSIFY_NEON:
+		return acl_check_alg_arm(alg);
+	case RTE_ACL_CLASSIFY_ALTIVEC:
+		return acl_check_alg_ppc(alg);
+	case RTE_ACL_CLASSIFY_AVX2:
+	case RTE_ACL_CLASSIFY_SSE:
+		return acl_check_alg_x86(alg);
+	/* scalar method is supported on all platforms */
+	case RTE_ACL_CLASSIFY_SCALAR:
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
 
-#if defined(RTE_ARCH_ARM64)
-	alg =  RTE_ACL_CLASSIFY_NEON;
-#elif defined(RTE_ARCH_ARM)
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON))
-		alg =  RTE_ACL_CLASSIFY_NEON;
+/*
+ * Get preferred alg for given platform.
+ */
+static enum rte_acl_classify_alg
+acl_get_best_alg(void)
+{
+	/*
+	 * array of supported methods for each platform.
+	 * Note that order is important - from most to less preferable.
+	 */
+	static const enum rte_acl_classify_alg alg[] = {
+#if defined(RTE_ARCH_ARM)
+		RTE_ACL_CLASSIFY_NEON,
 #elif defined(RTE_ARCH_PPC_64)
-	alg = RTE_ACL_CLASSIFY_ALTIVEC;
-#else
-#ifdef CC_AVX2_SUPPORT
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
-		alg = RTE_ACL_CLASSIFY_AVX2;
-	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
-#else
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		RTE_ACL_CLASSIFY_ALTIVEC,
+#elif defined(RTE_ARCH_X86)
+		RTE_ACL_CLASSIFY_AVX2,
+		RTE_ACL_CLASSIFY_SSE,
 #endif
-		alg = RTE_ACL_CLASSIFY_SSE;
+		RTE_ACL_CLASSIFY_SCALAR,
+	};
 
-#endif
-	rte_acl_set_default_classify(alg);
+	uint32_t i;
+
+	/* find best possible alg */
+	for (i = 0; i != RTE_DIM(alg) && acl_check_alg(alg[i]) != 0; i++)
+		;
+
+	/* we always have to find something suitable */
+	RTE_VERIFY(i != RTE_DIM(alg));
+	return alg[i];
+}
+
+extern int
+rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+{
+	int32_t rc;
+
+	/* formal parameters check */
+	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
+		return -EINVAL;
+
+	/* user asked us to select the *best* one */
+	if (alg == RTE_ACL_CLASSIFY_DEFAULT)
+		alg = acl_get_best_alg();
+
+	/* check that given alg is supported */
+	rc = acl_check_alg(alg);
+	if (rc != 0)
+		return rc;
+
+	ctx->alg = alg;
+	return 0;
 }
 
+
 int
 rte_acl_classify_alg(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories,
@@ -262,7 +358,7 @@ rte_acl_create(const struct rte_acl_param *param)
 		ctx->max_rules = param->max_rule_num;
 		ctx->rule_sz = param->rule_size;
 		ctx->socket_id = param->socket_id;
-		ctx->alg = rte_acl_default_classify;
+		ctx->alg = acl_get_best_alg();
 		strlcpy(ctx->name, param->name, sizeof(ctx->name));
 
 		te->data = (void *) ctx;
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index b814423a63..3999f15ded 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -329,6 +329,7 @@ rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
  *   existing algorithm, and that it could be run on the given CPU.
  * @return
  *   - -EINVAL if the parameters are invalid.
+ *   - -ENOTSUP requested algorithm is not supported by given platform.
  *   - Zero if operation completed successfully.
  */
 extern int
-- 
2.17.1
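
A small usage sketch of the stricter rte_acl_set_ctx_classify() semantics
introduced here; the probe function itself is hypothetical, only the enum
values and the -EINVAL/-ENOTSUP return codes come from the library:

#include <stdio.h>
#include <rte_common.h>
#include <rte_acl.h>

/*
 * Report which classify methods are usable on the current platform/binary,
 * relying on the run-time validation now done by rte_acl_set_ctx_classify().
 */
static void
report_supported_algs(struct rte_acl_ctx *ctx)
{
	static const enum rte_acl_classify_alg algs[] = {
		RTE_ACL_CLASSIFY_SCALAR,
		RTE_ACL_CLASSIFY_SSE,
		RTE_ACL_CLASSIFY_AVX2,
		RTE_ACL_CLASSIFY_NEON,
		RTE_ACL_CLASSIFY_ALTIVEC,
	};
	uint32_t i;

	for (i = 0; i != RTE_DIM(algs); i++)
		printf("alg %u: %s\n", (unsigned int)algs[i],
			rte_acl_set_ctx_classify(ctx, algs[i]) == 0 ?
			"supported" : "not supported");

	/* leave the context on the (validated) platform default */
	rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_DEFAULT);
}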


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 05/14] app/acl: few small improvements
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (3 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 04/14] acl: remove library constructor Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 06/14] test/acl: expand classify test coverage Konstantin Ananyev
                       ` (10 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

- enhance output to print extra stats
- use rte_rdtsc_precise() for cycle measurements

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index 0a5dfb621d..d9b65517cb 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -862,9 +862,10 @@ search_ip5tuples(__rte_unused void *arg)
 {
 	uint64_t pkt, start, tm;
 	uint32_t i, lcore;
+	long double st;
 
 	lcore = rte_lcore_id();
-	start = rte_rdtsc();
+	start = rte_rdtsc_precise();
 	pkt = 0;
 
 	for (i = 0; i != config.iter_num; i++) {
@@ -872,12 +873,16 @@ search_ip5tuples(__rte_unused void *arg)
 			config.trace_step, config.alg.name);
 	}
 
-	tm = rte_rdtsc() - start;
+	tm = rte_rdtsc_precise() - start;
+
+	st = (long double)tm / rte_get_timer_hz();
 	dump_verbose(DUMP_NONE, stdout,
 		"%s  @lcore %u: %" PRIu32 " iterations, %" PRIu64 " pkts, %"
-		PRIu32 " categories, %" PRIu64 " cycles, %#Lf cycles/pkt\n",
-		__func__, lcore, i, pkt, config.run_categories,
-		tm, (pkt == 0) ? 0 : (long double)tm / pkt);
+		PRIu32 " categories, %" PRIu64 " cycles (%.2Lf sec), "
+		"%.2Lf cycles/pkt, %.2Lf pkt/sec\n",
+		__func__, lcore, i, pkt,
+		config.run_categories, tm, st,
+		(pkt == 0) ? 0 : (long double)tm / pkt, pkt / st);
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 06/14] test/acl: expand classify test coverage
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (4 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 05/14] app/acl: few small improvements Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 07/14] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
                       ` (9 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Make the classify test run for all supported methods.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test/test_acl.c | 103 ++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 52 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 316bf4d065..333b347579 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -266,22 +266,20 @@ rte_acl_ipv4vlan_build(struct rte_acl_ctx *ctx,
 }
 
 /*
- * Test scalar and SSE ACL lookup.
+ * Test ACL lookup (selected alg).
  */
 static int
-test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
-	size_t dim)
+test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
+	const uint8_t *data[], size_t dim, enum rte_acl_classify_alg alg)
 {
-	int ret, i;
-	uint32_t result, count;
+	int32_t ret;
+	uint32_t i, result, count;
 	uint32_t results[dim * RTE_ACL_MAX_CATEGORIES];
-	const uint8_t *data[dim];
-	/* swap all bytes in the data to network order */
-	bswap_test_data(test_data, dim, 1);
 
-	/* store pointers to test data */
-	for (i = 0; i < (int) dim; i++)
-		data[i] = (uint8_t *)&test_data[i];
+	/* set given classify alg, skip test if alg is not supported */
+	ret = rte_acl_set_ctx_classify(acx, alg);
+	if (ret == -ENOTSUP)
+		return 0;
 
 	/**
 	 * these will run quite a few times, it's necessary to test code paths
@@ -291,12 +289,13 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 		ret = rte_acl_classify(acx, data, results,
 				count, RTE_ACL_MAX_CATEGORIES);
 		if (ret != 0) {
-			printf("Line %i: SSE classify failed!\n", __LINE__);
-			goto err;
+			printf("Line %i: classify(alg=%d) failed!\n",
+				__LINE__, alg);
+			return ret;
 		}
 
 		/* check if we allow everything we should allow */
-		for (i = 0; i < (int) count; i++) {
+		for (i = 0; i < count; i++) {
 			result =
 				results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW];
 			if (result != test_data[i].allow) {
@@ -304,63 +303,63 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, test_data[i].allow,
 					result);
-				ret = -EINVAL;
-				goto err;
+				return -EINVAL;
 			}
 		}
 
 		/* check if we deny everything we should deny */
-		for (i = 0; i < (int) count; i++) {
+		for (i = 0; i < count; i++) {
 			result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY];
 			if (result != test_data[i].deny) {
 				printf("Line %i: Error in deny results at %i "
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, test_data[i].deny,
 					result);
-				ret = -EINVAL;
-				goto err;
+				return -EINVAL;
 			}
 		}
 	}
 
-	/* make a quick check for scalar */
-	ret = rte_acl_classify_alg(acx, data, results,
-			dim, RTE_ACL_MAX_CATEGORIES,
-			RTE_ACL_CLASSIFY_SCALAR);
-	if (ret != 0) {
-		printf("Line %i: scalar classify failed!\n", __LINE__);
-		goto err;
-	}
+	/* restore default classify alg */
+	return rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_DEFAULT);
+}
 
-	/* check if we allow everything we should allow */
-	for (i = 0; i < (int) dim; i++) {
-		result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW];
-		if (result != test_data[i].allow) {
-			printf("Line %i: Error in allow results at %i "
-					"(expected %"PRIu32" got %"PRIu32")!\n",
-					__LINE__, i, test_data[i].allow,
-					result);
-			ret = -EINVAL;
-			goto err;
-		}
-	}
+/*
+ * Test ACL lookup (all possible methods).
+ */
+static int
+test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
+	size_t dim)
+{
+	int32_t ret;
+	uint32_t i;
+	const uint8_t *data[dim];
 
-	/* check if we deny everything we should deny */
-	for (i = 0; i < (int) dim; i++) {
-		result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY];
-		if (result != test_data[i].deny) {
-			printf("Line %i: Error in deny results at %i "
-					"(expected %"PRIu32" got %"PRIu32")!\n",
-					__LINE__, i, test_data[i].deny,
-					result);
-			ret = -EINVAL;
-			goto err;
-		}
-	}
+	static const enum rte_acl_classify_alg alg[] = {
+		RTE_ACL_CLASSIFY_SCALAR,
+		RTE_ACL_CLASSIFY_SSE,
+		RTE_ACL_CLASSIFY_AVX2,
+		RTE_ACL_CLASSIFY_NEON,
+		RTE_ACL_CLASSIFY_ALTIVEC,
+	};
+
+	/* swap all bytes in the data to network order */
+	bswap_test_data(test_data, dim, 1);
+
+	/* store pointers to test data */
+	for (i = 0; i < dim; i++)
+		data[i] = (uint8_t *)&test_data[i];
 
 	ret = 0;
+	for (i = 0; i != RTE_DIM(alg); i++) {
+		ret = test_classify_alg(acx, test_data, data, dim, alg[i]);
+		if (ret < 0) {
+			printf("Line %i: %s() for alg=%d failed, errno=%d\n",
+				__LINE__, __func__, alg[i], -ret);
+			break;
+		}
+	}
 
-err:
 	/* swap data back to cpu order so that next time tests don't fail */
 	bswap_test_data(test_data, dim, 0);
 	return ret;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 07/14] acl: add infrastructure to support AVX512 classify
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (5 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 06/14] test/acl: expand classify test coverage Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 08/14] acl: introduce 256-bit width AVX512 classify implementation Konstantin Ananyev
                       ` (8 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add necessary changes to support new AVX512 specific ACL classify
algorithm:
 - changes in meson.build to check that build tools
   (compiler, assembler, etc.) do properly support AVX512.
 - run-time checks to make sure target platform does support AVX512.
 - dummy rte_acl_classify_avx512() for targets where AVX512
   implementation couldn't be properly supported.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 config/x86/meson.build          |  3 ++-
 lib/librte_acl/acl.h            |  8 +++++++
 lib/librte_acl/acl_run_avx512.c | 29 +++++++++++++++++++++++
 lib/librte_acl/meson.build      | 39 ++++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c        | 42 +++++++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.h        |  2 ++
 6 files changed, 122 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c

diff --git a/config/x86/meson.build b/config/x86/meson.build
index fea4d54035..724e69f4c4 100644
--- a/config/x86/meson.build
+++ b/config/x86/meson.build
@@ -22,7 +22,8 @@ foreach f:base_flags
 endforeach
 
 optional_flags = ['AES', 'PCLMUL',
-		'AVX', 'AVX2', 'AVX512F',
+		'AVX', 'AVX2',
+		'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW',
 		'RDRND', 'RDSEED']
 foreach f:optional_flags
 	if cc.get_define('__@0@__'.format(f), args: machine_args) == '1'
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 39d45a0c2b..543ce55659 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -201,6 +201,14 @@ int
 rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
+int
+rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
+int
+rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
 int
 rte_acl_classify_neon(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
new file mode 100644
index 0000000000..1817f88b29
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "acl_run_sse.h"
+
+int
+rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (num >= MAX_SEARCHES_SSE8)
+		return search_sse_8(ctx, data, results, num, categories);
+	if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+
+	return rte_acl_classify_scalar(ctx, data, results, num, categories);
+}
+
+int
+rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (num >= MAX_SEARCHES_SSE8)
+		return search_sse_8(ctx, data, results, num, categories);
+	if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+
+	return rte_acl_classify_scalar(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/meson.build b/lib/librte_acl/meson.build
index b31a3f798e..30cdb8b761 100644
--- a/lib/librte_acl/meson.build
+++ b/lib/librte_acl/meson.build
@@ -27,6 +27,45 @@ if dpdk_conf.has('RTE_ARCH_X86')
 		cflags += '-DCC_AVX2_SUPPORT'
 	endif
 
+	# compile AVX512 version if:
+	# we are building 64-bit binary AND binutils can generate proper code
+
+	if dpdk_conf.has('RTE_ARCH_X86_64') and binutils_ok.returncode() == 0
+
+		# compile AVX512 version if either:
+		# a. we have AVX512 supported in minimum instruction set
+		#    baseline
+		# b. it's not minimum instruction set, but supported by
+		#    compiler
+		#
+		# in former case, just add avx512 C file to files list
+		# in latter case, compile c file to static lib, using correct
+		# compiler flags, and then have the .o file from static lib
+		# linked into main lib.
+
+		if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512VL') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512CD') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512BW')
+
+			sources += files('acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+
+		elif cc.has_multi_arguments('-mavx512f', '-mavx512vl',
+					'-mavx512cd', '-mavx512bw')
+
+			avx512_tmplib = static_library('avx512_tmp',
+				'acl_run_avx512.c',
+				dependencies: static_rte_eal,
+				c_args: cflags +
+					['-mavx512f', '-mavx512vl',
+					 '-mavx512cd', '-mavx512bw'])
+			objs += avx512_tmplib.extract_objects(
+					'acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+		endif
+	endif
+
 elif dpdk_conf.has('RTE_ARCH_ARM') or dpdk_conf.has('RTE_ARCH_ARM64')
 	cflags += '-flax-vector-conversions'
 	sources += files('acl_run_neon.c')
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 863549a38b..1154f35107 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,6 +16,32 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
+#ifndef CC_AVX512_SUPPORT
+/*
+ * If the compiler doesn't support AVX512 instructions,
+ * then the dummy one would be used instead for AVX512 classify method.
+ */
+int
+rte_acl_classify_avx512x16(__rte_unused const struct rte_acl_ctx *ctx,
+	__rte_unused const uint8_t **data,
+	__rte_unused uint32_t *results,
+	__rte_unused uint32_t num,
+	__rte_unused uint32_t categories)
+{
+	return -ENOTSUP;
+}
+
+int
+rte_acl_classify_avx512x32(__rte_unused const struct rte_acl_ctx *ctx,
+	__rte_unused const uint8_t **data,
+	__rte_unused uint32_t *results,
+	__rte_unused uint32_t num,
+	__rte_unused uint32_t categories)
+{
+	return -ENOTSUP;
+}
+#endif
+
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -77,6 +103,8 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
 	[RTE_ACL_CLASSIFY_NEON] = rte_acl_classify_neon,
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
+	[RTE_ACL_CLASSIFY_AVX512X16] = rte_acl_classify_avx512x16,
+	[RTE_ACL_CLASSIFY_AVX512X32] = rte_acl_classify_avx512x32,
 };
 
 /*
@@ -126,6 +154,18 @@ acl_check_alg_ppc(enum rte_acl_classify_alg alg)
 static int
 acl_check_alg_x86(enum rte_acl_classify_alg alg)
 {
+	if (alg == RTE_ACL_CLASSIFY_AVX512X16 ||
+			alg == RTE_ACL_CLASSIFY_AVX512X32) {
+#ifdef CC_AVX512_SUPPORT
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
+
 	if (alg == RTE_ACL_CLASSIFY_AVX2) {
 #ifdef CC_AVX2_SUPPORT
 		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
@@ -159,6 +199,8 @@ acl_check_alg(enum rte_acl_classify_alg alg)
 		return acl_check_alg_arm(alg);
 	case RTE_ACL_CLASSIFY_ALTIVEC:
 		return acl_check_alg_ppc(alg);
+	case RTE_ACL_CLASSIFY_AVX512X32:
+	case RTE_ACL_CLASSIFY_AVX512X16:
 	case RTE_ACL_CLASSIFY_AVX2:
 	case RTE_ACL_CLASSIFY_SSE:
 		return acl_check_alg_x86(alg);
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index 3999f15ded..1bfed00743 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,6 +241,8 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
+	RTE_ACL_CLASSIFY_AVX512X16 = 6,  /**< requires AVX512 support. */
+	RTE_ACL_CLASSIFY_AVX512X32 = 7,  /**< requires AVX512 support. */
 };
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 08/14] acl: introduce 256-bit width AVX512 classify implementation
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (6 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 07/14] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 09/14] acl: update default classify algorithm selection Konstantin Ananyev
                       ` (7 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Introduce classify implementation that uses AVX512 specific ISA.
rte_acl_classify_avx512x16() is able to process up to 16 flows in parallel.
It uses 256-bit width registers/instructions only
(to avoid frequency level change).
Note that for now only 64-bit version is supported.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 .../prog_guide/packet_classif_access_ctrl.rst |   4 +
 doc/guides/rel_notes/release_20_11.rst        |   5 +
 lib/librte_acl/acl.h                          |   7 +
 lib/librte_acl/acl_gen.c                      |   2 +-
 lib/librte_acl/acl_run_avx512.c               | 129 ++++
 lib/librte_acl/acl_run_avx512x8.h             | 642 ++++++++++++++++++
 6 files changed, 788 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index daf03e6d7a..11f4bc841b 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -379,6 +379,10 @@ There are several implementations of classify algorithm:
 *   **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8
     flows in parallel. Requires ALTIVEC support.
 
+*   **RTE_ACL_CLASSIFY_AVX512X16**: vector implementation, can process up to 16
+    flows in parallel. Uses 256-bit width SIMD registers.
+    Requires AVX512 support.
+
 It is purely a runtime decision which method to choose, there is no build-time difference.
 All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation.
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index e0de60c0c2..95d7bfd777 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -107,6 +107,11 @@ New Features
   * Extern objects and functions can be plugged into the pipeline.
   * Transaction-oriented table updates.
 
+* **Add new AVX512 specific classify algorithms for ACL library.**
+
+  * Added new ``RTE_ACL_CLASSIFY_AVX512X16`` vector implementation,
+    which can process up to 16 flows in parallel. Requires AVX512 support.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 543ce55659..7ac0d12f08 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -76,6 +76,13 @@ struct rte_acl_bitset {
  * input_byte - ((uint8_t *)&transition)[4 + input_byte / 64].
  */
 
+/*
+ * Each ACL RT contains an idle nomatch node:
+ * a SINGLE node at predefined position (RTE_ACL_DFA_SIZE)
+ * that points to itself.
+ */
+#define RTE_ACL_IDLE_NODE	(RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE)
+
 /*
  * Structure of a node is a set of ptrs and each ptr has a bit map
  * of values associated with this transition.
diff --git a/lib/librte_acl/acl_gen.c b/lib/librte_acl/acl_gen.c
index f1b9d12f1e..e759a2ca15 100644
--- a/lib/librte_acl/acl_gen.c
+++ b/lib/librte_acl/acl_gen.c
@@ -496,7 +496,7 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	 * highest index, that points to itself)
 	 */
 
-	node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE;
+	node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_IDLE_NODE;
 
 	for (n = 0; n < RTE_ACL_DFA_SIZE; n++)
 		node_array[n] = no_match;
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 1817f88b29..f5bc628b7c 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -4,10 +4,126 @@
 
 #include "acl_run_sse.h"
 
+/*sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results)*/
+static const uint32_t match_log = 5;
+
+struct acl_flow_avx512 {
+	uint32_t num_packets;       /* number of packets processed */
+	uint32_t total_packets;     /* max number of packets to process */
+	uint32_t root_index;        /* current root index */
+	const uint64_t *trans;      /* transition table */
+	const uint32_t *data_index; /* input data indexes */
+	const uint8_t **idata;      /* input data */
+	uint32_t *matches;          /* match indexes */
+};
+
+static inline void
+acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
+	uint32_t trie, const uint8_t *data[], uint32_t *matches,
+	uint32_t total_packets)
+{
+	flow->num_packets = 0;
+	flow->total_packets = total_packets;
+	flow->root_index = ctx->trie[trie].root_index;
+	flow->trans = ctx->trans_table;
+	flow->data_index = ctx->trie[trie].data_index;
+	flow->idata = data;
+	flow->matches = matches;
+}
+
+/*
+ * Update flow and result masks based on the number of unprocessed flows.
+ */
+static inline uint32_t
+update_flow_mask(const struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
+/*
+ * Resolve matches for multiple categories (LE 8, use 128b instructions/regs)
+ */
+static inline void
+resolve_mcle8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, j, k, mi, mn;
+	__mmask8 msk;
+	xmm_t cp, cr, np, nr;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		for (j = 0; j != nb_cat; j += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			cr = _mm_loadu_si128((const xmm_t *)(res + mi + j));
+			cp = _mm_loadu_si128((const xmm_t *)(pri + mi + j));
+
+			for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+				mn = j + (pm[k] << match_log);
+
+				nr = _mm_loadu_si128((const xmm_t *)(res + mn));
+				np = _mm_loadu_si128((const xmm_t *)(pri + mn));
+
+				msk = _mm_cmpgt_epi32_mask(cp, np);
+				cr = _mm_mask_mov_epi32(nr, msk, cr);
+				cp = _mm_mask_mov_epi32(np, msk, cp);
+			}
+
+			_mm_storeu_si128((xmm_t *)(result + j), cr);
+		}
+	}
+}
+
+#include "acl_run_avx512x8.h"
+
 int
 rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	const uint32_t max_iter = MAX_SEARCHES_AVX16 * MAX_SEARCHES_AVX16;
+
+	/* split huge lookup (gt 256) into series of fixed size ones */
+	while (num > max_iter) {
+		search_avx512x8x2(ctx, data, results, max_iter, categories);
+		data += max_iter;
+		results += max_iter * categories;
+		num -= max_iter;
+	}
+
+	/* select classify method based on number of remaining requests */
+	if (num >= MAX_SEARCHES_AVX16)
+		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
 		return search_sse_8(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE4)
@@ -20,6 +136,19 @@ int
 rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	const uint32_t max_iter = MAX_SEARCHES_AVX16 * MAX_SEARCHES_AVX16;
+
+	/* split huge lookup (gt 256) into series of fixed size ones */
+	while (num > max_iter) {
+		search_avx512x8x2(ctx, data, results, max_iter, categories);
+		data += max_iter;
+		results += max_iter * categories;
+		num -= max_iter;
+	}
+
+	/* select classify method based on number of remaining requests */
+	if (num >= MAX_SEARCHES_AVX16)
+		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
 		return search_sse_8(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE4)
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
new file mode 100644
index 0000000000..cfba0299ed
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -0,0 +1,642 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define MASK8_BIT	(sizeof(__mmask8) * CHAR_BIT)
+
+#define NUM_AVX512X8X2	(2 * MASK8_BIT)
+#define MSK_AVX512X8X2	(NUM_AVX512X8X2 - 1)
+
+/* num/mask of pointers per SIMD regs */
+#define YMM_PTR_NUM	(sizeof(__m256i) / sizeof(uintptr_t))
+#define YMM_PTR_MSK	RTE_LEN2MASK(YMM_PTR_NUM, uint32_t)
+
+static const rte_ymm_t ymm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_ymm_t ymm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_ymm_t ymm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+	},
+};
+
+static const rte_ymm_t ymm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const rte_ymm_t ymm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const rte_ymm_t ymm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const rte_ymm_t ymm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+	},
+};
+
+static const rte_ymm_t ymm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+static const rte_ymm_t ymm_pminp = {
+	.u32 = {
+		0x00, 0x01, 0x02, 0x03,
+		0x08, 0x09, 0x0a, 0x0b,
+	},
+};
+
+static const __mmask16 ymm_pmidx_msk = 0x55;
+
+static const rte_ymm_t ymm_pmidx[2] = {
+	[0] = {
+		.u32 = {
+			0, 0, 1, 0, 2, 0, 3, 0,
+		},
+	},
+	[1] = {
+		.u32 = {
+			4, 0, 5, 0, 6, 0, 7, 0,
+		},
+	},
+};
+
+/*
+ * unfortunately current AVX512 ISA doesn't provide ability for
+ * gather load on a byte quantity. So we have to mimic it in SW,
+ * by doing 4x1B scalar loads.
+ */
+static inline __m128i
+_m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask)
+{
+	rte_xmm_t v;
+	rte_ymm_t p;
+
+	static const uint32_t zero;
+
+	p.y = _mm256_mask_set1_epi64(pdata, mask ^ YMM_PTR_MSK,
+		(uintptr_t)&zero);
+
+	v.u32[0] = *(uint8_t *)p.u64[0];
+	v.u32[1] = *(uint8_t *)p.u64[1];
+	v.u32[2] = *(uint8_t *)p.u64[2];
+	v.u32[3] = *(uint8_t *)p.u64[3];
+
+	return v.x;
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same ways as quad range nodes.
+ */
+static __rte_always_inline __m256i
+calc_addr8(__m256i index_mask, __m256i next_input, __m256i shuffle_input,
+	__m256i four_32, __m256i range_base, __m256i tr_lo, __m256i tr_hi)
+{
+	__mmask32 qm;
+	__mmask8 dfa_msk;
+	__m256i addr, in, node_type, r, t;
+	__m256i dfa_ofs, quad_ofs;
+
+	t = _mm256_xor_si256(index_mask, index_mask);
+	in = _mm256_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm256_andnot_si256(index_mask, tr_lo);
+	addr = _mm256_and_si256(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm256_cmpeq_epi32_mask(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm256_srli_epi32(in, 30);
+	r = _mm256_add_epi8(r, range_base);
+	t = _mm256_srli_epi32(in, 24);
+	r = _mm256_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm256_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _mm256_cmpgt_epi8_mask(in, tr_hi);
+	t = _mm256_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
+	t = _mm256_lzcnt_epi32(t);
+	t = _mm256_srli_epi32(t, 3);
+	quad_ofs = _mm256_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm256_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _mm256_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 16 transitions in parallel.
+ * tr_lo contains low 32 bits for 16 transition.
+ * tr_hi contains high 32 bits for 16 transition.
+ * next_input contains up to 4 input bytes for 16 flows.
+ */
+static __rte_always_inline __m256i
+transition8(__m256i next_input, const uint64_t *trans, __m256i *tr_lo,
+	__m256i *tr_hi)
+{
+	const int32_t *tr;
+	__m256i addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 8 transitions. */
+	addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
+		ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 16 transactions at once. */
+	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
+
+	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 16 transactions at once. */
+	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 16 flows in parallel.
+ * next_input should contain one input byte for up to 16 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static __rte_always_inline void
+first_trans8(const struct acl_flow_avx512 *flow, __m256i next_input,
+	__mmask8 msk, __m256i *tr_lo, __m256i *tr_hi)
+{
+	const int32_t *tr;
+	__m256i addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm256_set1_epi32(UINT8_MAX);
+	root = _mm256_set1_epi32(flow->root_index);
+
+	addr = _mm256_and_si256(next_input, addr);
+	addr = _mm256_add_epi32(root, addr);
+
+	/* load lower 32 bits of 16 transactions at once. */
+	*tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 16 transactions at once. */
+	*tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to 16 flows in parallel.
+ * pdata - 8x2 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these 16 flows.
+ */
+static inline __m256i
+get_next_bytes_avx512x8(const struct acl_flow_avx512 *flow, __m256i pdata[2],
+	uint32_t msk, __m256i *di, uint32_t bnum)
+{
+	const int32_t *div;
+	uint32_t m[2];
+	__m256i one, zero, t, p[2];
+	__m128i inp[2];
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm256_set1_epi32(1);
+	zero = _mm256_xor_si256(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm256_mmask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm256_mask_add_epi32(*di, msk, *di, one);
+
+	/*
+	 * unsigned expand 32-bit indexes to 64-bit
+	 * (for later pointer arithmetic), i.e:
+	 * for (i = 0; i != 16; i++)
+	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
+	 */
+	p[0] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[0].y, t);
+	p[1] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[1].y, t);
+
+	p[0] = _mm256_add_epi64(p[0], pdata[0]);
+	p[1] = _mm256_add_epi64(p[1], pdata[1]);
+
+	/* load input byte(s), either one or four */
+
+	m[0] = msk & YMM_PTR_MSK;
+	m[1] = msk >> YMM_PTR_NUM;
+
+	if (bnum == sizeof(uint8_t)) {
+		inp[0] = _m256_mask_gather_epi8x4(p[0], m[0]);
+		inp[1] = _m256_mask_gather_epi8x4(p[1], m[1]);
+	} else {
+		inp[0] = _mm256_mmask_i64gather_epi32(
+				_mm256_castsi256_si128(zero), m[0], p[0],
+				NULL, sizeof(uint8_t));
+		inp[1] = _mm256_mmask_i64gather_epi32(
+				_mm256_castsi256_si128(zero), m[1], p[1],
+				NULL, sizeof(uint8_t));
+	}
+
+	/* squeeze input into one 512-bit register */
+	return _mm256_permutex2var_epi32(_mm256_castsi128_si256(inp[0]),
+			ymm_pminp.y,  _mm256_castsi128_si256(inp[1]));
+}
+
+/*
+ * Start up to 16 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * idx - match indexed for given flows
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m256i pdata[2], __m256i *idx, __m256i *di)
+{
+	uint32_t n, m[2], nm[2];
+	__m256i ni, nd[2];
+
+	m[0] = msk & YMM_PTR_MSK;
+	m[1] = msk >> YMM_PTR_NUM;
+
+	n = __builtin_popcount(m[0]);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	/* load input data pointers for new flows */
+	nd[0] = _mm256_maskz_loadu_epi64(nm[0],
+		flow->idata + flow->num_packets);
+	nd[1] = _mm256_maskz_loadu_epi64(nm[1],
+		flow->idata + flow->num_packets + n);
+
+	/* calculate match indexes of new flows */
+	ni = _mm256_set1_epi32(flow->num_packets);
+	ni = _mm256_add_epi32(ni, ymm_idx_add.y);
+
+	/* merge new and existing flows data */
+	pdata[0] = _mm256_mask_expand_epi64(pdata[0], m[0], nd[0]);
+	pdata[1] = _mm256_mask_expand_epi64(pdata[1], m[1], nd[1]);
+
+	/* update match and data indexes */
+	*idx = _mm256_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Process found matches for up to 8 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexes for given flows
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static inline uint32_t
+match_process_avx512x8(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk, __m256i pdata[2], __m256i *di, __m256i *idx,
+	__m256i *tr_lo, __m256i *tr_hi)
+{
+	uint32_t n;
+	__m256i res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y);
+
+	/* mask matched transitions to nop */
+	tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y);
+	tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y);
+
+	/* save found match indexes */
+	_mm256_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask(flow, fmsk, rmsk);
+	start_flow8(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+/*
+ * Test for matches up to 16 (2x8) flows at once,
+ * if matches exist - process them and start new flows.
+ */
+static inline void
+match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
+	__m256i pdata[4], __m256i di[2], __m256i idx[2], __m256i inp[2],
+	__m256i tr_lo[2], __m256i tr_hi[2])
+{
+	uint32_t n[2];
+	uint32_t rm[2];
+
+	/* check for matches */
+	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
+	rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_bytes_avx512x8(flow, &pdata[0],
+				rm[0], &di[0], sizeof(uint8_t));
+			first_trans8(flow, inp[0], rm[0], &tr_lo[0],
+				&tr_hi[0]);
+			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
+				ymm_match_mask.y);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_bytes_avx512x8(flow, &pdata[2],
+				rm[1], &di[1], sizeof(uint8_t));
+			first_trans8(flow, inp[1], rm[1], &tr_lo[1],
+				&tr_hi[1]);
+			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
+				ymm_match_mask.y);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 16 flows in parallel.
+ * Use two sets of metadata, each serves 8 flows max.
+ * So in fact we perform search for 2x8 flows.
+ */
+static inline void
+search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
+{
+	uint32_t fm[2];
+	__m256i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]);
+
+	in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0],
+			sizeof(uint8_t));
+	in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1],
+			sizeof(uint8_t));
+
+	first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT8_MAX;
+	fm[1] = UINT8_MAX;
+
+	/* match check */
+	match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = get_next_bytes_avx512x8(flow, &pdata[0], fm[0],
+			&di[0], sizeof(uint32_t));
+		in[1] = get_next_bytes_avx512x8(flow, &pdata[2], fm[1],
+			&di[1], sizeof(uint32_t));
+
+		/* main 4B loop */
+
+		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline __m256i
+resolve_match_idx_avx512x8(__m256i mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm256_slli_epi32(mi, match_log);
+}
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline __m256i
+resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask16 m;
+	__m256i cp, cr, np, nr, mch;
+
+	const __m256i zero = _mm256_set1_epi32(0);
+
+	/* get match indexes */
+	mch = _mm256_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x8(mch);
+
+	/* read result and priority values for first trie */
+	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	/*
+	 * read result and priority values for next tries and select one
+	 * with highest priority.
+	 */
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm256_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x8(mch);
+
+		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm256_cmpgt_epi32_mask(cp, np);
+		cr = _mm256_mask_mov_epi32(nr, m, cr);
+		cp = _mm256_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 8) matches for single category
+ */
+static inline void
+resolve_sc_avx512x8(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	__mmask8 msk;
+	__m256i cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
+	_mm256_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x8x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t j, k, n;
+	const int32_t *res, *pri;
+	__m256i cr[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
+
+		j = k + MASK8_BIT;
+
+		cr[0] = resolve_pri_avx512x8(res, pri, match + k, UINT8_MAX,
+				nb_trie, nb_pkt);
+		cr[1] = resolve_pri_avx512x8(res, pri, match + j, UINT8_MAX,
+				nb_trie, nb_pkt);
+
+		_mm256_storeu_si256((void *)(result + k), cr[0]);
+		_mm256_storeu_si256((void *)(result + j), cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > MASK8_BIT) {
+			resolve_sc_avx512x8(result + k, res, pri, match + k,
+				MASK8_BIT, nb_trie, nb_pkt);
+			k += MASK8_BIT;
+			n -= MASK8_BIT;
+		}
+		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x8x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 09/14] acl: update default classify algorithm selection
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (7 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 08/14] acl: introduce 256-bit width AVX512 classify implementation Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 10/14] acl: introduce 512-bit width AVX512 classify implementation Konstantin Ananyev
                       ` (6 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

On supported platforms, set RTE_ACL_CLASSIFY_AVX512X16 as the
default ACL classify algorithm.
Note that the AVX512X16 implementation uses 256-bit registers/intrinsics only
to avoid the possibility of a frequency drop.
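
As a minimal sketch (not part of this patch, error handling trimmed),
an application that prefers the previous behaviour can override the
automatically selected default after build via the existing
rte_acl_set_ctx_classify() API:

	#include <rte_acl.h>

	static int
	force_avx2_classify(struct rte_acl_ctx *ctx)
	{
		/* returns a negative value if the method is not supported */
		return rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX2);
	}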

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 1154f35107..245af672ee 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -228,6 +228,7 @@ acl_get_best_alg(void)
 #elif defined(RTE_ARCH_PPC_64)
 		RTE_ACL_CLASSIFY_ALTIVEC,
 #elif defined(RTE_ARCH_X86)
+		RTE_ACL_CLASSIFY_AVX512X16,
 		RTE_ACL_CLASSIFY_AVX2,
 		RTE_ACL_CLASSIFY_SSE,
 #endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 10/14] acl: introduce 512-bit width AVX512 classify implementation
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (8 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 09/14] acl: update default classify algorithm selection Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 11/14] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
                       ` (5 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Introduce a classify implementation that uses AVX512 specific ISA.
rte_acl_classify_avx512x32() is able to process up to 32 flows in parallel.
It uses 512-bit width registers/instructions and provides higher
performance than rte_acl_classify_avx512x16(), but can cause a
frequency level change.
Note that for now only the 64-bit version is supported.
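
Since this method is not selected by default, a minimal usage sketch
(not part of this patch; the single-category setting is just an example
value) of requesting it explicitly for one burst through the existing
rte_acl_classify_alg() API:

	#include <rte_acl.h>

	static int
	classify_burst_avx512x32(const struct rte_acl_ctx *ctx,
		const uint8_t **data, uint32_t *results, uint32_t num)
	{
		/* the caller must ensure the CPU supports the required AVX512 ISA */
		return rte_acl_classify_alg(ctx, data, results, num,
			1 /* categories */, RTE_ACL_CLASSIFY_AVX512X32);
	}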

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

This patch depends on:
https://patches.dpdk.org/patch/79310/
to be applied first.

 .../prog_guide/packet_classif_access_ctrl.rst |  10 +
 doc/guides/rel_notes/release_20_11.rst        |   3 +
 lib/librte_acl/acl_run_avx512.c               |   6 +-
 lib/librte_acl/acl_run_avx512x16.h            | 732 ++++++++++++++++++
 4 files changed, 750 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index 11f4bc841b..7659af8eb5 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -383,10 +383,20 @@ There are several implementations of classify algorithm:
     flows in parallel. Uses 256-bit width SIMD registers.
     Requires AVX512 support.
 
+*   **RTE_ACL_CLASSIFY_AVX512X32**: vector implementation, can process up to 32
+    flows in parallel. Uses 512-bit width SIMD registers.
+    Requires AVX512 support.
+
 It is purely a runtime decision which method to choose; there is no build-time difference.
 All implementations operate over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup the ACL library determines the highest available classify method for the given platform and sets it as the default one. Though the user has the ability to override the default classifier function for a given ACL context or perform a particular search using a non-default classify method. In that case it is the user's responsibility to make sure that the given platform supports the selected classify implementation.
 
+.. note::
+
+     Right now ``RTE_ACL_CLASSIFY_AVX512X32`` is not selected by default
+     (due to a possible frequency level change), but it can be selected at
+     runtime by the application through the ACL API: ``rte_acl_set_ctx_classify``.
+
 Application Programming Interface (API) Usage
 ---------------------------------------------
 
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 95d7bfd777..001e46f595 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -112,6 +112,9 @@ New Features
   * Added new ``RTE_ACL_CLASSIFY_AVX512X16`` vector implementation,
     which can process up to 16 flows in parallel. Requires AVX512 support.
 
+  * Added new ``RTE_ACL_CLASSIFY_AVX512X32`` vector implementation,
+    which can process up to 32 flows in parallel. Requires AVX512 support.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index f5bc628b7c..74698fa2ea 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -132,6 +132,8 @@ rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	return rte_acl_classify_scalar(ctx, data, results, num, categories);
 }
 
+#include "acl_run_avx512x16.h"
+
 int
 rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
@@ -140,13 +142,15 @@ rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	/* split huge lookup (gt 256) into series of fixed size ones */
 	while (num > max_iter) {
-		search_avx512x8x2(ctx, data, results, max_iter, categories);
+		search_avx512x16x2(ctx, data, results, max_iter, categories);
 		data += max_iter;
 		results += max_iter * categories;
 		num -= max_iter;
 	}
 
 	/* select classify method based on number of remaining requests */
+	if (num >= 2 * MAX_SEARCHES_AVX16)
+		return search_avx512x16x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_AVX16)
 		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
new file mode 100644
index 0000000000..981f8d16da
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -0,0 +1,732 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define	MASK16_BIT	(sizeof(__mmask16) * CHAR_BIT)
+
+#define NUM_AVX512X16X2	(2 * MASK16_BIT)
+#define MSK_AVX512X16X2	(NUM_AVX512X16X2 - 1)
+
+/* num/mask of pointers per SIMD regs */
+#define ZMM_PTR_NUM	(sizeof(__m512i) / sizeof(uintptr_t))
+#define ZMM_PTR_MSK	RTE_LEN2MASK(ZMM_PTR_NUM, uint32_t)
+
+static const __rte_x86_zmm_t zmm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+		8, 9, 10, 11,
+		12, 13, 14, 15,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_pminp = {
+	.u32 = {
+		0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+		0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
+	},
+};
+
+static const __mmask16 zmm_pmidx_msk = 0x5555;
+
+static const __rte_x86_zmm_t zmm_pmidx[2] = {
+	[0] = {
+		.u32 = {
+			0, 0, 1, 0, 2, 0, 3, 0,
+			4, 0, 5, 0, 6, 0, 7, 0,
+		},
+	},
+	[1] = {
+		.u32 = {
+			8, 0, 9, 0, 10, 0, 11, 0,
+			12, 0, 13, 0, 14, 0, 15, 0,
+		},
+	},
+};
+
+/*
+ * Unfortunately, the current AVX512 ISA doesn't provide the ability to do a
+ * gather load on a byte quantity. So we have to mimic it in SW,
+ * by doing 8x1B scalar loads.
+ */
+static inline ymm_t
+_m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
+{
+	rte_ymm_t v;
+	__rte_x86_zmm_t p;
+
+	static const uint32_t zero;
+
+	p.z = _mm512_mask_set1_epi64(pdata, mask ^ ZMM_PTR_MSK,
+		(uintptr_t)&zero);
+
+	v.u32[0] = *(uint8_t *)p.u64[0];
+	v.u32[1] = *(uint8_t *)p.u64[1];
+	v.u32[2] = *(uint8_t *)p.u64[2];
+	v.u32[3] = *(uint8_t *)p.u64[3];
+	v.u32[4] = *(uint8_t *)p.u64[4];
+	v.u32[5] = *(uint8_t *)p.u64[5];
+	v.u32[6] = *(uint8_t *)p.u64[6];
+	v.u32[7] = *(uint8_t *)p.u64[7];
+
+	return v.y;
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same way as quad range nodes.
+ */
+static __rte_always_inline __m512i
+calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input,
+	__m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi)
+{
+	__mmask64 qm;
+	__mmask16 dfa_msk;
+	__m512i addr, in, node_type, r, t;
+	__m512i dfa_ofs, quad_ofs;
+
+	t = _mm512_xor_si512(index_mask, index_mask);
+	in = _mm512_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm512_andnot_si512(index_mask, tr_lo);
+	addr = _mm512_and_si512(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm512_srli_epi32(in, 30);
+	r = _mm512_add_epi8(r, range_base);
+	t = _mm512_srli_epi32(in, 24);
+	r = _mm512_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm512_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _mm512_cmpgt_epi8_mask(in, tr_hi);
+	t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
+	t = _mm512_lzcnt_epi32(t);
+	t = _mm512_srli_epi32(t, 3);
+	quad_ofs = _mm512_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _mm512_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 16 transitions in parallel.
+ * tr_lo contains low 32 bits for 16 transitions.
+ * tr_hi contains high 32 bits for 16 transitions.
+ * next_input contains up to 4 input bytes for 16 flows.
+ */
+static __rte_always_inline __m512i
+transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo,
+	__m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 16 transitions. */
+	addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z,
+		zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 16 transitions at once. */
+	*tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0]));
+
+	next_input = _mm512_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 16 transitions at once. */
+	*tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 16 flows in parallel.
+ * next_input should contain one input byte for up to 16 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static __rte_always_inline void
+first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input,
+	__mmask16 msk, __m512i *tr_lo, __m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm512_set1_epi32(UINT8_MAX);
+	root = _mm512_set1_epi32(flow->root_index);
+
+	addr = _mm512_and_si512(next_input, addr);
+	addr = _mm512_add_epi32(root, addr);
+
+	/* load lower 32 bits of 16 transitions at once. */
+	*tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 16 transitions at once. */
+	*tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to 16 flows in parallel.
+ * pdata - 8x2 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these 16 flows.
+ */
+static inline __m512i
+get_next_bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2],
+	uint32_t msk, __m512i *di, uint32_t bnum)
+{
+	const int32_t *div;
+	uint32_t m[2];
+	__m512i one, zero, t, p[2];
+	ymm_t inp[2];
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm512_set1_epi32(1);
+	zero = _mm512_xor_si512(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm512_mask_add_epi32(*di, msk, *di, one);
+
+	/*
+	 * unsigned expand 32-bit indexes to 64-bit
+	 * (for later pointer arithmetic), i.e:
+	 * for (i = 0; i != 16; i++)
+	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
+	 */
+	p[0] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[0].z, t);
+	p[1] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[1].z, t);
+
+	p[0] = _mm512_add_epi64(p[0], pdata[0]);
+	p[1] = _mm512_add_epi64(p[1], pdata[1]);
+
+	/* load input byte(s), either one or four */
+
+	m[0] = msk & ZMM_PTR_MSK;
+	m[1] = msk >> ZMM_PTR_NUM;
+
+	if (bnum == sizeof(uint8_t)) {
+		inp[0] = _m512_mask_gather_epi8x8(p[0], m[0]);
+		inp[1] = _m512_mask_gather_epi8x8(p[1], m[1]);
+	} else {
+		inp[0] = _mm512_mask_i64gather_epi32(
+				_mm512_castsi512_si256(zero), m[0], p[0],
+				NULL, sizeof(uint8_t));
+		inp[1] = _mm512_mask_i64gather_epi32(
+				_mm512_castsi512_si256(zero), m[1], p[1],
+				NULL, sizeof(uint8_t));
+	}
+
+	/* squeeze input into one 512-bit register */
+	return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]),
+			zmm_pminp.z, _mm512_castsi256_si512(inp[1]));
+}
+
+/*
+ * Start up to 16 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * idx - match indexes for given flows
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i pdata[2], __m512i *idx, __m512i *di)
+{
+	uint32_t n, m[2], nm[2];
+	__m512i ni, nd[2];
+
+	/* split mask into two - one for each pdata[] */
+	m[0] = msk & ZMM_PTR_MSK;
+	m[1] = msk >> ZMM_PTR_NUM;
+
+	/* calculate masks for new flows */
+	n = __builtin_popcount(m[0]);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	/* load input data pointers for new flows */
+	nd[0] = _mm512_maskz_loadu_epi64(nm[0],
+		flow->idata + flow->num_packets);
+	nd[1] = _mm512_maskz_loadu_epi64(nm[1],
+		flow->idata + flow->num_packets + n);
+
+	/* calculate match indexes of new flows */
+	ni = _mm512_set1_epi32(flow->num_packets);
+	ni = _mm512_add_epi32(ni, zmm_idx_add.z);
+
+	/* merge new and existing flows data */
+	pdata[0] = _mm512_mask_expand_epi64(pdata[0], m[0], nd[0]);
+	pdata[1] = _mm512_mask_expand_epi64(pdata[1], m[1], nd[1]);
+
+	/* update match and data indexes */
+	*idx = _mm512_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Process found matches for up to 16 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexes for given flows
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static inline uint32_t
+match_process_avx512x16(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx,
+	__m512i *tr_lo, __m512i *tr_hi)
+{
+	uint32_t n;
+	__m512i res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z);
+
+	/* mask matched transitions to nop */
+	tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z);
+	tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z);
+
+	/* save found match indexes */
+	_mm512_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask(flow, fmsk, rmsk);
+	start_flow16(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+/*
+ * Test for matches up to 32 (2x16) flows at once,
+ * if matches exist - process them and start new flows.
+ */
+static inline void
+match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
+	__m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2],
+	__m512i tr_lo[2], __m512i tr_hi[2])
+{
+	uint32_t n[2];
+	uint32_t rm[2];
+
+	/* check for matches */
+	rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z);
+	rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
+				rm[0], &di[0], sizeof(uint8_t));
+			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
+				&tr_hi[0]);
+			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
+				zmm_match_mask.z);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
+				rm[1], &di[1], sizeof(uint8_t));
+			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
+				&tr_hi[1]);
+			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
+				zmm_match_mask.z);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 32 flows in parallel.
+ * Use two sets of metadata, each serves 16 flows max.
+ * So in fact we perform search for 2x16 flows.
+ */
+static inline void
+search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
+{
+	uint32_t fm[2];
+	__m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
+
+	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
+			sizeof(uint8_t));
+	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
+			sizeof(uint8_t));
+
+	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT16_MAX;
+	fm[1] = UINT16_MAX;
+
+	/* match check */
+	match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = get_next_bytes_avx512x16(flow, &pdata[0], fm[0],
+			&di[0], sizeof(uint32_t));
+		in[1] = get_next_bytes_avx512x16(flow, &pdata[2], fm[1],
+			&di[1], sizeof(uint32_t));
+
+		/* main 4B loop */
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * Resolve matches for multiple categories (GT 8, use 512b instructions/regs)
+ */
+static inline void
+resolve_mcgt8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, k, mi;
+	__mmask16 cm, sm;
+	__m512i cp, cr, np, nr;
+
+	const uint32_t match_log = 5;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	cm = (1 << nb_cat) - 1;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		cr = _mm512_maskz_loadu_epi32(cm, res + mi);
+		cp = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mi = pm[k] << match_log;
+
+			nr = _mm512_maskz_loadu_epi32(cm, res + mi);
+			np = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+			sm = _mm512_cmpgt_epi32_mask(cp, np);
+			cr = _mm512_mask_mov_epi32(nr, sm, cr);
+			cp = _mm512_mask_mov_epi32(np, sm, cp);
+		}
+
+		_mm512_mask_storeu_epi32(result, cm, cr);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline __m512i
+resolve_match_idx_avx512x16(__m512i mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm512_slli_epi32(mi, match_log);
+}
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline __m512i
+resolve_pri_avx512x16(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask16 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask16 m;
+	__m512i cp, cr, np, nr, mch;
+
+	const __m512i zero = _mm512_set1_epi32(0);
+
+	/* get match indexes */
+	mch = _mm512_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x16(mch);
+
+	/* read result and priority values for first trie */
+	cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	/*
+	 * read result and priority values for next tries and select one
+	 * with highest priority.
+	 */
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm512_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x16(mch);
+
+		nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm512_cmpgt_epi32_mask(cp, np);
+		cr = _mm512_mask_mov_epi32(nr, m, cr);
+		cp = _mm512_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 16) matches for single category
+ */
+static inline void
+resolve_sc_avx512x16(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	__mmask16 msk;
+	__m512i cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip);
+	_mm512_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x16x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t j, k, n;
+	const int32_t *res, *pri;
+	__m512i cr[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) {
+
+		j = k + MASK16_BIT;
+
+		cr[0] = resolve_pri_avx512x16(res, pri, match + k, UINT16_MAX,
+				nb_trie, nb_pkt);
+		cr[1] = resolve_pri_avx512x16(res, pri, match + j, UINT16_MAX,
+				nb_trie, nb_pkt);
+
+		_mm512_storeu_si512(result + k, cr[0]);
+		_mm512_storeu_si512(result + j, cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > MASK16_BIT) {
+			resolve_sc_avx512x16(result + k, res, pri, match + k,
+				MASK16_BIT, nb_trie, nb_pkt);
+			k += MASK16_BIT;
+			n -= MASK16_BIT;
+		}
+		resolve_sc_avx512x16(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x16x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x16x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 11/14] acl: for AVX512 classify use 4B load whenever possible
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (9 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 10/14] acl: introduce 512-bit width AVX512 classify implementation Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 12/14] acl: deduplicate AVX512 code paths Konstantin Ananyev
                       ` (4 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

With the current ACL implementation the first field in the rule definition
always has to be one byte long. Though for optimising the classify
implementation it might be useful to do 4B reads
(as we do for the rest of the fields).
So at build phase, check the user provided field definitions to determine
whether it is safe to do 4B loads for the first ACL field.
Then at run-time this information can be used to choose the classify
behavior.
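
A minimal sketch of a field layout for which this check succeeds
(hypothetical struct and offsets, not part of this patch): with a typical
IPv4 5-tuple rule the first (protocol) field sits at offset 0 while the
other fields have larger offsets, so a 4B read at the first field's offset
can never run past the end of the input buffer and the build phase can
record a 4B first load size:

	#include <stddef.h>
	#include <rte_acl.h>

	struct acl_ipv4_key {
		uint8_t proto;
		uint32_t src_addr;
		uint32_t dst_addr;
		uint16_t src_port;
		uint16_t dst_port;
	};

	static const struct rte_acl_field_def ipv4_defs[] = {
		{
			.type = RTE_ACL_FIELD_TYPE_BITMASK,
			.size = sizeof(uint8_t),
			.field_index = 0,
			.input_index = 0,
			/* not the largest offset, so a 4B first load is safe */
			.offset = offsetof(struct acl_ipv4_key, proto),
		},
		/* ... remaining fields at larger offsets ... */
	};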

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h               |  1 +
 lib/librte_acl/acl_bld.c           | 34 ++++++++++++++++++++++++++++++
 lib/librte_acl/acl_run_avx512.c    |  2 ++
 lib/librte_acl/acl_run_avx512x16.h |  8 +++----
 lib/librte_acl/acl_run_avx512x8.h  |  8 +++----
 lib/librte_acl/rte_acl.c           |  1 +
 6 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 7ac0d12f08..4089ab2a04 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -169,6 +169,7 @@ struct rte_acl_ctx {
 	int32_t             socket_id;
 	/** Socket ID to allocate memory from. */
 	enum rte_acl_classify_alg alg;
+	uint32_t           first_load_sz;
 	void               *rules;
 	uint32_t            max_rules;
 	uint32_t            rule_sz;
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index d1f920b09c..da10864cd8 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -1581,6 +1581,37 @@ acl_check_bld_param(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 	return 0;
 }
 
+/*
+ * With the current ACL implementation the first field in the rule definition
+ * always has to be one byte long. Though for optimising the *classify*
+ * implementation it might be useful to be able to use 4B reads
+ * (as we do for the rest of the fields).
+ * This function checks the input config to determine whether it is safe to
+ * do 4B loads for the first ACL field. For that we need to make sure that
+ * the first field in our rule definition doesn't have the biggest offset,
+ * i.e. we still have other fields located after the first one.
+ * Conversely, if the first field has the largest offset, it means the
+ * first field can occupy the very last byte of the input data buffer,
+ * and we have to do a single byte load for it.
+ */
+static uint32_t
+get_first_load_size(const struct rte_acl_config *cfg)
+{
+	uint32_t i, max_ofs, ofs;
+
+	ofs = 0;
+	max_ofs = 0;
+
+	for (i = 0; i != cfg->num_fields; i++) {
+		if (cfg->defs[i].field_index == 0)
+			ofs = cfg->defs[i].offset;
+		else if (max_ofs < cfg->defs[i].offset)
+			max_ofs = cfg->defs[i].offset;
+	}
+
+	return (ofs < max_ofs) ? sizeof(uint32_t) : sizeof(uint8_t);
+}
+
 int
 rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 {
@@ -1618,6 +1649,9 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 				/* set data indexes. */
 				acl_set_data_indexes(ctx);
 
+				/* determine whether we can always do a 4B load */
+				ctx->first_load_sz = get_first_load_size(cfg);
+
 				/* copy in build config. */
 				ctx->config = *cfg;
 			}
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 74698fa2ea..3fd1e33c3f 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -11,6 +11,7 @@ struct acl_flow_avx512 {
 	uint32_t num_packets;       /* number of packets processed */
 	uint32_t total_packets;     /* max number of packets to process */
 	uint32_t root_index;        /* current root index */
+	uint32_t first_load_sz;     /* first load size for new packet */
 	const uint64_t *trans;      /* transition table */
 	const uint32_t *data_index; /* input data indexes */
 	const uint8_t **idata;      /* input data */
@@ -24,6 +25,7 @@ acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
 {
 	flow->num_packets = 0;
 	flow->total_packets = total_packets;
+	flow->first_load_sz = ctx->first_load_sz;
 	flow->root_index = ctx->trie[trie].root_index;
 	flow->trans = ctx->trans_table;
 	flow->data_index = ctx->trie[trie].data_index;
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
index 981f8d16da..a39df8f3c0 100644
--- a/lib/librte_acl/acl_run_avx512x16.h
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -460,7 +460,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[0] != 0) {
 			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
-				rm[0], &di[0], sizeof(uint8_t));
+				rm[0], &di[0], flow->first_load_sz);
 			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
 				&tr_hi[0]);
 			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
@@ -469,7 +469,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[1] != 0) {
 			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
-				rm[1], &di[1], sizeof(uint8_t));
+				rm[1], &di[1], flow->first_load_sz);
 			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
 				&tr_hi[1]);
 			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
@@ -494,9 +494,9 @@ search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
 	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
 
 	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
-			sizeof(uint8_t));
+			flow->first_load_sz);
 	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
-			sizeof(uint8_t));
+			flow->first_load_sz);
 
 	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
 	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
index cfba0299ed..fedd79b9ae 100644
--- a/lib/librte_acl/acl_run_avx512x8.h
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -418,7 +418,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[0] != 0) {
 			inp[0] = get_next_bytes_avx512x8(flow, &pdata[0],
-				rm[0], &di[0], sizeof(uint8_t));
+				rm[0], &di[0], flow->first_load_sz);
 			first_trans8(flow, inp[0], rm[0], &tr_lo[0],
 				&tr_hi[0]);
 			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
@@ -427,7 +427,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[1] != 0) {
 			inp[1] = get_next_bytes_avx512x8(flow, &pdata[2],
-				rm[1], &di[1], sizeof(uint8_t));
+				rm[1], &di[1], flow->first_load_sz);
 			first_trans8(flow, inp[1], rm[1], &tr_lo[1],
 				&tr_hi[1]);
 			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
@@ -452,9 +452,9 @@ search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
 	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]);
 
 	in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0],
-			sizeof(uint8_t));
+			flow->first_load_sz);
 	in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1],
-			sizeof(uint8_t));
+			flow->first_load_sz);
 
 	first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
 	first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 245af672ee..f1474038e5 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -500,6 +500,7 @@ rte_acl_dump(const struct rte_acl_ctx *ctx)
 	printf("acl context <%s>@%p\n", ctx->name, ctx);
 	printf("  socket_id=%"PRId32"\n", ctx->socket_id);
 	printf("  alg=%"PRId32"\n", ctx->alg);
+	printf("  first_load_sz=%"PRIu32"\n", ctx->first_load_sz);
 	printf("  max_rules=%"PRIu32"\n", ctx->max_rules);
 	printf("  rule_size=%"PRIu32"\n", ctx->rule_sz);
 	printf("  num_rules=%"PRIu32"\n", ctx->num_rules);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 12/14] acl: deduplicate AVX512 code paths
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (10 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 11/14] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 13/14] test/acl: add AVX512 classify support Konstantin Ananyev
                       ` (3 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Current rte_acl_classify_avx512x32() and rte_acl_classify_avx512x16()
code paths are very similar. The only differences are due to
256/512-bit register/intrinsic naming conventions.
So to deduplicate the code:
  - Move common code into “acl_run_avx512_common.h”
  - Use macros to hide difference in naming conventions
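
A minimal sketch of the technique (the macro names below match those used
by the common header, but the definitions here are simplified and
illustrative only): each includer maps the generic type, function-suffix
and intrinsic-prefix macros onto its register width before pulling in the
shared code, then undefines them so the other width variant can be built
the same way:

	/* hypothetical excerpt from a 512-bit (x16) includer */
	#define _T_simd		__m512i
	#define _T_mask		__mmask16
	#define _F_(x)		x##_avx512x16
	#define _M_I_(x)	_mm512_##x

	#include "acl_run_avx512_common.h"

	#undef _M_I_
	#undef _F_
	#undef _T_mask
	#undef _T_simd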

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_run_avx512_common.h | 477 +++++++++++++++++++++
 lib/librte_acl/acl_run_avx512x16.h     | 569 ++++---------------------
 lib/librte_acl/acl_run_avx512x8.h      | 565 ++++--------------------
 3 files changed, 654 insertions(+), 957 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512_common.h

diff --git a/lib/librte_acl/acl_run_avx512_common.h b/lib/librte_acl/acl_run_avx512_common.h
new file mode 100644
index 0000000000..a1c6c6597b
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512_common.h
@@ -0,0 +1,477 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+/*
+ * WARNING: It is not recommended to include this file directly.
+ * Please include "acl_run_avx512x*.h" instead.
+ * To make this file generate proper code, an includer has to
+ * define several macros; refer to "acl_run_avx512x*.h" for more details.
+ */
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same way as quad range nodes.
+ */
+static __rte_always_inline _T_simd
+_F_(calc_addr)(_T_simd index_mask, _T_simd next_input, _T_simd shuffle_input,
+	_T_simd four_32, _T_simd range_base, _T_simd tr_lo, _T_simd tr_hi)
+{
+	__mmask64 qm;
+	_T_mask dfa_msk;
+	_T_simd addr, in, node_type, r, t;
+	_T_simd dfa_ofs, quad_ofs;
+
+	t = _M_SI_(xor)(index_mask, index_mask);
+	in = _M_I_(shuffle_epi8)(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _M_SI_(andnot)(index_mask, tr_lo);
+	addr = _M_SI_(and)(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _M_I_(cmpeq_epi32_mask)(node_type, t);
+
+	/* DFA calculations. */
+	r = _M_I_(srli_epi32)(in, 30);
+	r = _M_I_(add_epi8)(r, range_base);
+	t = _M_I_(srli_epi32)(in, 24);
+	r = _M_I_(shuffle_epi8)(tr_hi, r);
+
+	dfa_ofs = _M_I_(sub_epi32)(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _M_I_(cmpgt_epi8_mask)(in, tr_hi);
+	t = _M_I_(maskz_set1_epi8)(qm, (uint8_t)UINT8_MAX);
+	t = _M_I_(lzcnt_epi32)(t);
+	t = _M_I_(srli_epi32)(t, 3);
+	quad_ofs = _M_I_(sub_epi32)(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _M_I_(mask_mov_epi32)(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _M_I_(add_epi32)(addr, t);
+	return addr;
+}
+
+/*
+ * Process _N_ transitions in parallel.
+ * tr_lo contains low 32 bits for _N_ transitions.
+ * tr_hi contains high 32 bits for _N_ transitions.
+ * next_input contains up to 4 input bytes for _N_ flows.
+ */
+static __rte_always_inline _T_simd
+_F_(trans)(_T_simd next_input, const uint64_t *trans, _T_simd *tr_lo,
+	_T_simd *tr_hi)
+{
+	const int32_t *tr;
+	_T_simd addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all _N_ transitions. */
+	addr = _F_(calc_addr)(_SV_(index_mask), next_input, _SV_(shuffle_input),
+		_SV_(four_32), _SV_(range_base), *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of _N_ transitions at once. */
+	*tr_lo = _M_GI_(i32gather_epi32, addr, tr, sizeof(trans[0]));
+
+	next_input = _M_I_(srli_epi32)(next_input, CHAR_BIT);
+
+	/* load high 32 bits of _N_ transitions at once. */
+	*tr_hi = _M_GI_(i32gather_epi32, addr, (tr + 1), sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to _N_ flows in parallel.
+ * next_input should contain one input byte for up to _N_ flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to _N_ transitions.
+ * tr_hi contains high 32 bits for up to _N_ transitions.
+ */
+static __rte_always_inline void
+_F_(first_trans)(const struct acl_flow_avx512 *flow, _T_simd next_input,
+	_T_mask msk, _T_simd *tr_lo, _T_simd *tr_hi)
+{
+	const int32_t *tr;
+	_T_simd addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _M_I_(set1_epi32)(UINT8_MAX);
+	root = _M_I_(set1_epi32)(flow->root_index);
+
+	addr = _M_SI_(and)(next_input, addr);
+	addr = _M_I_(add_epi32)(root, addr);
+
+	/* load lower 32 bits of _N_ transitions at once. */
+	*tr_lo = _M_MGI_(mask_i32gather_epi32)(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of _N_ transactions at once. */
+	*tr_hi = _M_MGI_(mask_i32gather_epi32)(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to _N_ flows in parallel.
+ * pdata - 8x2 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these _N_ flows.
+ */
+static inline _T_simd
+_F_(get_next_bytes)(const struct acl_flow_avx512 *flow, _T_simd pdata[2],
+	uint32_t msk, _T_simd *di, uint32_t bnum)
+{
+	const int32_t *div;
+	uint32_t m[2];
+	_T_simd one, zero, t, p[2];
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _M_I_(set1_epi32)(1);
+	zero = _M_SI_(xor)(one, one);
+
+	/* load data offsets for given indexes */
+	t = _M_MGI_(mask_i32gather_epi32)(zero, msk, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _M_I_(mask_add_epi32)(*di, msk, *di, one);
+
+	/*
+	 * unsigned expand 32-bit indexes to 64-bit
+	 * (for later pointer arithmetic), i.e:
+	 * for (i = 0; i != _N_; i++)
+	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
+	 */
+	p[0] = _M_I_(maskz_permutexvar_epi32)(_SC_(pmidx_msk), _SV_(pmidx[0]),
+			t);
+	p[1] = _M_I_(maskz_permutexvar_epi32)(_SC_(pmidx_msk), _SV_(pmidx[1]),
+			t);
+
+	p[0] = _M_I_(add_epi64)(p[0], pdata[0]);
+	p[1] = _M_I_(add_epi64)(p[1], pdata[1]);
+
+	/* load input byte(s), either one or four */
+
+	m[0] = msk & _SIMD_PTR_MSK_;
+	m[1] = msk >> _SIMD_PTR_NUM_;
+
+	return _F_(gather_bytes)(zero, p, m, bnum);
+}
+
+/*
+ * Start up to _N_ new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * idx - match indexes for given flows
+ * di - data indexes for these flows.
+ */
+static inline void
+_F_(start_flow)(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	_T_simd pdata[2], _T_simd *idx, _T_simd *di)
+{
+	uint32_t n, m[2], nm[2];
+	_T_simd ni, nd[2];
+
+	/* split mask into two - one for each pdata[] */
+	m[0] = msk & _SIMD_PTR_MSK_;
+	m[1] = msk >> _SIMD_PTR_NUM_;
+
+	/* calculate masks for new flows */
+	n = __builtin_popcount(m[0]);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	/* load input data pointers for new flows */
+	nd[0] = _M_I_(maskz_loadu_epi64)(nm[0],
+			flow->idata + flow->num_packets);
+	nd[1] = _M_I_(maskz_loadu_epi64)(nm[1],
+			flow->idata + flow->num_packets + n);
+
+	/* calculate match indexes of new flows */
+	ni = _M_I_(set1_epi32)(flow->num_packets);
+	ni = _M_I_(add_epi32)(ni, _SV_(idx_add));
+
+	/* merge new and existing flows data */
+	pdata[0] = _M_I_(mask_expand_epi64)(pdata[0], m[0], nd[0]);
+	pdata[1] = _M_I_(mask_expand_epi64)(pdata[1], m[1], nd[1]);
+
+	/* update match and data indexes */
+	*idx = _M_I_(mask_expand_epi32)(*idx, msk, ni);
+	*di = _M_I_(maskz_mov_epi32)(msk ^ _SIMD_MASK_MAX_, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Process found matches for up to _N_ flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexes for given flows
+ * tr_lo contains low 32 bits for up to _N_ transitions.
+ * tr_hi contains high 32 bits for up to _N_ transitions.
+ */
+static inline uint32_t
+_F_(match_process)(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk, _T_simd pdata[2], _T_simd *di, _T_simd *idx,
+	_T_simd *tr_lo, _T_simd *tr_hi)
+{
+	uint32_t n;
+	_T_simd res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _M_SI_(and)(tr_lo[0], _SV_(index_mask));
+
+	/* mask matched transitions to nop */
+	tr_lo[0] = _M_I_(mask_mov_epi32)(tr_lo[0], rmsk[0], _SV_(trlo_idle));
+	tr_hi[0] = _M_I_(mask_mov_epi32)(tr_hi[0], rmsk[0], _SV_(trhi_idle));
+
+	/* save found match indexes */
+	_M_I_(mask_i32scatter_epi32)(flow->matches, rmsk[0], idx[0], res,
+			sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask(flow, fmsk, rmsk);
+	_F_(start_flow)(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+/*
+ * Test for matches up to (2 * _N_) flows at once,
+ * if matches exist - process them and start new flows.
+ */
+static inline void
+_F_(match_check_process)(struct acl_flow_avx512 *flow, uint32_t fm[2],
+	_T_simd pdata[4], _T_simd di[2], _T_simd idx[2], _T_simd inp[2],
+	_T_simd tr_lo[2], _T_simd tr_hi[2])
+{
+	uint32_t n[2];
+	uint32_t rm[2];
+
+	/* check for matches */
+	rm[0] = _M_I_(test_epi32_mask)(tr_lo[0], _SV_(match_mask));
+	rm[1] = _M_I_(test_epi32_mask)(tr_lo[1], _SV_(match_mask));
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = _F_(match_process)(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = _F_(match_process)(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = _F_(get_next_bytes)(flow, &pdata[0],
+					rm[0], &di[0], flow->first_load_sz);
+			_F_(first_trans)(flow, inp[0], rm[0], &tr_lo[0],
+					&tr_hi[0]);
+			rm[0] = _M_I_(test_epi32_mask)(tr_lo[0],
+					_SV_(match_mask));
+		}
+
+		if (n[1] != 0) {
+			inp[1] = _F_(get_next_bytes)(flow, &pdata[2],
+					rm[1], &di[1], flow->first_load_sz);
+			_F_(first_trans)(flow, inp[1], rm[1], &tr_lo[1],
+					&tr_hi[1]);
+			rm[1] = _M_I_(test_epi32_mask)(tr_lo[1],
+					_SV_(match_mask));
+		}
+	}
+}
+
+/*
+ * Perform search for up to (2 * _N_) flows in parallel.
+ * Use two sets of metadata, each serves _N_ flows max.
+ */
+static inline void
+_F_(search_trie)(struct acl_flow_avx512 *flow)
+{
+	uint32_t fm[2];
+	_T_simd di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	_F_(start_flow)(flow, _SIMD_MASK_BIT_, _SIMD_MASK_MAX_,
+			&pdata[0], &idx[0], &di[0]);
+	_F_(start_flow)(flow, _SIMD_MASK_BIT_, _SIMD_MASK_MAX_,
+			&pdata[2], &idx[1], &di[1]);
+
+	in[0] = _F_(get_next_bytes)(flow, &pdata[0], _SIMD_MASK_MAX_, &di[0],
+			flow->first_load_sz);
+	in[1] = _F_(get_next_bytes)(flow, &pdata[2], _SIMD_MASK_MAX_, &di[1],
+			flow->first_load_sz);
+
+	_F_(first_trans)(flow, in[0], _SIMD_MASK_MAX_, &tr_lo[0], &tr_hi[0]);
+	_F_(first_trans)(flow, in[1], _SIMD_MASK_MAX_, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = _SIMD_MASK_MAX_;
+	fm[1] = _SIMD_MASK_MAX_;
+
+	/* match check */
+	_F_(match_check_process)(flow, fm, pdata, di, idx, in, tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = _F_(get_next_bytes)(flow, &pdata[0], fm[0],
+				&di[0], sizeof(uint32_t));
+		in[1] = _F_(get_next_bytes)(flow, &pdata[2], fm[1],
+				&di[1], sizeof(uint32_t));
+
+		/* main 4B loop */
+
+		in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		_F_(match_check_process)(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline _T_simd
+_F_(resolve_match_idx)(_T_simd mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _M_I_(slli_epi32)(mi, match_log);
+}
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline _T_simd
+_F_(resolve_pri)(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], _T_mask msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	_T_mask m;
+	_T_simd cp, cr, np, nr, mch;
+
+	const _T_simd zero = _M_I_(set1_epi32)(0);
+
+	/* get match indexes */
+	mch = _M_I_(maskz_loadu_epi32)(msk, match);
+	mch = _F_(resolve_match_idx)(mch);
+
+	/* read result and priority values for first trie */
+	cr = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, res, sizeof(res[0]));
+	cp = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, pri, sizeof(pri[0]));
+
+	/*
+	 * read result and priority values for next tries and select one
+	 * with highest priority.
+	 */
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _M_I_(maskz_loadu_epi32)(msk, pm);
+		mch = _F_(resolve_match_idx)(mch);
+
+		nr = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, res,
+				sizeof(res[0]));
+		np = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, pri,
+				sizeof(pri[0]));
+
+		m = _M_I_(cmpgt_epi32_mask)(cp, np);
+		cr = _M_I_(mask_mov_epi32)(nr, m, cr);
+		cp = _M_I_(mask_mov_epi32)(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= _N_) matches for single category
+ */
+static inline void
+_F_(resolve_sc)(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	_T_mask msk;
+	_T_simd cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = _F_(resolve_pri)(res, pri, match, msk, nb_trie, nb_skip);
+	_M_I_(mask_storeu_epi32)(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+_F_(resolve_single_cat)(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t j, k, n;
+	const int32_t *res, *pri;
+	_T_simd cr[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~_SIMD_FLOW_MSK_); k += _SIMD_FLOW_NUM_) {
+
+		j = k + _SIMD_MASK_BIT_;
+
+		cr[0] = _F_(resolve_pri)(res, pri, match + k, _SIMD_MASK_MAX_,
+				nb_trie, nb_pkt);
+		cr[1] = _F_(resolve_pri)(res, pri, match + j, _SIMD_MASK_MAX_,
+				nb_trie, nb_pkt);
+
+		_M_SI_(storeu)((void *)(result + k), cr[0]);
+		_M_SI_(storeu)((void *)(result + j), cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > _SIMD_MASK_BIT_) {
+			_F_(resolve_sc)(result + k, res, pri, match + k,
+				_SIMD_MASK_BIT_, nb_trie, nb_pkt);
+			k += _SIMD_MASK_BIT_;
+			n -= _SIMD_MASK_BIT_;
+		}
+		_F_(resolve_sc)(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
index a39df8f3c0..da244bc257 100644
--- a/lib/librte_acl/acl_run_avx512x16.h
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -2,16 +2,57 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
-#define	MASK16_BIT	(sizeof(__mmask16) * CHAR_BIT)
+/*
+ * Defines required by "acl_run_avx512_common.h".
+ * Note that all of them have to be undefined by the end
+ * of this file, as "acl_run_avx512_common.h" can be included several
+ * times from different *.h files for the same *.c.
+ */
+
+/*
+ * This implementation uses 512-bit registers (zmm) and intrinsics.
+ * So our main SIMD type is 512 bits wide and each such variable can
+ * process sizeof(__m512i) / sizeof(uint32_t) == 16 entries in parallel.
+ */
+#define _T_simd		__m512i
+#define _T_mask		__mmask16
+
+/* Naming convention for static const variables. */
+#define _SC_(x)		zmm_##x
+#define _SV_(x)		(zmm_##x.z)
+
+/* Naming convention for internal functions. */
+#define _F_(x)		x##_avx512x16
+
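+/*
+ * For example, with the definitions above _F_(search_trie) expands to
+ * search_trie_avx512x16 and _SV_(match_mask) expands to (zmm_match_mask.z).
+ */
+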
+/*
+ * The same intrinsics have a different syntax (depending on the bit-width),
+ * so to overcome that, a few macros need to be defined.
+ */
+
+/* Naming convention for generic epi (packed integers) type intrinsics. */
+#define _M_I_(x)	_mm512_##x
+
+/* Naming convention for si (whole SIMD integer) type intrinsics. */
+#define _M_SI_(x)	_mm512_##x##_si512
+
+/* Naming convention for masked gather type intrinsics. */
+#define _M_MGI_(x)	_mm512_##x
+
+/* Naming convention for gather type intrinsics. */
+#define _M_GI_(name, idx, base, scale)	_mm512_##name(idx, base, scale)
 
-#define NUM_AVX512X16X2	(2 * MASK16_BIT)
-#define MSK_AVX512X16X2	(NUM_AVX512X16X2 - 1)
+/* num/mask of transitions per SIMD regs */
+#define _SIMD_MASK_BIT_	(sizeof(_T_simd) / sizeof(uint32_t))
+#define _SIMD_MASK_MAX_	RTE_LEN2MASK(_SIMD_MASK_BIT_, uint32_t)
+
+#define _SIMD_FLOW_NUM_	(2 * _SIMD_MASK_BIT_)
+#define _SIMD_FLOW_MSK_	(_SIMD_FLOW_NUM_ - 1)
 
 /* num/mask of pointers per SIMD regs */
-#define ZMM_PTR_NUM	(sizeof(__m512i) / sizeof(uintptr_t))
-#define ZMM_PTR_MSK	RTE_LEN2MASK(ZMM_PTR_NUM, uint32_t)
+#define _SIMD_PTR_NUM_	(sizeof(_T_simd) / sizeof(uintptr_t))
+#define _SIMD_PTR_MSK_	RTE_LEN2MASK(_SIMD_PTR_NUM_, uint32_t)
 
-static const __rte_x86_zmm_t zmm_match_mask = {
+static const __rte_x86_zmm_t _SC_(match_mask) = {
 	.u32 = {
 		RTE_ACL_NODE_MATCH,
 		RTE_ACL_NODE_MATCH,
@@ -32,7 +73,7 @@ static const __rte_x86_zmm_t zmm_match_mask = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_index_mask = {
+static const __rte_x86_zmm_t _SC_(index_mask) = {
 	.u32 = {
 		RTE_ACL_NODE_INDEX,
 		RTE_ACL_NODE_INDEX,
@@ -53,7 +94,7 @@ static const __rte_x86_zmm_t zmm_index_mask = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_trlo_idle = {
+static const __rte_x86_zmm_t _SC_(trlo_idle) = {
 	.u32 = {
 		RTE_ACL_IDLE_NODE,
 		RTE_ACL_IDLE_NODE,
@@ -74,7 +115,7 @@ static const __rte_x86_zmm_t zmm_trlo_idle = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_trhi_idle = {
+static const __rte_x86_zmm_t _SC_(trhi_idle) = {
 	.u32 = {
 		0, 0, 0, 0,
 		0, 0, 0, 0,
@@ -83,7 +124,7 @@ static const __rte_x86_zmm_t zmm_trhi_idle = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_shuffle_input = {
+static const __rte_x86_zmm_t _SC_(shuffle_input) = {
 	.u32 = {
 		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
 		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
@@ -92,7 +133,7 @@ static const __rte_x86_zmm_t zmm_shuffle_input = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_four_32 = {
+static const __rte_x86_zmm_t _SC_(four_32) = {
 	.u32 = {
 		4, 4, 4, 4,
 		4, 4, 4, 4,
@@ -101,7 +142,7 @@ static const __rte_x86_zmm_t zmm_four_32 = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_idx_add = {
+static const __rte_x86_zmm_t _SC_(idx_add) = {
 	.u32 = {
 		0, 1, 2, 3,
 		4, 5, 6, 7,
@@ -110,7 +151,7 @@ static const __rte_x86_zmm_t zmm_idx_add = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_range_base = {
+static const __rte_x86_zmm_t _SC_(range_base) = {
 	.u32 = {
 		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
 		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
@@ -119,16 +160,16 @@ static const __rte_x86_zmm_t zmm_range_base = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_pminp = {
+static const __rte_x86_zmm_t _SC_(pminp) = {
 	.u32 = {
 		0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
 		0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
 	},
 };
 
-static const __mmask16 zmm_pmidx_msk = 0x5555;
+static const _T_mask _SC_(pmidx_msk) = 0x5555;
 
-static const __rte_x86_zmm_t zmm_pmidx[2] = {
+static const __rte_x86_zmm_t _SC_(pmidx[2]) = {
 	[0] = {
 		.u32 = {
 			0, 0, 1, 0, 2, 0, 3, 0,
@@ -148,7 +189,7 @@ static const __rte_x86_zmm_t zmm_pmidx[2] = {
  * gather load on a byte quantity. So we have to mimic it in SW,
  * by doing 8x1B scalar loads.
  */
-static inline ymm_t
+static inline __m256i
 _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 {
 	rte_ymm_t v;
@@ -156,7 +197,7 @@ _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 
 	static const uint32_t zero;
 
-	p.z = _mm512_mask_set1_epi64(pdata, mask ^ ZMM_PTR_MSK,
+	p.z = _mm512_mask_set1_epi64(pdata, mask ^ _SIMD_PTR_MSK_,
 		(uintptr_t)&zero);
 
 	v.u32[0] = *(uint8_t *)p.u64[0];
@@ -172,369 +213,29 @@ _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 }
 
 /*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes not supposed to be encountered here.
- * For quad range nodes:
- * Calculate number of range boundaries that are less than the
- * input value. Range boundaries for each node are in signed 8 bit,
- * ordered from -128 to 127.
- * This is effectively a popcnt of bytes that are greater than the
- * input byte.
- * Single nodes are processed in the same ways as quad range nodes.
- */
-static __rte_always_inline __m512i
-calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input,
-	__m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi)
-{
-	__mmask64 qm;
-	__mmask16 dfa_msk;
-	__m512i addr, in, node_type, r, t;
-	__m512i dfa_ofs, quad_ofs;
-
-	t = _mm512_xor_si512(index_mask, index_mask);
-	in = _mm512_shuffle_epi8(next_input, shuffle_input);
-
-	/* Calc node type and node addr */
-	node_type = _mm512_andnot_si512(index_mask, tr_lo);
-	addr = _mm512_and_si512(index_mask, tr_lo);
-
-	/* mask for DFA type(0) nodes */
-	dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t);
-
-	/* DFA calculations. */
-	r = _mm512_srli_epi32(in, 30);
-	r = _mm512_add_epi8(r, range_base);
-	t = _mm512_srli_epi32(in, 24);
-	r = _mm512_shuffle_epi8(tr_hi, r);
-
-	dfa_ofs = _mm512_sub_epi32(t, r);
-
-	/* QUAD/SINGLE calculations. */
-	qm = _mm512_cmpgt_epi8_mask(in, tr_hi);
-	t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
-	t = _mm512_lzcnt_epi32(t);
-	t = _mm512_srli_epi32(t, 3);
-	quad_ofs = _mm512_sub_epi32(four_32, t);
-
-	/* blend DFA and QUAD/SINGLE. */
-	t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
-
-	/* calculate address for next transitions. */
-	addr = _mm512_add_epi32(addr, t);
-	return addr;
-}
-
-/*
- * Process 16 transitions in parallel.
- * tr_lo contains low 32 bits for 16 transition.
- * tr_hi contains high 32 bits for 16 transition.
- * next_input contains up to 4 input bytes for 16 flows.
+ * Gather 4/1 input bytes for up to 16 (2*8) locations in parallel.
  */
 static __rte_always_inline __m512i
-transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo,
-	__m512i *tr_hi)
-{
-	const int32_t *tr;
-	__m512i addr;
-
-	tr = (const int32_t *)(uintptr_t)trans;
-
-	/* Calculate the address (array index) for all 16 transitions. */
-	addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z,
-		zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi);
-
-	/* load lower 32 bits of 16 transactions at once. */
-	*tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0]));
-
-	next_input = _mm512_srli_epi32(next_input, CHAR_BIT);
-
-	/* load high 32 bits of 16 transactions at once. */
-	*tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0]));
-
-	return next_input;
-}
-
-/*
- * Execute first transition for up to 16 flows in parallel.
- * next_input should contain one input byte for up to 16 flows.
- * msk - mask of active flows.
- * tr_lo contains low 32 bits for up to 16 transitions.
- * tr_hi contains high 32 bits for up to 16 transitions.
- */
-static __rte_always_inline void
-first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input,
-	__mmask16 msk, __m512i *tr_lo, __m512i *tr_hi)
+_F_(gather_bytes)(__m512i zero, const __m512i p[2], const uint32_t m[2],
+	uint32_t bnum)
 {
-	const int32_t *tr;
-	__m512i addr, root;
-
-	tr = (const int32_t *)(uintptr_t)flow->trans;
-
-	addr = _mm512_set1_epi32(UINT8_MAX);
-	root = _mm512_set1_epi32(flow->root_index);
-
-	addr = _mm512_and_si512(next_input, addr);
-	addr = _mm512_add_epi32(root, addr);
-
-	/* load lower 32 bits of 16 transactions at once. */
-	*tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr,
-		sizeof(flow->trans[0]));
-
-	/* load high 32 bits of 16 transactions at once. */
-	*tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
-		sizeof(flow->trans[0]));
-}
-
-/*
- * Load and return next 4 input bytes for up to 16 flows in parallel.
- * pdata - 8x2 pointers to flow input data
- * mask - mask of active flows.
- * di - data indexes for these 16 flows.
- */
-static inline __m512i
-get_next_bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2],
-	uint32_t msk, __m512i *di, uint32_t bnum)
-{
-	const int32_t *div;
-	uint32_t m[2];
-	__m512i one, zero, t, p[2];
-	ymm_t inp[2];
-
-	div = (const int32_t *)flow->data_index;
-
-	one = _mm512_set1_epi32(1);
-	zero = _mm512_xor_si512(one, one);
-
-	/* load data offsets for given indexes */
-	t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
-
-	/* increment data indexes */
-	*di = _mm512_mask_add_epi32(*di, msk, *di, one);
-
-	/*
-	 * unsigned expand 32-bit indexes to 64-bit
-	 * (for later pointer arithmetic), i.e:
-	 * for (i = 0; i != 16; i++)
-	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
-	 */
-	p[0] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[0].z, t);
-	p[1] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[1].z, t);
-
-	p[0] = _mm512_add_epi64(p[0], pdata[0]);
-	p[1] = _mm512_add_epi64(p[1], pdata[1]);
-
-	/* load input byte(s), either one or four */
-
-	m[0] = msk & ZMM_PTR_MSK;
-	m[1] = msk >> ZMM_PTR_NUM;
+	__m256i inp[2];
 
 	if (bnum == sizeof(uint8_t)) {
 		inp[0] = _m512_mask_gather_epi8x8(p[0], m[0]);
 		inp[1] = _m512_mask_gather_epi8x8(p[1], m[1]);
 	} else {
 		inp[0] = _mm512_mask_i64gather_epi32(
-				_mm512_castsi512_si256(zero), m[0], p[0],
-				NULL, sizeof(uint8_t));
+				_mm512_castsi512_si256(zero),
+				m[0], p[0], NULL, sizeof(uint8_t));
 		inp[1] = _mm512_mask_i64gather_epi32(
-				_mm512_castsi512_si256(zero), m[1], p[1],
-				NULL, sizeof(uint8_t));
+				_mm512_castsi512_si256(zero),
+				m[1], p[1], NULL, sizeof(uint8_t));
 	}
 
 	/* squeeze input into one 512-bit register */
 	return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]),
-			zmm_pminp.z, _mm512_castsi256_si512(inp[1]));
-}
-
-/*
- * Start up to 16 new flows.
- * num - number of flows to start
- * msk - mask of new flows.
- * pdata - pointers to flow input data
- * idx - match indexed for given flows
- * di - data indexes for these flows.
- */
-static inline void
-start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
-	__m512i pdata[2], __m512i *idx, __m512i *di)
-{
-	uint32_t n, m[2], nm[2];
-	__m512i ni, nd[2];
-
-	/* split mask into two - one for each pdata[] */
-	m[0] = msk & ZMM_PTR_MSK;
-	m[1] = msk >> ZMM_PTR_NUM;
-
-	/* caluclate masks for new flows */
-	n = __builtin_popcount(m[0]);
-	nm[0] = (1 << n) - 1;
-	nm[1] = (1 << (num - n)) - 1;
-
-	/* load input data pointers for new flows */
-	nd[0] = _mm512_maskz_loadu_epi64(nm[0],
-		flow->idata + flow->num_packets);
-	nd[1] = _mm512_maskz_loadu_epi64(nm[1],
-		flow->idata + flow->num_packets + n);
-
-	/* calculate match indexes of new flows */
-	ni = _mm512_set1_epi32(flow->num_packets);
-	ni = _mm512_add_epi32(ni, zmm_idx_add.z);
-
-	/* merge new and existing flows data */
-	pdata[0] = _mm512_mask_expand_epi64(pdata[0], m[0], nd[0]);
-	pdata[1] = _mm512_mask_expand_epi64(pdata[1], m[1], nd[1]);
-
-	/* update match and data indexes */
-	*idx = _mm512_mask_expand_epi32(*idx, msk, ni);
-	*di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di);
-
-	flow->num_packets += num;
-}
-
-/*
- * Process found matches for up to 16 flows.
- * fmsk - mask of active flows
- * rmsk - mask of found matches
- * pdata - pointers to flow input data
- * di - data indexes for these flows
- * idx - match indexed for given flows
- * tr_lo contains low 32 bits for up to 8 transitions.
- * tr_hi contains high 32 bits for up to 8 transitions.
- */
-static inline uint32_t
-match_process_avx512x16(struct acl_flow_avx512 *flow, uint32_t *fmsk,
-	uint32_t *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx,
-	__m512i *tr_lo, __m512i *tr_hi)
-{
-	uint32_t n;
-	__m512i res;
-
-	if (rmsk[0] == 0)
-		return 0;
-
-	/* extract match indexes */
-	res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z);
-
-	/* mask  matched transitions to nop */
-	tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z);
-	tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z);
-
-	/* save found match indexes */
-	_mm512_mask_i32scatter_epi32(flow->matches, rmsk[0],
-		idx[0], res, sizeof(flow->matches[0]));
-
-	/* update masks and start new flows for matches */
-	n = update_flow_mask(flow, fmsk, rmsk);
-	start_flow16(flow, n, rmsk[0], pdata, idx, di);
-
-	return n;
-}
-
-/*
- * Test for matches ut to 32 (2x16) flows at once,
- * if matches exist - process them and start new flows.
- */
-static inline void
-match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
-	__m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2],
-	__m512i tr_lo[2], __m512i tr_hi[2])
-{
-	uint32_t n[2];
-	uint32_t rm[2];
-
-	/* check for matches */
-	rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z);
-	rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z);
-
-	/* till unprocessed matches exist */
-	while ((rm[0] | rm[1]) != 0) {
-
-		/* process matches and start new flows */
-		n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0],
-			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
-		n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2],
-			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
-
-		/* execute first transition for new flows, if any */
-
-		if (n[0] != 0) {
-			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
-				rm[0], &di[0], flow->first_load_sz);
-			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
-				&tr_hi[0]);
-			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
-				zmm_match_mask.z);
-		}
-
-		if (n[1] != 0) {
-			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
-				rm[1], &di[1], flow->first_load_sz);
-			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
-				&tr_hi[1]);
-			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
-				zmm_match_mask.z);
-		}
-	}
-}
-
-/*
- * Perform search for up to 32 flows in parallel.
- * Use two sets of metadata, each serves 16 flows max.
- * So in fact we perform search for 2x16 flows.
- */
-static inline void
-search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
-{
-	uint32_t fm[2];
-	__m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
-
-	/* first 1B load */
-	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]);
-	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
-
-	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
-			flow->first_load_sz);
-	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
-			flow->first_load_sz);
-
-	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
-	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
-
-	fm[0] = UINT16_MAX;
-	fm[1] = UINT16_MAX;
-
-	/* match check */
-	match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
-		tr_lo, tr_hi);
-
-	while ((fm[0] | fm[1]) != 0) {
-
-		/* load next 4B */
-
-		in[0] = get_next_bytes_avx512x16(flow, &pdata[0], fm[0],
-			&di[0], sizeof(uint32_t));
-		in[1] = get_next_bytes_avx512x16(flow, &pdata[2], fm[1],
-			&di[1], sizeof(uint32_t));
-
-		/* main 4B loop */
-
-		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		/* check for matches */
-		match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
-			tr_lo, tr_hi);
-	}
+			_SV_(pminp), _mm512_castsi256_si512(inp[1]));
 }
 
 /*
@@ -582,120 +283,12 @@ resolve_mcgt8_avx512x1(uint32_t result[],
 	}
 }
 
-/*
- * resolve match index to actual result/priority offset.
- */
-static inline __m512i
-resolve_match_idx_avx512x16(__m512i mi)
-{
-	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
-		1 << (match_log + 2));
-	return _mm512_slli_epi32(mi, match_log);
-}
-
-/*
- * Resolve multiple matches for the same flow based on priority.
- */
-static inline __m512i
-resolve_pri_avx512x16(const int32_t res[], const int32_t pri[],
-	const uint32_t match[], __mmask16 msk, uint32_t nb_trie,
-	uint32_t nb_skip)
-{
-	uint32_t i;
-	const uint32_t *pm;
-	__mmask16 m;
-	__m512i cp, cr, np, nr, mch;
-
-	const __m512i zero = _mm512_set1_epi32(0);
-
-	/* get match indexes */
-	mch = _mm512_maskz_loadu_epi32(msk, match);
-	mch = resolve_match_idx_avx512x16(mch);
-
-	/* read result and priority values for first trie */
-	cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
-	cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
-
-	/*
-	 * read result and priority values for next tries and select one
-	 * with highest priority.
-	 */
-	for (i = 1, pm = match + nb_skip; i != nb_trie;
-			i++, pm += nb_skip) {
-
-		mch = _mm512_maskz_loadu_epi32(msk, pm);
-		mch = resolve_match_idx_avx512x16(mch);
-
-		nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res,
-			sizeof(res[0]));
-		np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri,
-			sizeof(pri[0]));
-
-		m = _mm512_cmpgt_epi32_mask(cp, np);
-		cr = _mm512_mask_mov_epi32(nr, m, cr);
-		cp = _mm512_mask_mov_epi32(np, m, cp);
-	}
-
-	return cr;
-}
-
-/*
- * Resolve num (<= 16) matches for single category
- */
-static inline void
-resolve_sc_avx512x16(uint32_t result[], const int32_t res[],
-	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
-	uint32_t nb_trie, uint32_t nb_skip)
-{
-	__mmask16 msk;
-	__m512i cr;
-
-	msk = (1 << nb_pkt) - 1;
-	cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip);
-	_mm512_mask_storeu_epi32(result, msk, cr);
-}
+#include "acl_run_avx512_common.h"
 
 /*
- * Resolve matches for single category
+ * Perform search for up to (2 * 16) flows in parallel.
+ * Use two sets of metadata, each serving 16 flows max.
  */
-static inline void
-resolve_sc_avx512x16x2(uint32_t result[],
-	const struct rte_acl_match_results pr[], const uint32_t match[],
-	uint32_t nb_pkt, uint32_t nb_trie)
-{
-	uint32_t j, k, n;
-	const int32_t *res, *pri;
-	__m512i cr[2];
-
-	res = (const int32_t *)pr->results;
-	pri = pr->priority;
-
-	for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) {
-
-		j = k + MASK16_BIT;
-
-		cr[0] = resolve_pri_avx512x16(res, pri, match + k, UINT16_MAX,
-				nb_trie, nb_pkt);
-		cr[1] = resolve_pri_avx512x16(res, pri, match + j, UINT16_MAX,
-				nb_trie, nb_pkt);
-
-		_mm512_storeu_si512(result + k, cr[0]);
-		_mm512_storeu_si512(result + j, cr[1]);
-	}
-
-	n = nb_pkt - k;
-	if (n != 0) {
-		if (n > MASK16_BIT) {
-			resolve_sc_avx512x16(result + k, res, pri, match + k,
-				MASK16_BIT, nb_trie, nb_pkt);
-			k += MASK16_BIT;
-			n -= MASK16_BIT;
-		}
-		resolve_sc_avx512x16(result + k, res, pri, match + k, n,
-				nb_trie, nb_pkt);
-	}
-}
-
 static inline int
 search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t total_packets, uint32_t categories)
@@ -711,7 +304,7 @@ search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
 
 		/* process the trie */
-		search_trie_avx512x16x2(&flow);
+		_F_(search_trie)(&flow);
 	}
 
 	/* resolve matches */
@@ -719,7 +312,7 @@ search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		(ctx->trans_table + ctx->match_index);
 
 	if (categories == 1)
-		resolve_sc_avx512x16x2(results, pr, match, total_packets,
+		_F_(resolve_single_cat)(results, pr, match, total_packets,
 			ctx->num_tries);
 	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
 		resolve_mcle8_avx512x1(results, pr, match, total_packets,
@@ -730,3 +323,19 @@ search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	return 0;
 }
+
+#undef _SIMD_PTR_MSK_
+#undef _SIMD_PTR_NUM_
+#undef _SIMD_FLOW_MSK_
+#undef _SIMD_FLOW_NUM_
+#undef _SIMD_MASK_MAX_
+#undef _SIMD_MASK_BIT_
+#undef _M_GI_
+#undef _M_MGI_
+#undef _M_SI_
+#undef _M_I_
+#undef _F_
+#undef _SV_
+#undef _SC_
+#undef _T_mask
+#undef _T_simd
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
index fedd79b9ae..61ac9d1b47 100644
--- a/lib/librte_acl/acl_run_avx512x8.h
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -2,16 +2,57 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
-#define MASK8_BIT	(sizeof(__mmask8) * CHAR_BIT)
+/*
+ * Defines required by "acl_run_avx512_common.h".
+ * Note that all of them have to be undefined by the end
+ * of this file, as "acl_run_avx512_common.h" can be included several
+ * times from different *.h files for the same *.c.
+ */
+
+/*
+ * This implementation uses 256-bit registers (ymm) and intrinsics.
+ * So our main SIMD type is 256 bits wide and each such variable can
+ * process sizeof(__m256i) / sizeof(uint32_t) == 8 entries in parallel.
+ */
+#define _T_simd		__m256i
+#define _T_mask		__mmask8
+
+/* Naming convention for static const variables. */
+#define _SC_(x)		ymm_##x
+#define _SV_(x)		(ymm_##x.y)
+
+/* Naming convention for internal functions. */
+#define _F_(x)		x##_avx512x8
+
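+/*
+ * For example, with the definitions above _F_(search_trie) expands to
+ * search_trie_avx512x8 and _SV_(match_mask) expands to (ymm_match_mask.y).
+ */
+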
+/*
+ * The same intrinsics have a different syntax (depending on the bit-width),
+ * so to overcome that, a few macros need to be defined.
+ */
+
+/* Naming convention for generic epi (packed integers) type intrinsics. */
+#define _M_I_(x)	_mm256_##x
+
+/* Naming convention for si (whole SIMD integer) type intrinsics. */
+#define _M_SI_(x)	_mm256_##x##_si256
 
-#define NUM_AVX512X8X2	(2 * MASK8_BIT)
-#define MSK_AVX512X8X2	(NUM_AVX512X8X2 - 1)
+/* Naming convention for masked gather type intrinsics. */
+#define _M_MGI_(x)	_mm256_m##x
+
+/* Naming convention for gather type intrinsics. */
+#define _M_GI_(name, idx, base, scale)	_mm256_##name(base, idx, scale)
+
+/* num/mask of transitions per SIMD regs */
+#define _SIMD_MASK_BIT_	(sizeof(_T_simd) / sizeof(uint32_t))
+#define _SIMD_MASK_MAX_	RTE_LEN2MASK(_SIMD_MASK_BIT_, uint32_t)
+
+#define _SIMD_FLOW_NUM_	(2 * _SIMD_MASK_BIT_)
+#define _SIMD_FLOW_MSK_	(_SIMD_FLOW_NUM_ - 1)
 
 /* num/mask of pointers per SIMD regs */
-#define YMM_PTR_NUM	(sizeof(__m256i) / sizeof(uintptr_t))
-#define YMM_PTR_MSK	RTE_LEN2MASK(YMM_PTR_NUM, uint32_t)
+#define _SIMD_PTR_NUM_	(sizeof(_T_simd) / sizeof(uintptr_t))
+#define _SIMD_PTR_MSK_	RTE_LEN2MASK(_SIMD_PTR_NUM_, uint32_t)
 
-static const rte_ymm_t ymm_match_mask = {
+static const rte_ymm_t _SC_(match_mask) = {
 	.u32 = {
 		RTE_ACL_NODE_MATCH,
 		RTE_ACL_NODE_MATCH,
@@ -24,7 +65,7 @@ static const rte_ymm_t ymm_match_mask = {
 	},
 };
 
-static const rte_ymm_t ymm_index_mask = {
+static const rte_ymm_t _SC_(index_mask) = {
 	.u32 = {
 		RTE_ACL_NODE_INDEX,
 		RTE_ACL_NODE_INDEX,
@@ -37,7 +78,7 @@ static const rte_ymm_t ymm_index_mask = {
 	},
 };
 
-static const rte_ymm_t ymm_trlo_idle = {
+static const rte_ymm_t _SC_(trlo_idle) = {
 	.u32 = {
 		RTE_ACL_IDLE_NODE,
 		RTE_ACL_IDLE_NODE,
@@ -50,51 +91,51 @@ static const rte_ymm_t ymm_trlo_idle = {
 	},
 };
 
-static const rte_ymm_t ymm_trhi_idle = {
+static const rte_ymm_t _SC_(trhi_idle) = {
 	.u32 = {
 		0, 0, 0, 0,
 		0, 0, 0, 0,
 	},
 };
 
-static const rte_ymm_t ymm_shuffle_input = {
+static const rte_ymm_t _SC_(shuffle_input) = {
 	.u32 = {
 		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
 		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
 	},
 };
 
-static const rte_ymm_t ymm_four_32 = {
+static const rte_ymm_t _SC_(four_32) = {
 	.u32 = {
 		4, 4, 4, 4,
 		4, 4, 4, 4,
 	},
 };
 
-static const rte_ymm_t ymm_idx_add = {
+static const rte_ymm_t _SC_(idx_add) = {
 	.u32 = {
 		0, 1, 2, 3,
 		4, 5, 6, 7,
 	},
 };
 
-static const rte_ymm_t ymm_range_base = {
+static const rte_ymm_t _SC_(range_base) = {
 	.u32 = {
 		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
 		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
 	},
 };
 
-static const rte_ymm_t ymm_pminp = {
+static const rte_ymm_t _SC_(pminp) = {
 	.u32 = {
 		0x00, 0x01, 0x02, 0x03,
 		0x08, 0x09, 0x0a, 0x0b,
 	},
 };
 
-static const __mmask16 ymm_pmidx_msk = 0x55;
+static const __mmask16 _SC_(pmidx_msk) = 0x55;
 
-static const rte_ymm_t ymm_pmidx[2] = {
+static const rte_ymm_t _SC_(pmidx[2]) = {
 	[0] = {
 		.u32 = {
 			0, 0, 1, 0, 2, 0, 3, 0,
@@ -120,7 +161,7 @@ _m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask)
 
 	static const uint32_t zero;
 
-	p.y = _mm256_mask_set1_epi64(pdata, mask ^ YMM_PTR_MSK,
+	p.y = _mm256_mask_set1_epi64(pdata, mask ^ _SIMD_PTR_MSK_,
 		(uintptr_t)&zero);
 
 	v.u32[0] = *(uint8_t *)p.u64[0];
@@ -132,483 +173,37 @@ _m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask)
 }
 
 /*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes not supposed to be encountered here.
- * For quad range nodes:
- * Calculate number of range boundaries that are less than the
- * input value. Range boundaries for each node are in signed 8 bit,
- * ordered from -128 to 127.
- * This is effectively a popcnt of bytes that are greater than the
- * input byte.
- * Single nodes are processed in the same ways as quad range nodes.
- */
-static __rte_always_inline __m256i
-calc_addr8(__m256i index_mask, __m256i next_input, __m256i shuffle_input,
-	__m256i four_32, __m256i range_base, __m256i tr_lo, __m256i tr_hi)
-{
-	__mmask32 qm;
-	__mmask8 dfa_msk;
-	__m256i addr, in, node_type, r, t;
-	__m256i dfa_ofs, quad_ofs;
-
-	t = _mm256_xor_si256(index_mask, index_mask);
-	in = _mm256_shuffle_epi8(next_input, shuffle_input);
-
-	/* Calc node type and node addr */
-	node_type = _mm256_andnot_si256(index_mask, tr_lo);
-	addr = _mm256_and_si256(index_mask, tr_lo);
-
-	/* mask for DFA type(0) nodes */
-	dfa_msk = _mm256_cmpeq_epi32_mask(node_type, t);
-
-	/* DFA calculations. */
-	r = _mm256_srli_epi32(in, 30);
-	r = _mm256_add_epi8(r, range_base);
-	t = _mm256_srli_epi32(in, 24);
-	r = _mm256_shuffle_epi8(tr_hi, r);
-
-	dfa_ofs = _mm256_sub_epi32(t, r);
-
-	/* QUAD/SINGLE calculations. */
-	qm = _mm256_cmpgt_epi8_mask(in, tr_hi);
-	t = _mm256_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
-	t = _mm256_lzcnt_epi32(t);
-	t = _mm256_srli_epi32(t, 3);
-	quad_ofs = _mm256_sub_epi32(four_32, t);
-
-	/* blend DFA and QUAD/SINGLE. */
-	t = _mm256_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
-
-	/* calculate address for next transitions. */
-	addr = _mm256_add_epi32(addr, t);
-	return addr;
-}
-
-/*
- * Process 16 transitions in parallel.
- * tr_lo contains low 32 bits for 16 transition.
- * tr_hi contains high 32 bits for 16 transition.
- * next_input contains up to 4 input bytes for 16 flows.
+ * Gather 4/1 input bytes for up to 8 (2*4) locations in parallel.
  */
 static __rte_always_inline __m256i
-transition8(__m256i next_input, const uint64_t *trans, __m256i *tr_lo,
-	__m256i *tr_hi)
-{
-	const int32_t *tr;
-	__m256i addr;
-
-	tr = (const int32_t *)(uintptr_t)trans;
-
-	/* Calculate the address (array index) for all 8 transitions. */
-	addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
-		ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi);
-
-	/* load lower 32 bits of 16 transactions at once. */
-	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
-
-	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
-
-	/* load high 32 bits of 16 transactions at once. */
-	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
-
-	return next_input;
-}
-
-/*
- * Execute first transition for up to 16 flows in parallel.
- * next_input should contain one input byte for up to 16 flows.
- * msk - mask of active flows.
- * tr_lo contains low 32 bits for up to 16 transitions.
- * tr_hi contains high 32 bits for up to 16 transitions.
- */
-static __rte_always_inline void
-first_trans8(const struct acl_flow_avx512 *flow, __m256i next_input,
-	__mmask8 msk, __m256i *tr_lo, __m256i *tr_hi)
+_F_(gather_bytes)(__m256i zero, const __m256i p[2], const uint32_t m[2],
+	uint32_t bnum)
 {
-	const int32_t *tr;
-	__m256i addr, root;
-
-	tr = (const int32_t *)(uintptr_t)flow->trans;
-
-	addr = _mm256_set1_epi32(UINT8_MAX);
-	root = _mm256_set1_epi32(flow->root_index);
-
-	addr = _mm256_and_si256(next_input, addr);
-	addr = _mm256_add_epi32(root, addr);
-
-	/* load lower 32 bits of 16 transactions at once. */
-	*tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr,
-		sizeof(flow->trans[0]));
-
-	/* load high 32 bits of 16 transactions at once. */
-	*tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
-		sizeof(flow->trans[0]));
-}
-
-/*
- * Load and return next 4 input bytes for up to 16 flows in parallel.
- * pdata - 8x2 pointers to flow input data
- * mask - mask of active flows.
- * di - data indexes for these 16 flows.
- */
-static inline __m256i
-get_next_bytes_avx512x8(const struct acl_flow_avx512 *flow, __m256i pdata[2],
-	uint32_t msk, __m256i *di, uint32_t bnum)
-{
-	const int32_t *div;
-	uint32_t m[2];
-	__m256i one, zero, t, p[2];
 	__m128i inp[2];
 
-	div = (const int32_t *)flow->data_index;
-
-	one = _mm256_set1_epi32(1);
-	zero = _mm256_xor_si256(one, one);
-
-	/* load data offsets for given indexes */
-	t = _mm256_mmask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
-
-	/* increment data indexes */
-	*di = _mm256_mask_add_epi32(*di, msk, *di, one);
-
-	/*
-	 * unsigned expand 32-bit indexes to 64-bit
-	 * (for later pointer arithmetic), i.e:
-	 * for (i = 0; i != 16; i++)
-	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
-	 */
-	p[0] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[0].y, t);
-	p[1] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[1].y, t);
-
-	p[0] = _mm256_add_epi64(p[0], pdata[0]);
-	p[1] = _mm256_add_epi64(p[1], pdata[1]);
-
-	/* load input byte(s), either one or four */
-
-	m[0] = msk & YMM_PTR_MSK;
-	m[1] = msk >> YMM_PTR_NUM;
-
 	if (bnum == sizeof(uint8_t)) {
 		inp[0] = _m256_mask_gather_epi8x4(p[0], m[0]);
 		inp[1] = _m256_mask_gather_epi8x4(p[1], m[1]);
 	} else {
 		inp[0] = _mm256_mmask_i64gather_epi32(
-				_mm256_castsi256_si128(zero), m[0], p[0],
-				NULL, sizeof(uint8_t));
+				_mm256_castsi256_si128(zero),
+				m[0], p[0], NULL, sizeof(uint8_t));
 		inp[1] = _mm256_mmask_i64gather_epi32(
-				_mm256_castsi256_si128(zero), m[1], p[1],
-				NULL, sizeof(uint8_t));
+				_mm256_castsi256_si128(zero),
+				m[1], p[1], NULL, sizeof(uint8_t));
 	}
 
-	/* squeeze input into one 512-bit register */
+	/* squeeze input into one 256-bit register */
 	return _mm256_permutex2var_epi32(_mm256_castsi128_si256(inp[0]),
-			ymm_pminp.y,  _mm256_castsi128_si256(inp[1]));
-}
-
-/*
- * Start up to 16 new flows.
- * num - number of flows to start
- * msk - mask of new flows.
- * pdata - pointers to flow input data
- * idx - match indexed for given flows
- * di - data indexes for these flows.
- */
-static inline void
-start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
-	__m256i pdata[2], __m256i *idx, __m256i *di)
-{
-	uint32_t n, m[2], nm[2];
-	__m256i ni, nd[2];
-
-	m[0] = msk & YMM_PTR_MSK;
-	m[1] = msk >> YMM_PTR_NUM;
-
-	n = __builtin_popcount(m[0]);
-	nm[0] = (1 << n) - 1;
-	nm[1] = (1 << (num - n)) - 1;
-
-	/* load input data pointers for new flows */
-	nd[0] = _mm256_maskz_loadu_epi64(nm[0],
-		flow->idata + flow->num_packets);
-	nd[1] = _mm256_maskz_loadu_epi64(nm[1],
-		flow->idata + flow->num_packets + n);
-
-	/* calculate match indexes of new flows */
-	ni = _mm256_set1_epi32(flow->num_packets);
-	ni = _mm256_add_epi32(ni, ymm_idx_add.y);
-
-	/* merge new and existing flows data */
-	pdata[0] = _mm256_mask_expand_epi64(pdata[0], m[0], nd[0]);
-	pdata[1] = _mm256_mask_expand_epi64(pdata[1], m[1], nd[1]);
-
-	/* update match and data indexes */
-	*idx = _mm256_mask_expand_epi32(*idx, msk, ni);
-	*di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di);
-
-	flow->num_packets += num;
-}
-
-/*
- * Process found matches for up to 16 flows.
- * fmsk - mask of active flows
- * rmsk - mask of found matches
- * pdata - pointers to flow input data
- * di - data indexes for these flows
- * idx - match indexed for given flows
- * tr_lo contains low 32 bits for up to 8 transitions.
- * tr_hi contains high 32 bits for up to 8 transitions.
- */
-static inline uint32_t
-match_process_avx512x8(struct acl_flow_avx512 *flow, uint32_t *fmsk,
-	uint32_t *rmsk, __m256i pdata[2], __m256i *di, __m256i *idx,
-	__m256i *tr_lo, __m256i *tr_hi)
-{
-	uint32_t n;
-	__m256i res;
-
-	if (rmsk[0] == 0)
-		return 0;
-
-	/* extract match indexes */
-	res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y);
-
-	/* mask  matched transitions to nop */
-	tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y);
-	tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y);
-
-	/* save found match indexes */
-	_mm256_mask_i32scatter_epi32(flow->matches, rmsk[0],
-		idx[0], res, sizeof(flow->matches[0]));
-
-	/* update masks and start new flows for matches */
-	n = update_flow_mask(flow, fmsk, rmsk);
-	start_flow8(flow, n, rmsk[0], pdata, idx, di);
-
-	return n;
-}
-
-/*
- * Test for matches ut to 32 (2x16) flows at once,
- * if matches exist - process them and start new flows.
- */
-static inline void
-match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
-	__m256i pdata[4], __m256i di[2], __m256i idx[2], __m256i inp[2],
-	__m256i tr_lo[2], __m256i tr_hi[2])
-{
-	uint32_t n[2];
-	uint32_t rm[2];
-
-	/* check for matches */
-	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
-	rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y);
-
-	/* till unprocessed matches exist */
-	while ((rm[0] | rm[1]) != 0) {
-
-		/* process matches and start new flows */
-		n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0],
-			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
-		n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[2],
-			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
-
-		/* execute first transition for new flows, if any */
-
-		if (n[0] != 0) {
-			inp[0] = get_next_bytes_avx512x8(flow, &pdata[0],
-				rm[0], &di[0], flow->first_load_sz);
-			first_trans8(flow, inp[0], rm[0], &tr_lo[0],
-				&tr_hi[0]);
-			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
-				ymm_match_mask.y);
-		}
-
-		if (n[1] != 0) {
-			inp[1] = get_next_bytes_avx512x8(flow, &pdata[2],
-				rm[1], &di[1], flow->first_load_sz);
-			first_trans8(flow, inp[1], rm[1], &tr_lo[1],
-				&tr_hi[1]);
-			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
-				ymm_match_mask.y);
-		}
-	}
-}
-
-/*
- * Perform search for up to 32 flows in parallel.
- * Use two sets of metadata, each serves 16 flows max.
- * So in fact we perform search for 2x16 flows.
- */
-static inline void
-search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
-{
-	uint32_t fm[2];
-	__m256i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
-
-	/* first 1B load */
-	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]);
-	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]);
-
-	in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0],
-			flow->first_load_sz);
-	in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1],
-			flow->first_load_sz);
-
-	first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
-	first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
-
-	fm[0] = UINT8_MAX;
-	fm[1] = UINT8_MAX;
-
-	/* match check */
-	match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in,
-		tr_lo, tr_hi);
-
-	while ((fm[0] | fm[1]) != 0) {
-
-		/* load next 4B */
-
-		in[0] = get_next_bytes_avx512x8(flow, &pdata[0], fm[0],
-			&di[0], sizeof(uint32_t));
-		in[1] = get_next_bytes_avx512x8(flow, &pdata[2], fm[1],
-			&di[1], sizeof(uint32_t));
-
-		/* main 4B loop */
-
-		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		/* check for matches */
-		match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in,
-			tr_lo, tr_hi);
-	}
-}
-
-/*
- * resolve match index to actual result/priority offset.
- */
-static inline __m256i
-resolve_match_idx_avx512x8(__m256i mi)
-{
-	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
-		1 << (match_log + 2));
-	return _mm256_slli_epi32(mi, match_log);
+			_SV_(pminp), _mm256_castsi128_si256(inp[1]));
 }
 
-/*
- * Resolve multiple matches for the same flow based on priority.
- */
-static inline __m256i
-resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
-	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
-	uint32_t nb_skip)
-{
-	uint32_t i;
-	const uint32_t *pm;
-	__mmask16 m;
-	__m256i cp, cr, np, nr, mch;
-
-	const __m256i zero = _mm256_set1_epi32(0);
-
-	/* get match indexes */
-	mch = _mm256_maskz_loadu_epi32(msk, match);
-	mch = resolve_match_idx_avx512x8(mch);
-
-	/* read result and priority values for first trie */
-	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
-	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
-
-	/*
-	 * read result and priority values for next tries and select one
-	 * with highest priority.
-	 */
-	for (i = 1, pm = match + nb_skip; i != nb_trie;
-			i++, pm += nb_skip) {
-
-		mch = _mm256_maskz_loadu_epi32(msk, pm);
-		mch = resolve_match_idx_avx512x8(mch);
-
-		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
-			sizeof(res[0]));
-		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
-			sizeof(pri[0]));
-
-		m = _mm256_cmpgt_epi32_mask(cp, np);
-		cr = _mm256_mask_mov_epi32(nr, m, cr);
-		cp = _mm256_mask_mov_epi32(np, m, cp);
-	}
-
-	return cr;
-}
-
-/*
- * Resolve num (<= 8) matches for single category
- */
-static inline void
-resolve_sc_avx512x8(uint32_t result[], const int32_t res[],
-	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
-	uint32_t nb_trie, uint32_t nb_skip)
-{
-	__mmask8 msk;
-	__m256i cr;
-
-	msk = (1 << nb_pkt) - 1;
-	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
-	_mm256_mask_storeu_epi32(result, msk, cr);
-}
+#include "acl_run_avx512_common.h"
 
 /*
- * Resolve matches for single category
+ * Perform search for up to (2 * 8) flows in parallel.
+ * Use two sets of metadata, each serving 8 flows max.
  */
-static inline void
-resolve_sc_avx512x8x2(uint32_t result[],
-	const struct rte_acl_match_results pr[], const uint32_t match[],
-	uint32_t nb_pkt, uint32_t nb_trie)
-{
-	uint32_t j, k, n;
-	const int32_t *res, *pri;
-	__m256i cr[2];
-
-	res = (const int32_t *)pr->results;
-	pri = pr->priority;
-
-	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
-
-		j = k + MASK8_BIT;
-
-		cr[0] = resolve_pri_avx512x8(res, pri, match + k, UINT8_MAX,
-				nb_trie, nb_pkt);
-		cr[1] = resolve_pri_avx512x8(res, pri, match + j, UINT8_MAX,
-				nb_trie, nb_pkt);
-
-		_mm256_storeu_si256((void *)(result + k), cr[0]);
-		_mm256_storeu_si256((void *)(result + j), cr[1]);
-	}
-
-	n = nb_pkt - k;
-	if (n != 0) {
-		if (n > MASK8_BIT) {
-			resolve_sc_avx512x8(result + k, res, pri, match + k,
-				MASK8_BIT, nb_trie, nb_pkt);
-			k += MASK8_BIT;
-			n -= MASK8_BIT;
-		}
-		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
-				nb_trie, nb_pkt);
-	}
-}
-
 static inline int
 search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t total_packets, uint32_t categories)
@@ -624,7 +219,7 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
 
 		/* process the trie */
-		search_trie_avx512x8x2(&flow);
+		_F_(search_trie)(&flow);
 	}
 
 	/* resolve matches */
@@ -632,7 +227,7 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		(ctx->trans_table + ctx->match_index);
 
 	if (categories == 1)
-		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+		_F_(resolve_single_cat)(results, pr, match, total_packets,
 			ctx->num_tries);
 	else
 		resolve_mcle8_avx512x1(results, pr, match, total_packets,
@@ -640,3 +235,19 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	return 0;
 }
+
+#undef _SIMD_PTR_MSK_
+#undef _SIMD_PTR_NUM_
+#undef _SIMD_FLOW_MSK_
+#undef _SIMD_FLOW_NUM_
+#undef _SIMD_MASK_MAX_
+#undef _SIMD_MASK_BIT_
+#undef _M_GI_
+#undef _M_MGI_
+#undef _M_SI_
+#undef _M_I_
+#undef _F_
+#undef _SV_
+#undef _SC_
+#undef _T_mask
+#undef _T_simd
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 13/14] test/acl: add AVX512 classify support
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (11 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 12/14] acl: deduplicate AVX512 code paths Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 14/14] app/acl: " Konstantin Ananyev
                       ` (2 subsequent siblings)
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add AVX512 classify to the test coverage.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test/test_acl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 333b347579..5b32347954 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -278,8 +278,8 @@ test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 
 	/* set given classify alg, skip test if alg is not supported */
 	ret = rte_acl_set_ctx_classify(acx, alg);
-	if (ret == -ENOTSUP)
-		return 0;
+	if (ret != 0)
+		return (ret == -ENOTSUP) ? 0 : ret;
 
 	/**
 	 * these will run quite a few times, it's necessary to test code paths
@@ -341,6 +341,8 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 		RTE_ACL_CLASSIFY_AVX2,
 		RTE_ACL_CLASSIFY_NEON,
 		RTE_ACL_CLASSIFY_ALTIVEC,
+		RTE_ACL_CLASSIFY_AVX512X16,
+		RTE_ACL_CLASSIFY_AVX512X32,
 	};
 
 	/* swap all bytes in the data to network order */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v3 14/14] app/acl: add AVX512 classify support
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (12 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 13/14] test/acl: add AVX512 classify support Konstantin Ananyev
@ 2020-10-05 18:45     ` Konstantin Ananyev
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
  2020-10-06 15:05     ` [dpdk-dev] [PATCH v3 " David Marchand
  15 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-05 18:45 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add ability to use AVX512 classify method.
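
As an illustration only (assuming the app's existing "alg" command-line
option; the exact invocation is a usage hint, not part of this change),
the new methods can then be requested with --alg=avx512x16 or
--alg=avx512x32.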

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d9b65517cb..2a3a35a054 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -81,6 +81,14 @@ static const struct acl_alg acl_alg[] = {
 		.name = "altivec",
 		.alg = RTE_ACL_CLASSIFY_ALTIVEC,
 	},
+	{
+		.name = "avx512x16",
+		.alg = RTE_ACL_CLASSIFY_AVX512X16,
+	},
+	{
+		.name = "avx512x32",
+		.alg = RTE_ACL_CLASSIFY_AVX512X32,
+	},
 };
 
 static struct {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (13 preceding siblings ...)
  2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 14/14] app/acl: " Konstantin Ananyev
@ 2020-10-06 15:03     ` Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 01/14] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
                         ` (14 more replies)
  2020-10-06 15:05     ` [dpdk-dev] [PATCH v3 " David Marchand
  15 siblings, 15 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

This patch series introduces support for an AVX512-specific classify
implementation in the ACL library.
It adds two new algorithms:
 - RTE_ACL_CLASSIFY_AVX512X16 - can process up to 16 flows in parallel.
   It uses 256-bit width instructions/registers only
   (to avoid frequency level change).
   On my SKX box test-acl shows ~15-30% improvement
   (depending on rule-set and input burst size)
   when switching from AVX2 to AVX512X16 classify algorithms.
 - RTE_ACL_CLASSIFY_AVX512X32 - can process up to 32 flows in parallel.
   It uses 512-bit width instructions/registers and provides higher
   performance than AVX512X16, but can cause a frequency level change.
   On my SKX box test-acl shows ~50-70% improvement
   (depending on rule-set and input burst size)
   when switching from AVX2 to AVX512X32 classify algorithms.
   ICX and CLX testing showed a similar level of speedup.

Current AVX512 classify implementation is only supported on x86_64.
Note that this series introduces a formal ABI incompatibility
with previous versions of the ACL library.
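
As a usage illustration (a minimal sketch only, assuming an already built
ACL context and prepared data/results/num/categories arguments; variable
names are hypothetical), the new methods are selected through the existing
rte_acl_set_ctx_classify() API:

	/* prefer the 512-bit wide code path, fall back to AVX2 if unsupported */
	ret = rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX512X32);
	if (ret == -ENOTSUP)
		ret = rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX2);
	if (ret == 0)
		ret = rte_acl_classify(ctx, data, results, num, categories);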

Depends-on: patch-79310 ("eal/x86: introduce AVX 512-bit type")

v3 -> v4
  Fix problems with meson 0.47
  Updates to conform latest changes in the mainline
  (removal of RTE_MACHINE_CPUFLAG_*)
  Fix checkpatch warnings

v2 -> v3:
  Fix checkpatch warnings
  Split AVX512 algorithm into two and deduplicate common code
v1 -> v2:
  Deduplicated 8/16 code paths as much as possible
  Updated default algorithm selection
    Removed library constructor to make it easier integrate with
    https://patches.dpdk.org/project/dpdk/list/?series=11831
  Updated docs


Konstantin Ananyev (14):
  acl: fix x86 build when compiler doesn't support AVX2
  doc: fix missing classify methods in ACL guide
  acl: remove of unused enum value
  acl: remove library constructor
  app/acl: few small improvements
  test/acl: expand classify test coverage
  acl: add infrastructure to support AVX512 classify
  acl: introduce 256-bit width AVX512 classify implementation
  acl: update default classify algorithm selection
  acl: introduce 512-bit width AVX512 classify implementation
  acl: for AVX512 classify use 4B load whenever possible
  acl: deduplicate AVX512 code paths
  test/acl: add AVX512 classify support
  app/acl: add AVX512 classify support

 app/test-acl/main.c                           |  23 +-
 app/test/test_acl.c                           | 105 ++--
 config/x86/meson.build                        |   3 +-
 .../prog_guide/packet_classif_access_ctrl.rst |  20 +
 doc/guides/rel_notes/deprecation.rst          |   4 -
 doc/guides/rel_notes/release_20_11.rst        |  12 +
 lib/librte_acl/acl.h                          |  16 +
 lib/librte_acl/acl_bld.c                      |  34 ++
 lib/librte_acl/acl_gen.c                      |   2 +-
 lib/librte_acl/acl_run_avx512.c               | 164 ++++++
 lib/librte_acl/acl_run_avx512_common.h        | 477 ++++++++++++++++++
 lib/librte_acl/acl_run_avx512x16.h            | 341 +++++++++++++
 lib/librte_acl/acl_run_avx512x8.h             | 253 ++++++++++
 lib/librte_acl/meson.build                    |  48 ++
 lib/librte_acl/rte_acl.c                      | 212 ++++++--
 lib/librte_acl/rte_acl.h                      |   4 +-
 16 files changed, 1618 insertions(+), 100 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c
 create mode 100644 lib/librte_acl/acl_run_avx512_common.h
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 01/14] acl: fix x86 build when compiler doesn't support AVX2
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-08 13:42         ` [dpdk-dev] [dpdk-stable] " David Marchand
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 02/14] doc: fix missing classify methods in ACL guide Konstantin Ananyev
                         ` (13 subsequent siblings)
  14 siblings, 1 reply; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Right now we define a dummy version of rte_acl_classify_avx2()
when both X86 and AVX2 are not detected, though it should be
defined for the non-AVX2 case only.

Fixes: e53ce4e41379 ("acl: remove use of weak functions")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 777ec4d340..715b023592 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,7 +16,6 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
-#ifndef RTE_ARCH_X86
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -33,6 +32,7 @@ rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
 }
 #endif
 
+#ifndef RTE_ARCH_X86
 int
 rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
 	__rte_unused const uint8_t **data,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 02/14] doc: fix missing classify methods in ACL guide
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 01/14] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-08 13:42         ` David Marchand
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 03/14] acl: remove of unused enum value Konstantin Ananyev
                         ` (12 subsequent siblings)
  14 siblings, 1 reply; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Add brief description for missing ACL classify algorithms:
RTE_ACL_CLASSIFY_NEON and RTE_ACL_CLASSIFY_ALTIVEC.

Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8")
Fixes: 1d73135f9f1c ("acl: add AltiVec for ppc64")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 doc/guides/prog_guide/packet_classif_access_ctrl.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index 0345512b9e..daf03e6d7a 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -373,6 +373,12 @@ There are several implementations of classify algorithm:
 
 *   **RTE_ACL_CLASSIFY_AVX2**: vector implementation, can process up to 16 flows in parallel. Requires AVX2 support.
 
+*   **RTE_ACL_CLASSIFY_NEON**: vector implementation, can process up to 8 flows
+    in parallel. Requires NEON support.
+
+*   **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8
+    flows in parallel. Requires ALTIVEC support.
+
 It is purely a runtime decision which method to choose, there is no build-time difference.
 All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation.
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 03/14] acl: remove of unused enum value
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 01/14] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 02/14] doc: fix missing classify methods in ACL guide Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 04/14] acl: remove library constructor Konstantin Ananyev
                         ` (11 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Removal of the unused enum value (RTE_ACL_CLASSIFY_NUM).
This enum value is not used inside DPDK, yet it prevents adding
new classify algorithms without causing an ABI breakage.

Note that this change introduces a formal ABI incompatibility
with previous versions of the ACL library.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 doc/guides/rel_notes/deprecation.rst   | 4 ----
 doc/guides/rel_notes/release_20_11.rst | 4 ++++
 lib/librte_acl/rte_acl.h               | 1 -
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 8080a28896..938e967c8f 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -209,10 +209,6 @@ Deprecation Notices
   - https://patches.dpdk.org/patch/71457/
   - https://patches.dpdk.org/patch/71456/
 
-* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value will be removed.
-  This enum value is not used inside DPDK, while it prevents to add new
-  classify algorithms without causing an ABI breakage.
-
 * sched: To allow more traffic classes, flexible mapping of pipe queues to
   traffic classes, and subport level configuration of pipes and queues
   changes will be made to macros, data structures and API functions defined
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 6d8c24413d..e0de60c0c2 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -210,6 +210,10 @@ API Changes
 
 * bpf: ``RTE_BPF_XTYPE_NUM`` has been dropped from ``rte_bpf_xtype``.
 
+* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value has been removed.
+  This enum value was not used inside DPDK, while it prevented to add new
+  classify algorithms without causing an ABI breakage.
+
 
 ABI Changes
 -----------
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index aa22e70c6e..b814423a63 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,7 +241,6 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
-	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
 };
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 04/14] acl: remove library constructor
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (2 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 03/14] acl: remove of unused enum value Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 05/14] app/acl: few small improvements Konstantin Ananyev
                         ` (10 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Right now the ACL library determines the best possible (default) classify
method on a given platform with a special constructor function rte_acl_init().
This patch makes the following changes:
 - Move selection of the default classify method into a separate private
   function and call it for each ACL context creation (rte_acl_create()).
 - Remove the library constructor function.
 - Make rte_acl_set_ctx_classify() check that the requested algorithm
   is supported on the given platform.

The purpose of these changes is to improve and simplify the algorithm
selection process and to prepare the ACL library for integration with the
"add max SIMD bitwidth to EAL" patch-set
(https://patches.dpdk.org/project/dpdk/list/?series=11831)
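
For illustration only (not part of the patch), a minimal sketch of how an
application could rely on the new behaviour; the context parameters below
are made-up example values, and <rte_acl.h> plus <errno.h> are assumed to
be included:

    struct rte_acl_param prm = {
        .name = "example-acl",            /* hypothetical example values */
        .socket_id = SOCKET_ID_ANY,
        .rule_size = RTE_ACL_RULE_SZ(5),
        .max_rule_num = 1024,
    };
    struct rte_acl_ctx *ctx = rte_acl_create(&prm);

    /* the per-context default alg is now picked at create time;
     * explicitly requesting an unsupported method fails cleanly. */
    if (ctx != NULL &&
            rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX2) == -ENOTSUP)
        rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_DEFAULT);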

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 166 ++++++++++++++++++++++++++++++---------
 lib/librte_acl/rte_acl.h |   1 +
 2 files changed, 132 insertions(+), 35 deletions(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 715b023592..863549a38b 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -79,57 +79,153 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
 };
 
-/* by default, use always available scalar code path. */
-static enum rte_acl_classify_alg rte_acl_default_classify =
-	RTE_ACL_CLASSIFY_SCALAR;
+/*
+ * Helper function for acl_check_alg.
+ * Check support for ARM specific classify methods.
+ */
+static int
+acl_check_alg_arm(enum rte_acl_classify_alg alg)
+{
+	if (alg == RTE_ACL_CLASSIFY_NEON) {
+#if defined(RTE_ARCH_ARM64)
+		return 0;
+#elif defined(RTE_ARCH_ARM)
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON))
+			return 0;
+		return -ENOTSUP;
+#else
+		return -ENOTSUP;
+#endif
+	}
+
+	return -EINVAL;
+}
 
-static void
-rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
+/*
+ * Helper function for acl_check_alg.
+ * Check support for PPC specific classify methods.
+ */
+static int
+acl_check_alg_ppc(enum rte_acl_classify_alg alg)
 {
-	rte_acl_default_classify = alg;
+	if (alg == RTE_ACL_CLASSIFY_ALTIVEC) {
+#if defined(RTE_ARCH_PPC_64)
+		return 0;
+#else
+		return -ENOTSUP;
+#endif
+	}
+
+	return -EINVAL;
 }
 
-extern int
-rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+/*
+ * Helper function for acl_check_alg.
+ * Check support for x86 specific classify methods.
+ */
+static int
+acl_check_alg_x86(enum rte_acl_classify_alg alg)
 {
-	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
-		return -EINVAL;
+	if (alg == RTE_ACL_CLASSIFY_AVX2) {
+#ifdef CC_AVX2_SUPPORT
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
 
-	ctx->alg = alg;
-	return 0;
+	if (alg == RTE_ACL_CLASSIFY_SSE) {
+#ifdef RTE_ARCH_X86
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
+
+	return -EINVAL;
 }
 
 /*
- * Select highest available classify method as default one.
- * Note that CLASSIFY_AVX2 should be set as a default only
- * if both conditions are met:
- * at build time compiler supports AVX2 and target cpu supports AVX2.
+ * Check if input alg is supported by given platform/binary.
+ * Note that both conditions should be met:
+ * - at build time compiler supports ISA used by given methods
+ * - at run time target cpu supports necessary ISA.
  */
-RTE_INIT(rte_acl_init)
+static int
+acl_check_alg(enum rte_acl_classify_alg alg)
 {
-	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
+	switch (alg) {
+	case RTE_ACL_CLASSIFY_NEON:
+		return acl_check_alg_arm(alg);
+	case RTE_ACL_CLASSIFY_ALTIVEC:
+		return acl_check_alg_ppc(alg);
+	case RTE_ACL_CLASSIFY_AVX2:
+	case RTE_ACL_CLASSIFY_SSE:
+		return acl_check_alg_x86(alg);
+	/* scalar method is supported on all platforms */
+	case RTE_ACL_CLASSIFY_SCALAR:
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
 
-#if defined(RTE_ARCH_ARM64)
-	alg =  RTE_ACL_CLASSIFY_NEON;
-#elif defined(RTE_ARCH_ARM)
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON))
-		alg =  RTE_ACL_CLASSIFY_NEON;
+/*
+ * Get preferred alg for given platform.
+ */
+static enum rte_acl_classify_alg
+acl_get_best_alg(void)
+{
+	/*
+	 * array of supported methods for each platform.
+	 * Note that order is important - from most to less preferable.
+	 */
+	static const enum rte_acl_classify_alg alg[] = {
+#if defined(RTE_ARCH_ARM)
+		RTE_ACL_CLASSIFY_NEON,
 #elif defined(RTE_ARCH_PPC_64)
-	alg = RTE_ACL_CLASSIFY_ALTIVEC;
-#else
-#ifdef CC_AVX2_SUPPORT
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
-		alg = RTE_ACL_CLASSIFY_AVX2;
-	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
-#else
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		RTE_ACL_CLASSIFY_ALTIVEC,
+#elif defined(RTE_ARCH_X86)
+		RTE_ACL_CLASSIFY_AVX2,
+		RTE_ACL_CLASSIFY_SSE,
 #endif
-		alg = RTE_ACL_CLASSIFY_SSE;
+		RTE_ACL_CLASSIFY_SCALAR,
+	};
 
-#endif
-	rte_acl_set_default_classify(alg);
+	uint32_t i;
+
+	/* find best possible alg */
+	for (i = 0; i != RTE_DIM(alg) && acl_check_alg(alg[i]) != 0; i++)
+		;
+
+	/* we always have to find something suitable */
+	RTE_VERIFY(i != RTE_DIM(alg));
+	return alg[i];
+}
+
+extern int
+rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+{
+	int32_t rc;
+
+	/* formal parameters check */
+	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
+		return -EINVAL;
+
+	/* user asked us to select the *best* one */
+	if (alg == RTE_ACL_CLASSIFY_DEFAULT)
+		alg = acl_get_best_alg();
+
+	/* check that given alg is supported */
+	rc = acl_check_alg(alg);
+	if (rc != 0)
+		return rc;
+
+	ctx->alg = alg;
+	return 0;
 }
 
+
 int
 rte_acl_classify_alg(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories,
@@ -262,7 +358,7 @@ rte_acl_create(const struct rte_acl_param *param)
 		ctx->max_rules = param->max_rule_num;
 		ctx->rule_sz = param->rule_size;
 		ctx->socket_id = param->socket_id;
-		ctx->alg = rte_acl_default_classify;
+		ctx->alg = acl_get_best_alg();
 		strlcpy(ctx->name, param->name, sizeof(ctx->name));
 
 		te->data = (void *) ctx;
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index b814423a63..3999f15ded 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -329,6 +329,7 @@ rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
  *   existing algorithm, and that it could be run on the given CPU.
  * @return
  *   - -EINVAL if the parameters are invalid.
+ *   - -ENOTSUP requested algorithm is not supported by given platform.
  *   - Zero if operation completed successfully.
  */
 extern int
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 05/14] app/acl: few small improvements
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (3 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 04/14] acl: remove library constructor Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 06/14] test/acl: expand classify test coverage Konstantin Ananyev
                         ` (9 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

- enhance output to print extra stats
- use rte_rdtsc_precise() for cycle measurements
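
For reference, a rough sketch (assuming <rte_cycles.h>; 'pkt' stands for the
number of packets processed during the measured interval) of how the extra
stats are derived:

    uint64_t start = rte_rdtsc_precise();
    /* ... run the classify iterations, accumulating 'pkt' ... */
    uint64_t tm = rte_rdtsc_precise() - start;

    long double sec = (long double)tm / rte_get_timer_hz();
    long double cyc_per_pkt = (pkt == 0) ? 0 : (long double)tm / pkt;
    long double pkt_per_sec = (sec == 0) ? 0 : pkt / sec;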

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index 0a5dfb621d..d9b65517cb 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -862,9 +862,10 @@ search_ip5tuples(__rte_unused void *arg)
 {
 	uint64_t pkt, start, tm;
 	uint32_t i, lcore;
+	long double st;
 
 	lcore = rte_lcore_id();
-	start = rte_rdtsc();
+	start = rte_rdtsc_precise();
 	pkt = 0;
 
 	for (i = 0; i != config.iter_num; i++) {
@@ -872,12 +873,16 @@ search_ip5tuples(__rte_unused void *arg)
 			config.trace_step, config.alg.name);
 	}
 
-	tm = rte_rdtsc() - start;
+	tm = rte_rdtsc_precise() - start;
+
+	st = (long double)tm / rte_get_timer_hz();
 	dump_verbose(DUMP_NONE, stdout,
 		"%s  @lcore %u: %" PRIu32 " iterations, %" PRIu64 " pkts, %"
-		PRIu32 " categories, %" PRIu64 " cycles, %#Lf cycles/pkt\n",
-		__func__, lcore, i, pkt, config.run_categories,
-		tm, (pkt == 0) ? 0 : (long double)tm / pkt);
+		PRIu32 " categories, %" PRIu64 " cycles (%.2Lf sec), "
+		"%.2Lf cycles/pkt, %.2Lf pkt/sec\n",
+		__func__, lcore, i, pkt,
+		config.run_categories, tm, st,
+		(pkt == 0) ? 0 : (long double)tm / pkt, pkt / st);
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 06/14] test/acl: expand classify test coverage
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (4 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 05/14] app/acl: few small improvements Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 07/14] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
                         ` (8 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Make the classify test run for all supported classify methods.
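
Roughly, the rework boils down to the skip-on-unsupported pattern sketched
below (variable names are illustrative, error handling trimmed):

    /* iterate over all classify methods, quietly skipping the ones
     * the current platform/build cannot run */
    for (i = 0; i != RTE_DIM(alg); i++) {
        ret = rte_acl_set_ctx_classify(acx, alg[i]);
        if (ret == -ENOTSUP)
            continue;
        /* ... run rte_acl_classify() and verify the results ... */
    }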

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test/test_acl.c | 103 ++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 52 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 316bf4d065..333b347579 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -266,22 +266,20 @@ rte_acl_ipv4vlan_build(struct rte_acl_ctx *ctx,
 }
 
 /*
- * Test scalar and SSE ACL lookup.
+ * Test ACL lookup (selected alg).
  */
 static int
-test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
-	size_t dim)
+test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
+	const uint8_t *data[], size_t dim, enum rte_acl_classify_alg alg)
 {
-	int ret, i;
-	uint32_t result, count;
+	int32_t ret;
+	uint32_t i, result, count;
 	uint32_t results[dim * RTE_ACL_MAX_CATEGORIES];
-	const uint8_t *data[dim];
-	/* swap all bytes in the data to network order */
-	bswap_test_data(test_data, dim, 1);
 
-	/* store pointers to test data */
-	for (i = 0; i < (int) dim; i++)
-		data[i] = (uint8_t *)&test_data[i];
+	/* set given classify alg, skip test if alg is not supported */
+	ret = rte_acl_set_ctx_classify(acx, alg);
+	if (ret == -ENOTSUP)
+		return 0;
 
 	/**
 	 * these will run quite a few times, it's necessary to test code paths
@@ -291,12 +289,13 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 		ret = rte_acl_classify(acx, data, results,
 				count, RTE_ACL_MAX_CATEGORIES);
 		if (ret != 0) {
-			printf("Line %i: SSE classify failed!\n", __LINE__);
-			goto err;
+			printf("Line %i: classify(alg=%d) failed!\n",
+				__LINE__, alg);
+			return ret;
 		}
 
 		/* check if we allow everything we should allow */
-		for (i = 0; i < (int) count; i++) {
+		for (i = 0; i < count; i++) {
 			result =
 				results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW];
 			if (result != test_data[i].allow) {
@@ -304,63 +303,63 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, test_data[i].allow,
 					result);
-				ret = -EINVAL;
-				goto err;
+				return -EINVAL;
 			}
 		}
 
 		/* check if we deny everything we should deny */
-		for (i = 0; i < (int) count; i++) {
+		for (i = 0; i < count; i++) {
 			result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY];
 			if (result != test_data[i].deny) {
 				printf("Line %i: Error in deny results at %i "
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, test_data[i].deny,
 					result);
-				ret = -EINVAL;
-				goto err;
+				return -EINVAL;
 			}
 		}
 	}
 
-	/* make a quick check for scalar */
-	ret = rte_acl_classify_alg(acx, data, results,
-			dim, RTE_ACL_MAX_CATEGORIES,
-			RTE_ACL_CLASSIFY_SCALAR);
-	if (ret != 0) {
-		printf("Line %i: scalar classify failed!\n", __LINE__);
-		goto err;
-	}
+	/* restore default classify alg */
+	return rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_DEFAULT);
+}
 
-	/* check if we allow everything we should allow */
-	for (i = 0; i < (int) dim; i++) {
-		result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW];
-		if (result != test_data[i].allow) {
-			printf("Line %i: Error in allow results at %i "
-					"(expected %"PRIu32" got %"PRIu32")!\n",
-					__LINE__, i, test_data[i].allow,
-					result);
-			ret = -EINVAL;
-			goto err;
-		}
-	}
+/*
+ * Test ACL lookup (all possible methods).
+ */
+static int
+test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
+	size_t dim)
+{
+	int32_t ret;
+	uint32_t i;
+	const uint8_t *data[dim];
 
-	/* check if we deny everything we should deny */
-	for (i = 0; i < (int) dim; i++) {
-		result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY];
-		if (result != test_data[i].deny) {
-			printf("Line %i: Error in deny results at %i "
-					"(expected %"PRIu32" got %"PRIu32")!\n",
-					__LINE__, i, test_data[i].deny,
-					result);
-			ret = -EINVAL;
-			goto err;
-		}
-	}
+	static const enum rte_acl_classify_alg alg[] = {
+		RTE_ACL_CLASSIFY_SCALAR,
+		RTE_ACL_CLASSIFY_SSE,
+		RTE_ACL_CLASSIFY_AVX2,
+		RTE_ACL_CLASSIFY_NEON,
+		RTE_ACL_CLASSIFY_ALTIVEC,
+	};
+
+	/* swap all bytes in the data to network order */
+	bswap_test_data(test_data, dim, 1);
+
+	/* store pointers to test data */
+	for (i = 0; i < dim; i++)
+		data[i] = (uint8_t *)&test_data[i];
 
 	ret = 0;
+	for (i = 0; i != RTE_DIM(alg); i++) {
+		ret = test_classify_alg(acx, test_data, data, dim, alg[i]);
+		if (ret < 0) {
+			printf("Line %i: %s() for alg=%d failed, errno=%d\n",
+				__LINE__, __func__, alg[i], -ret);
+			break;
+		}
+	}
 
-err:
 	/* swap data back to cpu order so that next time tests don't fail */
 	bswap_test_data(test_data, dim, 0);
 	return ret;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 07/14] acl: add infrastructure to support AVX512 classify
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (5 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 06/14] test/acl: expand classify test coverage Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-13 19:17         ` David Marchand
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 08/14] acl: introduce 256-bit width AVX512 classify implementation Konstantin Ananyev
                         ` (7 subsequent siblings)
  14 siblings, 1 reply; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add the necessary changes to support the new AVX512-specific ACL classify
algorithm:
 - changes in meson.build to check that the build tools
   (compiler, assembler, etc.) properly support AVX512.
 - run-time checks to make sure the target platform does support AVX512.
 - dummy rte_acl_classify_avx512() for targets where the AVX512
   implementation cannot be properly supported.
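
As a hedged illustration (not part of the patch), the kind of run-time probe
an application could do before forcing an AVX512 method; it uses only the
CPU-flag checks that the library itself performs:

    int avx512_ok =
        rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
        rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) &&
        rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) &&
        rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW);

    /* even with avx512_ok set, rte_acl_set_ctx_classify() can still
     * return -ENOTSUP if the library was built without AVX512 support
     * (i.e. CC_AVX512_SUPPORT was not defined at build time). */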

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 config/x86/meson.build          |  3 ++-
 lib/librte_acl/acl.h            |  8 ++++++
 lib/librte_acl/acl_run_avx512.c | 29 ++++++++++++++++++++
 lib/librte_acl/meson.build      | 48 +++++++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c        | 42 +++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.h        |  2 ++
 6 files changed, 131 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c

diff --git a/config/x86/meson.build b/config/x86/meson.build
index fea4d54035..724e69f4c4 100644
--- a/config/x86/meson.build
+++ b/config/x86/meson.build
@@ -22,7 +22,8 @@ foreach f:base_flags
 endforeach
 
 optional_flags = ['AES', 'PCLMUL',
-		'AVX', 'AVX2', 'AVX512F',
+		'AVX', 'AVX2',
+		'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW',
 		'RDRND', 'RDSEED']
 foreach f:optional_flags
 	if cc.get_define('__@0@__'.format(f), args: machine_args) == '1'
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 39d45a0c2b..543ce55659 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -201,6 +201,14 @@ int
 rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
+int
+rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
+int
+rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
 int
 rte_acl_classify_neon(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
new file mode 100644
index 0000000000..1817f88b29
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "acl_run_sse.h"
+
+int
+rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (num >= MAX_SEARCHES_SSE8)
+		return search_sse_8(ctx, data, results, num, categories);
+	if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+
+	return rte_acl_classify_scalar(ctx, data, results, num, categories);
+}
+
+int
+rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (num >= MAX_SEARCHES_SSE8)
+		return search_sse_8(ctx, data, results, num, categories);
+	if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+
+	return rte_acl_classify_scalar(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/meson.build b/lib/librte_acl/meson.build
index b31a3f798e..a3c7c398d0 100644
--- a/lib/librte_acl/meson.build
+++ b/lib/librte_acl/meson.build
@@ -27,6 +27,54 @@ if dpdk_conf.has('RTE_ARCH_X86')
 		cflags += '-DCC_AVX2_SUPPORT'
 	endif
 
+	# compile AVX512 version if:
+	# we are building 64-bit binary AND binutils can generate proper code
+
+	if dpdk_conf.has('RTE_ARCH_X86_64') and binutils_ok.returncode() == 0
+
+		# compile AVX512 version if either:
+		# a. we have AVX512 supported in minimum instruction set
+		#    baseline
+		# b. it's not minimum instruction set, but supported by
+		#    compiler
+		#
+		# in former case, just add avx512 C file to files list
+		# in latter case, compile c file to static lib, using correct
+		# compiler flags, and then have the .o file from static lib
+		# linked into main lib.
+
+		# check if all required flags already enabled (variant a).
+		acl_avx512_flags = ['__AVX512F__', '__AVX512VL__',
+			'__AVX512CD__', '__AVX512BW__']
+
+		acl_avx512_on = true
+		foreach f:acl_avx512_flags
+
+			if cc.get_define(f, args: machine_args) == ''
+				acl_avx512_on = false
+			endif
+		endforeach
+
+		if acl_avx512_on == true
+
+			sources += files('acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+
+		elif cc.has_multi_arguments('-mavx512f', '-mavx512vl',
+					'-mavx512cd', '-mavx512bw')
+
+			avx512_tmplib = static_library('avx512_tmp',
+				'acl_run_avx512.c',
+				dependencies: static_rte_eal,
+				c_args: cflags +
+					['-mavx512f', '-mavx512vl',
+					 '-mavx512cd', '-mavx512bw'])
+			objs += avx512_tmplib.extract_objects(
+					'acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+		endif
+	endif
+
 elif dpdk_conf.has('RTE_ARCH_ARM') or dpdk_conf.has('RTE_ARCH_ARM64')
 	cflags += '-flax-vector-conversions'
 	sources += files('acl_run_neon.c')
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 863549a38b..1154f35107 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,6 +16,32 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
+#ifndef CC_AVX512_SUPPORT
+/*
+ * If the compiler doesn't support AVX512 instructions,
+ * then the dummy one would be used instead for AVX512 classify method.
+ */
+int
+rte_acl_classify_avx512x16(__rte_unused const struct rte_acl_ctx *ctx,
+	__rte_unused const uint8_t **data,
+	__rte_unused uint32_t *results,
+	__rte_unused uint32_t num,
+	__rte_unused uint32_t categories)
+{
+	return -ENOTSUP;
+}
+
+int
+rte_acl_classify_avx512x32(__rte_unused const struct rte_acl_ctx *ctx,
+	__rte_unused const uint8_t **data,
+	__rte_unused uint32_t *results,
+	__rte_unused uint32_t num,
+	__rte_unused uint32_t categories)
+{
+	return -ENOTSUP;
+}
+#endif
+
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -77,6 +103,8 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
 	[RTE_ACL_CLASSIFY_NEON] = rte_acl_classify_neon,
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
+	[RTE_ACL_CLASSIFY_AVX512X16] = rte_acl_classify_avx512x16,
+	[RTE_ACL_CLASSIFY_AVX512X32] = rte_acl_classify_avx512x32,
 };
 
 /*
@@ -126,6 +154,18 @@ acl_check_alg_ppc(enum rte_acl_classify_alg alg)
 static int
 acl_check_alg_x86(enum rte_acl_classify_alg alg)
 {
+	if (alg == RTE_ACL_CLASSIFY_AVX512X16 ||
+			alg == RTE_ACL_CLASSIFY_AVX512X32) {
+#ifdef CC_AVX512_SUPPORT
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
+
 	if (alg == RTE_ACL_CLASSIFY_AVX2) {
 #ifdef CC_AVX2_SUPPORT
 		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
@@ -159,6 +199,8 @@ acl_check_alg(enum rte_acl_classify_alg alg)
 		return acl_check_alg_arm(alg);
 	case RTE_ACL_CLASSIFY_ALTIVEC:
 		return acl_check_alg_ppc(alg);
+	case RTE_ACL_CLASSIFY_AVX512X32:
+	case RTE_ACL_CLASSIFY_AVX512X16:
 	case RTE_ACL_CLASSIFY_AVX2:
 	case RTE_ACL_CLASSIFY_SSE:
 		return acl_check_alg_x86(alg);
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index 3999f15ded..1bfed00743 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,6 +241,8 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
+	RTE_ACL_CLASSIFY_AVX512X16 = 6,  /**< requires AVX512 support. */
+	RTE_ACL_CLASSIFY_AVX512X32 = 7,  /**< requires AVX512 support. */
 };
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 08/14] acl: introduce 256-bit width AVX512 classify implementation
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (6 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 07/14] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 09/14] acl: update default classify algorithm selection Konstantin Ananyev
                         ` (6 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Introduce a classify implementation that uses the AVX512-specific ISA.
rte_acl_classify_avx512x16() is able to process up to 16 flows in parallel.
It uses 256-bit width registers/instructions only
(to avoid a frequency level change).
Note that for now only the 64-bit version is supported.
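
A short usage sketch (assuming an already built 'ctx' and prepared
'data'/'results' arrays) of a one-off lookup with this method through the
existing rte_acl_classify_alg() API:

    /* per the ACL guide, the caller is responsible for making sure the
     * platform actually supports the explicitly requested method */
    ret = rte_acl_classify_alg(ctx, data, results, num,
            RTE_ACL_MAX_CATEGORIES, RTE_ACL_CLASSIFY_AVX512X16);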

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 .../prog_guide/packet_classif_access_ctrl.rst |   4 +
 doc/guides/rel_notes/release_20_11.rst        |   5 +
 lib/librte_acl/acl.h                          |   7 +
 lib/librte_acl/acl_gen.c                      |   2 +-
 lib/librte_acl/acl_run_avx512.c               | 129 ++++
 lib/librte_acl/acl_run_avx512x8.h             | 642 ++++++++++++++++++
 6 files changed, 788 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index daf03e6d7a..11f4bc841b 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -379,6 +379,10 @@ There are several implementations of classify algorithm:
 *   **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8
     flows in parallel. Requires ALTIVEC support.
 
+*   **RTE_ACL_CLASSIFY_AVX512X16**: vector implementation, can process up to 16
+    flows in parallel. Uses 256-bit width SIMD registers.
+    Requires AVX512 support.
+
 It is purely a runtime decision which method to choose, there is no build-time difference.
 All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation.
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index e0de60c0c2..95d7bfd777 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -107,6 +107,11 @@ New Features
   * Extern objects and functions can be plugged into the pipeline.
   * Transaction-oriented table updates.
 
+* **Add new AVX512 specific classify algorithms for ACL library.**
+
+  * Added new ``RTE_ACL_CLASSIFY_AVX512X16`` vector implementation,
+    which can process up to 16 flows in parallel. Requires AVX512 support.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 543ce55659..7ac0d12f08 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -76,6 +76,13 @@ struct rte_acl_bitset {
  * input_byte - ((uint8_t *)&transition)[4 + input_byte / 64].
  */
 
+/*
+ * Each ACL RT contains an idle nomatch node:
+ * a SINGLE node at predefined position (RTE_ACL_DFA_SIZE)
+ * that points to itself.
+ */
+#define RTE_ACL_IDLE_NODE	(RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE)
+
 /*
  * Structure of a node is a set of ptrs and each ptr has a bit map
  * of values associated with this transition.
diff --git a/lib/librte_acl/acl_gen.c b/lib/librte_acl/acl_gen.c
index f1b9d12f1e..e759a2ca15 100644
--- a/lib/librte_acl/acl_gen.c
+++ b/lib/librte_acl/acl_gen.c
@@ -496,7 +496,7 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	 * highest index, that points to itself)
 	 */
 
-	node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE;
+	node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_IDLE_NODE;
 
 	for (n = 0; n < RTE_ACL_DFA_SIZE; n++)
 		node_array[n] = no_match;
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 1817f88b29..f5bc628b7c 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -4,10 +4,126 @@
 
 #include "acl_run_sse.h"
 
+/*sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results)*/
+static const uint32_t match_log = 5;
+
+struct acl_flow_avx512 {
+	uint32_t num_packets;       /* number of packets processed */
+	uint32_t total_packets;     /* max number of packets to process */
+	uint32_t root_index;        /* current root index */
+	const uint64_t *trans;      /* transition table */
+	const uint32_t *data_index; /* input data indexes */
+	const uint8_t **idata;      /* input data */
+	uint32_t *matches;          /* match indexes */
+};
+
+static inline void
+acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
+	uint32_t trie, const uint8_t *data[], uint32_t *matches,
+	uint32_t total_packets)
+{
+	flow->num_packets = 0;
+	flow->total_packets = total_packets;
+	flow->root_index = ctx->trie[trie].root_index;
+	flow->trans = ctx->trans_table;
+	flow->data_index = ctx->trie[trie].data_index;
+	flow->idata = data;
+	flow->matches = matches;
+}
+
+/*
+ * Update flow and result masks based on the number of unprocessed flows.
+ */
+static inline uint32_t
+update_flow_mask(const struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
+/*
+ * Resolve matches for multiple categories (LE 8, use 128b instructions/regs)
+ */
+static inline void
+resolve_mcle8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, j, k, mi, mn;
+	__mmask8 msk;
+	xmm_t cp, cr, np, nr;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		for (j = 0; j != nb_cat; j += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			cr = _mm_loadu_si128((const xmm_t *)(res + mi + j));
+			cp = _mm_loadu_si128((const xmm_t *)(pri + mi + j));
+
+			for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+				mn = j + (pm[k] << match_log);
+
+				nr = _mm_loadu_si128((const xmm_t *)(res + mn));
+				np = _mm_loadu_si128((const xmm_t *)(pri + mn));
+
+				msk = _mm_cmpgt_epi32_mask(cp, np);
+				cr = _mm_mask_mov_epi32(nr, msk, cr);
+				cp = _mm_mask_mov_epi32(np, msk, cp);
+			}
+
+			_mm_storeu_si128((xmm_t *)(result + j), cr);
+		}
+	}
+}
+
+#include "acl_run_avx512x8.h"
+
 int
 rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	const uint32_t max_iter = MAX_SEARCHES_AVX16 * MAX_SEARCHES_AVX16;
+
+	/* split huge lookup (gt 256) into series of fixed size ones */
+	while (num > max_iter) {
+		search_avx512x8x2(ctx, data, results, max_iter, categories);
+		data += max_iter;
+		results += max_iter * categories;
+		num -= max_iter;
+	}
+
+	/* select classify method based on number of remaining requests */
+	if (num >= MAX_SEARCHES_AVX16)
+		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
 		return search_sse_8(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE4)
@@ -20,6 +136,19 @@ int
 rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	const uint32_t max_iter = MAX_SEARCHES_AVX16 * MAX_SEARCHES_AVX16;
+
+	/* split huge lookup (gt 256) into series of fixed size ones */
+	while (num > max_iter) {
+		search_avx512x8x2(ctx, data, results, max_iter, categories);
+		data += max_iter;
+		results += max_iter * categories;
+		num -= max_iter;
+	}
+
+	/* select classify method based on number of remaining requests */
+	if (num >= MAX_SEARCHES_AVX16)
+		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
 		return search_sse_8(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE4)
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
new file mode 100644
index 0000000000..cfba0299ed
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -0,0 +1,642 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define MASK8_BIT	(sizeof(__mmask8) * CHAR_BIT)
+
+#define NUM_AVX512X8X2	(2 * MASK8_BIT)
+#define MSK_AVX512X8X2	(NUM_AVX512X8X2 - 1)
+
+/* num/mask of pointers per SIMD regs */
+#define YMM_PTR_NUM	(sizeof(__m256i) / sizeof(uintptr_t))
+#define YMM_PTR_MSK	RTE_LEN2MASK(YMM_PTR_NUM, uint32_t)
+
+static const rte_ymm_t ymm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_ymm_t ymm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_ymm_t ymm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+	},
+};
+
+static const rte_ymm_t ymm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const rte_ymm_t ymm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const rte_ymm_t ymm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const rte_ymm_t ymm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+	},
+};
+
+static const rte_ymm_t ymm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+static const rte_ymm_t ymm_pminp = {
+	.u32 = {
+		0x00, 0x01, 0x02, 0x03,
+		0x08, 0x09, 0x0a, 0x0b,
+	},
+};
+
+static const __mmask16 ymm_pmidx_msk = 0x55;
+
+static const rte_ymm_t ymm_pmidx[2] = {
+	[0] = {
+		.u32 = {
+			0, 0, 1, 0, 2, 0, 3, 0,
+		},
+	},
+	[1] = {
+		.u32 = {
+			4, 0, 5, 0, 6, 0, 7, 0,
+		},
+	},
+};
+
+/*
+ * unfortunately current AVX512 ISA doesn't provide ability for
+ * gather load on a byte quantity. So we have to mimic it in SW,
+ * by doing 4x1B scalar loads.
+ */
+static inline __m128i
+_m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask)
+{
+	rte_xmm_t v;
+	rte_ymm_t p;
+
+	static const uint32_t zero;
+
+	p.y = _mm256_mask_set1_epi64(pdata, mask ^ YMM_PTR_MSK,
+		(uintptr_t)&zero);
+
+	v.u32[0] = *(uint8_t *)p.u64[0];
+	v.u32[1] = *(uint8_t *)p.u64[1];
+	v.u32[2] = *(uint8_t *)p.u64[2];
+	v.u32[3] = *(uint8_t *)p.u64[3];
+
+	return v.x;
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same ways as quad range nodes.
+ */
+static __rte_always_inline __m256i
+calc_addr8(__m256i index_mask, __m256i next_input, __m256i shuffle_input,
+	__m256i four_32, __m256i range_base, __m256i tr_lo, __m256i tr_hi)
+{
+	__mmask32 qm;
+	__mmask8 dfa_msk;
+	__m256i addr, in, node_type, r, t;
+	__m256i dfa_ofs, quad_ofs;
+
+	t = _mm256_xor_si256(index_mask, index_mask);
+	in = _mm256_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm256_andnot_si256(index_mask, tr_lo);
+	addr = _mm256_and_si256(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm256_cmpeq_epi32_mask(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm256_srli_epi32(in, 30);
+	r = _mm256_add_epi8(r, range_base);
+	t = _mm256_srli_epi32(in, 24);
+	r = _mm256_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm256_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _mm256_cmpgt_epi8_mask(in, tr_hi);
+	t = _mm256_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
+	t = _mm256_lzcnt_epi32(t);
+	t = _mm256_srli_epi32(t, 3);
+	quad_ofs = _mm256_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm256_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _mm256_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 16 transitions in parallel.
+ * tr_lo contains low 32 bits for 16 transition.
+ * tr_hi contains high 32 bits for 16 transition.
+ * next_input contains up to 4 input bytes for 16 flows.
+ */
+static __rte_always_inline __m256i
+transition8(__m256i next_input, const uint64_t *trans, __m256i *tr_lo,
+	__m256i *tr_hi)
+{
+	const int32_t *tr;
+	__m256i addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 8 transitions. */
+	addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
+		ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 16 transactions at once. */
+	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
+
+	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 16 transactions at once. */
+	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 16 flows in parallel.
+ * next_input should contain one input byte for up to 16 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static __rte_always_inline void
+first_trans8(const struct acl_flow_avx512 *flow, __m256i next_input,
+	__mmask8 msk, __m256i *tr_lo, __m256i *tr_hi)
+{
+	const int32_t *tr;
+	__m256i addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm256_set1_epi32(UINT8_MAX);
+	root = _mm256_set1_epi32(flow->root_index);
+
+	addr = _mm256_and_si256(next_input, addr);
+	addr = _mm256_add_epi32(root, addr);
+
+	/* load lower 32 bits of 16 transactions at once. */
+	*tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 16 transactions at once. */
+	*tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to 16 flows in parallel.
+ * pdata - 8x2 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these 16 flows.
+ */
+static inline __m256i
+get_next_bytes_avx512x8(const struct acl_flow_avx512 *flow, __m256i pdata[2],
+	uint32_t msk, __m256i *di, uint32_t bnum)
+{
+	const int32_t *div;
+	uint32_t m[2];
+	__m256i one, zero, t, p[2];
+	__m128i inp[2];
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm256_set1_epi32(1);
+	zero = _mm256_xor_si256(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm256_mmask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm256_mask_add_epi32(*di, msk, *di, one);
+
+	/*
+	 * unsigned expand 32-bit indexes to 64-bit
+	 * (for later pointer arithmetic), i.e:
+	 * for (i = 0; i != 16; i++)
+	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
+	 */
+	p[0] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[0].y, t);
+	p[1] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[1].y, t);
+
+	p[0] = _mm256_add_epi64(p[0], pdata[0]);
+	p[1] = _mm256_add_epi64(p[1], pdata[1]);
+
+	/* load input byte(s), either one or four */
+
+	m[0] = msk & YMM_PTR_MSK;
+	m[1] = msk >> YMM_PTR_NUM;
+
+	if (bnum == sizeof(uint8_t)) {
+		inp[0] = _m256_mask_gather_epi8x4(p[0], m[0]);
+		inp[1] = _m256_mask_gather_epi8x4(p[1], m[1]);
+	} else {
+		inp[0] = _mm256_mmask_i64gather_epi32(
+				_mm256_castsi256_si128(zero), m[0], p[0],
+				NULL, sizeof(uint8_t));
+		inp[1] = _mm256_mmask_i64gather_epi32(
+				_mm256_castsi256_si128(zero), m[1], p[1],
+				NULL, sizeof(uint8_t));
+	}
+
+	/* squeeze input into one 512-bit register */
+	return _mm256_permutex2var_epi32(_mm256_castsi128_si256(inp[0]),
+			ymm_pminp.y,  _mm256_castsi128_si256(inp[1]));
+}
+
+/*
+ * Start up to 16 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * idx - match indexed for given flows
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m256i pdata[2], __m256i *idx, __m256i *di)
+{
+	uint32_t n, m[2], nm[2];
+	__m256i ni, nd[2];
+
+	m[0] = msk & YMM_PTR_MSK;
+	m[1] = msk >> YMM_PTR_NUM;
+
+	n = __builtin_popcount(m[0]);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	/* load input data pointers for new flows */
+	nd[0] = _mm256_maskz_loadu_epi64(nm[0],
+		flow->idata + flow->num_packets);
+	nd[1] = _mm256_maskz_loadu_epi64(nm[1],
+		flow->idata + flow->num_packets + n);
+
+	/* calculate match indexes of new flows */
+	ni = _mm256_set1_epi32(flow->num_packets);
+	ni = _mm256_add_epi32(ni, ymm_idx_add.y);
+
+	/* merge new and existing flows data */
+	pdata[0] = _mm256_mask_expand_epi64(pdata[0], m[0], nd[0]);
+	pdata[1] = _mm256_mask_expand_epi64(pdata[1], m[1], nd[1]);
+
+	/* update match and data indexes */
+	*idx = _mm256_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Process found matches for up to 16 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexed for given flows
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static inline uint32_t
+match_process_avx512x8(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk, __m256i pdata[2], __m256i *di, __m256i *idx,
+	__m256i *tr_lo, __m256i *tr_hi)
+{
+	uint32_t n;
+	__m256i res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y);
+
+	/* mask  matched transitions to nop */
+	tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y);
+	tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y);
+
+	/* save found match indexes */
+	_mm256_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask(flow, fmsk, rmsk);
+	start_flow8(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+/*
+ * Test for matches up to 32 (2x16) flows at once,
+ * if matches exist - process them and start new flows.
+ */
+static inline void
+match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
+	__m256i pdata[4], __m256i di[2], __m256i idx[2], __m256i inp[2],
+	__m256i tr_lo[2], __m256i tr_hi[2])
+{
+	uint32_t n[2];
+	uint32_t rm[2];
+
+	/* check for matches */
+	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
+	rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_bytes_avx512x8(flow, &pdata[0],
+				rm[0], &di[0], sizeof(uint8_t));
+			first_trans8(flow, inp[0], rm[0], &tr_lo[0],
+				&tr_hi[0]);
+			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
+				ymm_match_mask.y);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_bytes_avx512x8(flow, &pdata[2],
+				rm[1], &di[1], sizeof(uint8_t));
+			first_trans8(flow, inp[1], rm[1], &tr_lo[1],
+				&tr_hi[1]);
+			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
+				ymm_match_mask.y);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 32 flows in parallel.
+ * Use two sets of metadata, each serves 16 flows max.
+ * So in fact we perform search for 2x16 flows.
+ */
+static inline void
+search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
+{
+	uint32_t fm[2];
+	__m256i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]);
+
+	in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0],
+			sizeof(uint8_t));
+	in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1],
+			sizeof(uint8_t));
+
+	first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT8_MAX;
+	fm[1] = UINT8_MAX;
+
+	/* match check */
+	match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = get_next_bytes_avx512x8(flow, &pdata[0], fm[0],
+			&di[0], sizeof(uint32_t));
+		in[1] = get_next_bytes_avx512x8(flow, &pdata[2], fm[1],
+			&di[1], sizeof(uint32_t));
+
+		/* main 4B loop */
+
+		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline __m256i
+resolve_match_idx_avx512x8(__m256i mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm256_slli_epi32(mi, match_log);
+}
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline __m256i
+resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask16 m;
+	__m256i cp, cr, np, nr, mch;
+
+	const __m256i zero = _mm256_set1_epi32(0);
+
+	/* get match indexes */
+	mch = _mm256_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x8(mch);
+
+	/* read result and priority values for first trie */
+	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	/*
+	 * read result and priority values for next tries and select one
+	 * with highest priority.
+	 */
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm256_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x8(mch);
+
+		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm256_cmpgt_epi32_mask(cp, np);
+		cr = _mm256_mask_mov_epi32(nr, m, cr);
+		cp = _mm256_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 8) matches for single category
+ */
+static inline void
+resolve_sc_avx512x8(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	__mmask8 msk;
+	__m256i cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
+	_mm256_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x8x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t j, k, n;
+	const int32_t *res, *pri;
+	__m256i cr[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
+
+		j = k + MASK8_BIT;
+
+		cr[0] = resolve_pri_avx512x8(res, pri, match + k, UINT8_MAX,
+				nb_trie, nb_pkt);
+		cr[1] = resolve_pri_avx512x8(res, pri, match + j, UINT8_MAX,
+				nb_trie, nb_pkt);
+
+		_mm256_storeu_si256((void *)(result + k), cr[0]);
+		_mm256_storeu_si256((void *)(result + j), cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > MASK8_BIT) {
+			resolve_sc_avx512x8(result + k, res, pri, match + k,
+				MASK8_BIT, nb_trie, nb_pkt);
+			k += MASK8_BIT;
+			n -= MASK8_BIT;
+		}
+		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x8x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 09/14] acl: update default classify algorithm selection
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (7 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 08/14] acl: introduce 256-bit width AVX512 classify implementation Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 10/14] acl: introduce 512-bit width AVX512 classify implementation Konstantin Ananyev
                         ` (5 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

On supported platforms, set RTE_ACL_CLASSIFY_AVX512X16 as the
default ACL classify algorithm.
Note that the AVX512X16 implementation uses 256-bit registers/instructions
only, to avoid the possibility of a frequency drop.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 1154f35107..245af672ee 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -228,6 +228,7 @@ acl_get_best_alg(void)
 #elif defined(RTE_ARCH_PPC_64)
 		RTE_ACL_CLASSIFY_ALTIVEC,
 #elif defined(RTE_ARCH_X86)
+		RTE_ACL_CLASSIFY_AVX512X16,
 		RTE_ACL_CLASSIFY_AVX2,
 		RTE_ACL_CLASSIFY_SSE,
 #endif
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 10/14] acl: introduce 512-bit width AVX512 classify implementation
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (8 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 09/14] acl: update default classify algorithm selection Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 11/14] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
                         ` (4 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Introduce a classify implementation that uses AVX512-specific ISA.
rte_acl_classify_avx512x32() is able to process up to 32 flows in parallel.
It uses 512-bit width registers/instructions and provides higher
performance than rte_acl_classify_avx512x16(), but can cause a CPU
frequency level change.
Note that for now only the 64-bit version is supported.
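
An illustrative sketch (assumed context name, not part of this patch)
of how an application can opt a given ACL context into the 512-bit
code path, since it is not picked by default:

    #include <rte_acl.h>

    /* Request the 512-bit wide classify method for this context only. */
    static int
    use_avx512x32(struct rte_acl_ctx *ctx)
    {
            /* returns a negative value when the platform or the
             * algorithm is not supported */
            return rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX512X32);
    }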

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
Depends-on: patch-79310 ("eal/x86: introduce AVX 512-bit type")

 .../prog_guide/packet_classif_access_ctrl.rst |  10 +
 doc/guides/rel_notes/release_20_11.rst        |   3 +
 lib/librte_acl/acl_run_avx512.c               |   6 +-
 lib/librte_acl/acl_run_avx512x16.h            | 732 ++++++++++++++++++
 4 files changed, 750 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index 11f4bc841b..7659af8eb5 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -383,10 +383,20 @@ There are several implementations of classify algorithm:
     flows in parallel. Uses 256-bit width SIMD registers.
     Requires AVX512 support.
 
+*   **RTE_ACL_CLASSIFY_AVX512X32**: vector implementation, can process up to 32
+    flows in parallel. Uses 512-bit width SIMD registers.
+    Requires AVX512 support.
+
 It is purely a runtime decision which method to choose, there is no build-time difference.
 All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation.
 
+.. note::
+
+     Right now ``RTE_ACL_CLASSIFY_AVX512X32`` is not selected by default
+     (as it may cause a CPU frequency level change), but it can be selected
+     at runtime by an application via the ACL API ``rte_acl_set_ctx_classify``.
+
 Application Programming Interface (API) Usage
 ---------------------------------------------
 
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index 95d7bfd777..001e46f595 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -112,6 +112,9 @@ New Features
   * Added new ``RTE_ACL_CLASSIFY_AVX512X16`` vector implementation,
     which can process up to 16 flows in parallel. Requires AVX512 support.
 
+  * Added new ``RTE_ACL_CLASSIFY_AVX512X32`` vector implementation,
+    which can process up to 32 flows in parallel. Requires AVX512 support.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index f5bc628b7c..74698fa2ea 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -132,6 +132,8 @@ rte_acl_classify_avx512x16(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	return rte_acl_classify_scalar(ctx, data, results, num, categories);
 }
 
+#include "acl_run_avx512x16.h"
+
 int
 rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
@@ -140,13 +142,15 @@ rte_acl_classify_avx512x32(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	/* split huge lookup (gt 256) into series of fixed size ones */
 	while (num > max_iter) {
-		search_avx512x8x2(ctx, data, results, max_iter, categories);
+		search_avx512x16x2(ctx, data, results, max_iter, categories);
 		data += max_iter;
 		results += max_iter * categories;
 		num -= max_iter;
 	}
 
 	/* select classify method based on number of remaining requests */
+	if (num >= 2 * MAX_SEARCHES_AVX16)
+		return search_avx512x16x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_AVX16)
 		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
new file mode 100644
index 0000000000..981f8d16da
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -0,0 +1,732 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define	MASK16_BIT	(sizeof(__mmask16) * CHAR_BIT)
+
+#define NUM_AVX512X16X2	(2 * MASK16_BIT)
+#define MSK_AVX512X16X2	(NUM_AVX512X16X2 - 1)
+
+/* num/mask of pointers per SIMD regs */
+#define ZMM_PTR_NUM	(sizeof(__m512i) / sizeof(uintptr_t))
+#define ZMM_PTR_MSK	RTE_LEN2MASK(ZMM_PTR_NUM, uint32_t)
+
+static const __rte_x86_zmm_t zmm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+		8, 9, 10, 11,
+		12, 13, 14, 15,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_pminp = {
+	.u32 = {
+		0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+		0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
+	},
+};
+
+static const __mmask16 zmm_pmidx_msk = 0x5555;
+
+static const __rte_x86_zmm_t zmm_pmidx[2] = {
+	[0] = {
+		.u32 = {
+			0, 0, 1, 0, 2, 0, 3, 0,
+			4, 0, 5, 0, 6, 0, 7, 0,
+		},
+	},
+	[1] = {
+		.u32 = {
+			8, 0, 9, 0, 10, 0, 11, 0,
+			12, 0, 13, 0, 14, 0, 15, 0,
+		},
+	},
+};
+
+/*
+ * unfortunately current AVX512 ISA doesn't provide ability for
+ * gather load on a byte quantity. So we have to mimic it in SW,
+ * by doing 8x1B scalar loads.
+ */
+static inline ymm_t
+_m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
+{
+	rte_ymm_t v;
+	__rte_x86_zmm_t p;
+
+	static const uint32_t zero;
+
+	p.z = _mm512_mask_set1_epi64(pdata, mask ^ ZMM_PTR_MSK,
+		(uintptr_t)&zero);
+
+	v.u32[0] = *(uint8_t *)p.u64[0];
+	v.u32[1] = *(uint8_t *)p.u64[1];
+	v.u32[2] = *(uint8_t *)p.u64[2];
+	v.u32[3] = *(uint8_t *)p.u64[3];
+	v.u32[4] = *(uint8_t *)p.u64[4];
+	v.u32[5] = *(uint8_t *)p.u64[5];
+	v.u32[6] = *(uint8_t *)p.u64[6];
+	v.u32[7] = *(uint8_t *)p.u64[7];
+
+	return v.y;
+}
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same ways as quad range nodes.
+ */
+static __rte_always_inline __m512i
+calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input,
+	__m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi)
+{
+	__mmask64 qm;
+	__mmask16 dfa_msk;
+	__m512i addr, in, node_type, r, t;
+	__m512i dfa_ofs, quad_ofs;
+
+	t = _mm512_xor_si512(index_mask, index_mask);
+	in = _mm512_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm512_andnot_si512(index_mask, tr_lo);
+	addr = _mm512_and_si512(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm512_srli_epi32(in, 30);
+	r = _mm512_add_epi8(r, range_base);
+	t = _mm512_srli_epi32(in, 24);
+	r = _mm512_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm512_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _mm512_cmpgt_epi8_mask(in, tr_hi);
+	t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
+	t = _mm512_lzcnt_epi32(t);
+	t = _mm512_srli_epi32(t, 3);
+	quad_ofs = _mm512_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _mm512_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 16 transitions in parallel.
+ * tr_lo contains low 32 bits for 16 transition.
+ * tr_hi contains high 32 bits for 16 transition.
+ * next_input contains up to 4 input bytes for 16 flows.
+ */
+static __rte_always_inline __m512i
+transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo,
+	__m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 16 transitions. */
+	addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z,
+		zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 16 transactions at once. */
+	*tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0]));
+
+	next_input = _mm512_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 16 transactions at once. */
+	*tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 16 flows in parallel.
+ * next_input should contain one input byte for up to 16 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static __rte_always_inline void
+first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input,
+	__mmask16 msk, __m512i *tr_lo, __m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm512_set1_epi32(UINT8_MAX);
+	root = _mm512_set1_epi32(flow->root_index);
+
+	addr = _mm512_and_si512(next_input, addr);
+	addr = _mm512_add_epi32(root, addr);
+
+	/* load lower 32 bits of 16 transactions at once. */
+	*tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 16 transactions at once. */
+	*tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to 16 flows in parallel.
+ * pdata - 8x2 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these 16 flows.
+ */
+static inline __m512i
+get_next_bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2],
+	uint32_t msk, __m512i *di, uint32_t bnum)
+{
+	const int32_t *div;
+	uint32_t m[2];
+	__m512i one, zero, t, p[2];
+	ymm_t inp[2];
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm512_set1_epi32(1);
+	zero = _mm512_xor_si512(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm512_mask_add_epi32(*di, msk, *di, one);
+
+	/*
+	 * unsigned expand 32-bit indexes to 64-bit
+	 * (for later pointer arithmetic), i.e:
+	 * for (i = 0; i != 16; i++)
+	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
+	 */
+	p[0] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[0].z, t);
+	p[1] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[1].z, t);
+
+	p[0] = _mm512_add_epi64(p[0], pdata[0]);
+	p[1] = _mm512_add_epi64(p[1], pdata[1]);
+
+	/* load input byte(s), either one or four */
+
+	m[0] = msk & ZMM_PTR_MSK;
+	m[1] = msk >> ZMM_PTR_NUM;
+
+	if (bnum == sizeof(uint8_t)) {
+		inp[0] = _m512_mask_gather_epi8x8(p[0], m[0]);
+		inp[1] = _m512_mask_gather_epi8x8(p[1], m[1]);
+	} else {
+		inp[0] = _mm512_mask_i64gather_epi32(
+				_mm512_castsi512_si256(zero), m[0], p[0],
+				NULL, sizeof(uint8_t));
+		inp[1] = _mm512_mask_i64gather_epi32(
+				_mm512_castsi512_si256(zero), m[1], p[1],
+				NULL, sizeof(uint8_t));
+	}
+
+	/* squeeze input into one 512-bit register */
+	return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]),
+			zmm_pminp.z, _mm512_castsi256_si512(inp[1]));
+}
+
+/*
+ * Start up to 16 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * idx - match indexes for given flows
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i pdata[2], __m512i *idx, __m512i *di)
+{
+	uint32_t n, m[2], nm[2];
+	__m512i ni, nd[2];
+
+	/* split mask into two - one for each pdata[] */
+	m[0] = msk & ZMM_PTR_MSK;
+	m[1] = msk >> ZMM_PTR_NUM;
+
+	/* calculate masks for new flows */
+	n = __builtin_popcount(m[0]);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	/* load input data pointers for new flows */
+	nd[0] = _mm512_maskz_loadu_epi64(nm[0],
+		flow->idata + flow->num_packets);
+	nd[1] = _mm512_maskz_loadu_epi64(nm[1],
+		flow->idata + flow->num_packets + n);
+
+	/* calculate match indexes of new flows */
+	ni = _mm512_set1_epi32(flow->num_packets);
+	ni = _mm512_add_epi32(ni, zmm_idx_add.z);
+
+	/* merge new and existing flows data */
+	pdata[0] = _mm512_mask_expand_epi64(pdata[0], m[0], nd[0]);
+	pdata[1] = _mm512_mask_expand_epi64(pdata[1], m[1], nd[1]);
+
+	/* update match and data indexes */
+	*idx = _mm512_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Process found matches for up to 16 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexes for given flows
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static inline uint32_t
+match_process_avx512x16(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx,
+	__m512i *tr_lo, __m512i *tr_hi)
+{
+	uint32_t n;
+	__m512i res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z);
+
+	/* mask  matched transitions to nop */
+	tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z);
+	tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z);
+
+	/* save found match indexes */
+	_mm512_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask(flow, fmsk, rmsk);
+	start_flow16(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+/*
+ * Test for matches up to 32 (2x16) flows at once,
+ * if matches exist - process them and start new flows.
+ */
+static inline void
+match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
+	__m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2],
+	__m512i tr_lo[2], __m512i tr_hi[2])
+{
+	uint32_t n[2];
+	uint32_t rm[2];
+
+	/* check for matches */
+	rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z);
+	rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
+				rm[0], &di[0], sizeof(uint8_t));
+			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
+				&tr_hi[0]);
+			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
+				zmm_match_mask.z);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
+				rm[1], &di[1], sizeof(uint8_t));
+			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
+				&tr_hi[1]);
+			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
+				zmm_match_mask.z);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 32 flows in parallel.
+ * Use two sets of metadata, each serves 16 flows max.
+ * So in fact we perform search for 2x16 flows.
+ */
+static inline void
+search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
+{
+	uint32_t fm[2];
+	__m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
+
+	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
+			sizeof(uint8_t));
+	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
+			sizeof(uint8_t));
+
+	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT16_MAX;
+	fm[1] = UINT16_MAX;
+
+	/* match check */
+	match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = get_next_bytes_avx512x16(flow, &pdata[0], fm[0],
+			&di[0], sizeof(uint32_t));
+		in[1] = get_next_bytes_avx512x16(flow, &pdata[2], fm[1],
+			&di[1], sizeof(uint32_t));
+
+		/* main 4B loop */
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * Resolve matches for multiple categories (GT 8, use 512b instructions/regs)
+ */
+static inline void
+resolve_mcgt8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, k, mi;
+	__mmask16 cm, sm;
+	__m512i cp, cr, np, nr;
+
+	const uint32_t match_log = 5;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	cm = (1 << nb_cat) - 1;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		cr = _mm512_maskz_loadu_epi32(cm, res + mi);
+		cp = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mi = pm[k] << match_log;
+
+			nr = _mm512_maskz_loadu_epi32(cm, res + mi);
+			np = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+			sm = _mm512_cmpgt_epi32_mask(cp, np);
+			cr = _mm512_mask_mov_epi32(nr, sm, cr);
+			cp = _mm512_mask_mov_epi32(np, sm, cp);
+		}
+
+		_mm512_mask_storeu_epi32(result, cm, cr);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline __m512i
+resolve_match_idx_avx512x16(__m512i mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm512_slli_epi32(mi, match_log);
+}
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline __m512i
+resolve_pri_avx512x16(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask16 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask16 m;
+	__m512i cp, cr, np, nr, mch;
+
+	const __m512i zero = _mm512_set1_epi32(0);
+
+	/* get match indexes */
+	mch = _mm512_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x16(mch);
+
+	/* read result and priority values for first trie */
+	cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	/*
+	 * read result and priority values for next tries and select one
+	 * with highest priority.
+	 */
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm512_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x16(mch);
+
+		nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm512_cmpgt_epi32_mask(cp, np);
+		cr = _mm512_mask_mov_epi32(nr, m, cr);
+		cp = _mm512_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 16) matches for single category
+ */
+static inline void
+resolve_sc_avx512x16(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	__mmask16 msk;
+	__m512i cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip);
+	_mm512_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x16x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t j, k, n;
+	const int32_t *res, *pri;
+	__m512i cr[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) {
+
+		j = k + MASK16_BIT;
+
+		cr[0] = resolve_pri_avx512x16(res, pri, match + k, UINT16_MAX,
+				nb_trie, nb_pkt);
+		cr[1] = resolve_pri_avx512x16(res, pri, match + j, UINT16_MAX,
+				nb_trie, nb_pkt);
+
+		_mm512_storeu_si512(result + k, cr[0]);
+		_mm512_storeu_si512(result + j, cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > MASK16_BIT) {
+			resolve_sc_avx512x16(result + k, res, pri, match + k,
+				MASK16_BIT, nb_trie, nb_pkt);
+			k += MASK16_BIT;
+			n -= MASK16_BIT;
+		}
+		resolve_sc_avx512x16(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x16x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x16x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 11/14] acl: for AVX512 classify use 4B load whenever possible
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (9 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 10/14] acl: introduce 512-bit width AVX512 classify implementation Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 12/14] acl: deduplicate AVX512 code paths Konstantin Ananyev
                         ` (3 subsequent siblings)
  14 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

With the current ACL implementation the first field in the rule definition
always has to be one byte long. Though for optimising the classify
implementation it might be useful to do 4B reads
(as we do for the rest of the fields).
So at build phase, check the user-provided field definitions to determine
whether it is safe to do 4B loads for the first ACL field.
Then at run-time this information can be used to choose the classify
behavior.
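
A worked example (hypothetical field layout, not taken from this patch):
for an IPv4 5-tuple style definition the first field (protocol, offset 0)
is not the field with the biggest offset, so the build phase would pick
sizeof(uint32_t) as the first load size and classify can safely read
4 bytes on the first access.

    #include <rte_acl.h>

    static struct rte_acl_field_def ipv4_defs[] = {
            { .type = RTE_ACL_FIELD_TYPE_BITMASK, .size = sizeof(uint8_t),
              .field_index = 0, .input_index = 0, .offset = 0 },  /* proto */
            { .type = RTE_ACL_FIELD_TYPE_MASK, .size = sizeof(uint32_t),
              .field_index = 1, .input_index = 1, .offset = 4 },  /* src IP */
            { .type = RTE_ACL_FIELD_TYPE_MASK, .size = sizeof(uint32_t),
              .field_index = 2, .input_index = 2, .offset = 8 },  /* dst IP */
            { .type = RTE_ACL_FIELD_TYPE_RANGE, .size = sizeof(uint16_t),
              .field_index = 3, .input_index = 3, .offset = 12 }, /* src port */
            { .type = RTE_ACL_FIELD_TYPE_RANGE, .size = sizeof(uint16_t),
              .field_index = 4, .input_index = 3, .offset = 14 }, /* dst port */
    };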

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h               |  1 +
 lib/librte_acl/acl_bld.c           | 34 ++++++++++++++++++++++++++++++
 lib/librte_acl/acl_run_avx512.c    |  2 ++
 lib/librte_acl/acl_run_avx512x16.h |  8 +++----
 lib/librte_acl/acl_run_avx512x8.h  |  8 +++----
 lib/librte_acl/rte_acl.c           |  1 +
 6 files changed, 46 insertions(+), 8 deletions(-)

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 7ac0d12f08..4089ab2a04 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -169,6 +169,7 @@ struct rte_acl_ctx {
 	int32_t             socket_id;
 	/** Socket ID to allocate memory from. */
 	enum rte_acl_classify_alg alg;
+	uint32_t           first_load_sz;
 	void               *rules;
 	uint32_t            max_rules;
 	uint32_t            rule_sz;
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index d1f920b09c..da10864cd8 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -1581,6 +1581,37 @@ acl_check_bld_param(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 	return 0;
 }
 
+/*
+ * With the current ACL implementation the first field in the rule
+ * definition always has to be one byte long. Though for optimising the
+ * *classify* implementation it might be useful to be able to use 4B
+ * reads (as we do for the rest of the fields).
+ * This function checks the input config to determine whether it is safe
+ * to do 4B loads for the first ACL field. For that we need to make sure
+ * that the first field in our rule definition doesn't have the biggest
+ * offset, i.e. we still have other fields located after the first one.
+ * Conversely, if the first field has the largest offset, it means the
+ * first field can occupy the very last byte of the input data buffer,
+ * and we have to do a single byte load for it.
+ */
+static uint32_t
+get_first_load_size(const struct rte_acl_config *cfg)
+{
+	uint32_t i, max_ofs, ofs;
+
+	ofs = 0;
+	max_ofs = 0;
+
+	for (i = 0; i != cfg->num_fields; i++) {
+		if (cfg->defs[i].field_index == 0)
+			ofs = cfg->defs[i].offset;
+		else if (max_ofs < cfg->defs[i].offset)
+			max_ofs = cfg->defs[i].offset;
+	}
+
+	return (ofs < max_ofs) ? sizeof(uint32_t) : sizeof(uint8_t);
+}
+
 int
 rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 {
@@ -1618,6 +1649,9 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 				/* set data indexes. */
 				acl_set_data_indexes(ctx);
 
+				/* determine can we always do 4B load */
+				ctx->first_load_sz = get_first_load_size(cfg);
+
 				/* copy in build config. */
 				ctx->config = *cfg;
 			}
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 74698fa2ea..3fd1e33c3f 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -11,6 +11,7 @@ struct acl_flow_avx512 {
 	uint32_t num_packets;       /* number of packets processed */
 	uint32_t total_packets;     /* max number of packets to process */
 	uint32_t root_index;        /* current root index */
+	uint32_t first_load_sz;     /* first load size for new packet */
 	const uint64_t *trans;      /* transition table */
 	const uint32_t *data_index; /* input data indexes */
 	const uint8_t **idata;      /* input data */
@@ -24,6 +25,7 @@ acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
 {
 	flow->num_packets = 0;
 	flow->total_packets = total_packets;
+	flow->first_load_sz = ctx->first_load_sz;
 	flow->root_index = ctx->trie[trie].root_index;
 	flow->trans = ctx->trans_table;
 	flow->data_index = ctx->trie[trie].data_index;
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
index 981f8d16da..a39df8f3c0 100644
--- a/lib/librte_acl/acl_run_avx512x16.h
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -460,7 +460,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[0] != 0) {
 			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
-				rm[0], &di[0], sizeof(uint8_t));
+				rm[0], &di[0], flow->first_load_sz);
 			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
 				&tr_hi[0]);
 			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
@@ -469,7 +469,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[1] != 0) {
 			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
-				rm[1], &di[1], sizeof(uint8_t));
+				rm[1], &di[1], flow->first_load_sz);
 			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
 				&tr_hi[1]);
 			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
@@ -494,9 +494,9 @@ search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
 	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
 
 	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
-			sizeof(uint8_t));
+			flow->first_load_sz);
 	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
-			sizeof(uint8_t));
+			flow->first_load_sz);
 
 	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
 	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
index cfba0299ed..fedd79b9ae 100644
--- a/lib/librte_acl/acl_run_avx512x8.h
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -418,7 +418,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[0] != 0) {
 			inp[0] = get_next_bytes_avx512x8(flow, &pdata[0],
-				rm[0], &di[0], sizeof(uint8_t));
+				rm[0], &di[0], flow->first_load_sz);
 			first_trans8(flow, inp[0], rm[0], &tr_lo[0],
 				&tr_hi[0]);
 			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
@@ -427,7 +427,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[1] != 0) {
 			inp[1] = get_next_bytes_avx512x8(flow, &pdata[2],
-				rm[1], &di[1], sizeof(uint8_t));
+				rm[1], &di[1], flow->first_load_sz);
 			first_trans8(flow, inp[1], rm[1], &tr_lo[1],
 				&tr_hi[1]);
 			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
@@ -452,9 +452,9 @@ search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
 	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]);
 
 	in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0],
-			sizeof(uint8_t));
+			flow->first_load_sz);
 	in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1],
-			sizeof(uint8_t));
+			flow->first_load_sz);
 
 	first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
 	first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 245af672ee..f1474038e5 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -500,6 +500,7 @@ rte_acl_dump(const struct rte_acl_ctx *ctx)
 	printf("acl context <%s>@%p\n", ctx->name, ctx);
 	printf("  socket_id=%"PRId32"\n", ctx->socket_id);
 	printf("  alg=%"PRId32"\n", ctx->alg);
+	printf("  first_load_sz=%"PRIu32"\n", ctx->first_load_sz);
 	printf("  max_rules=%"PRIu32"\n", ctx->max_rules);
 	printf("  rule_size=%"PRIu32"\n", ctx->rule_sz);
 	printf("  num_rules=%"PRIu32"\n", ctx->num_rules);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* [dpdk-dev] [PATCH v4 12/14] acl: deduplicate AVX512 code paths
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (10 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 11/14] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-16 15:56         ` Ferruh Yigit
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 13/14] test/acl: add AVX512 classify support Konstantin Ananyev
                         ` (2 subsequent siblings)
  14 siblings, 1 reply; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Current rte_acl_classify_avx512x32() and rte_acl_classify_avx512x16()
code paths are very similar. The only differences are due to the
256/512-bit register/intrinsics naming conventions.
So to deduplicate the code:
  - Move common code into "acl_run_avx512_common.h"
  - Use macros to hide the differences in naming conventions
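
A simplified sketch of the technique (not the actual file contents; only
a few of the macros are shown): each width-specific header defines the
type and naming macros, includes the common body, then undefines them so
the other width can redefine them within the same compilation unit.

    /* acl_run_avx512x16.h (sketch) */
    #define _T_simd   __m512i
    #define _F_(x)    x##_avx512x16
    #define _M_I_(x)  _mm512_##x
    #include "acl_run_avx512_common.h"
    #undef _M_I_
    #undef _F_
    #undef _T_simd

    /* acl_run_avx512x8.h (sketch) */
    #define _T_simd   __m256i
    #define _F_(x)    x##_avx512x8
    #define _M_I_(x)  _mm256_##x
    #include "acl_run_avx512_common.h"
    #undef _M_I_
    #undef _F_
    #undef _T_simd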

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_run_avx512_common.h | 477 +++++++++++++++++++++
 lib/librte_acl/acl_run_avx512x16.h     | 569 ++++---------------------
 lib/librte_acl/acl_run_avx512x8.h      | 565 ++++--------------------
 3 files changed, 654 insertions(+), 957 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512_common.h

diff --git a/lib/librte_acl/acl_run_avx512_common.h b/lib/librte_acl/acl_run_avx512_common.h
new file mode 100644
index 0000000000..1baf79b7ae
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512_common.h
@@ -0,0 +1,477 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+/*
+ * WARNING: It is not recommended to include this file directly.
+ * Please include "acl_run_avx512x*.h" instead.
+ * To make this file to generate proper code an includer has to
+ * define several macros, refer to "acl_run_avx512x*.h" for more details.
+ */
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same ways as quad range nodes.
+ */
+static __rte_always_inline _T_simd
+_F_(calc_addr)(_T_simd index_mask, _T_simd next_input, _T_simd shuffle_input,
+	_T_simd four_32, _T_simd range_base, _T_simd tr_lo, _T_simd tr_hi)
+{
+	__mmask64 qm;
+	_T_mask dfa_msk;
+	_T_simd addr, in, node_type, r, t;
+	_T_simd dfa_ofs, quad_ofs;
+
+	t = _M_SI_(xor)(index_mask, index_mask);
+	in = _M_I_(shuffle_epi8)(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _M_SI_(andnot)(index_mask, tr_lo);
+	addr = _M_SI_(and)(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _M_I_(cmpeq_epi32_mask)(node_type, t);
+
+	/* DFA calculations. */
+	r = _M_I_(srli_epi32)(in, 30);
+	r = _M_I_(add_epi8)(r, range_base);
+	t = _M_I_(srli_epi32)(in, 24);
+	r = _M_I_(shuffle_epi8)(tr_hi, r);
+
+	dfa_ofs = _M_I_(sub_epi32)(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _M_I_(cmpgt_epi8_mask)(in, tr_hi);
+	t = _M_I_(maskz_set1_epi8)(qm, (uint8_t)UINT8_MAX);
+	t = _M_I_(lzcnt_epi32)(t);
+	t = _M_I_(srli_epi32)(t, 3);
+	quad_ofs = _M_I_(sub_epi32)(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _M_I_(mask_mov_epi32)(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _M_I_(add_epi32)(addr, t);
+	return addr;
+}
+
+/*
+ * Process _N_ transitions in parallel.
+ * tr_lo contains low 32 bits for _N_ transitions.
+ * tr_hi contains high 32 bits for _N_ transitions.
+ * next_input contains up to 4 input bytes for _N_ flows.
+ */
+static __rte_always_inline _T_simd
+_F_(trans)(_T_simd next_input, const uint64_t *trans, _T_simd *tr_lo,
+	_T_simd *tr_hi)
+{
+	const int32_t *tr;
+	_T_simd addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all _N_ transitions. */
+	addr = _F_(calc_addr)(_SV_(index_mask), next_input, _SV_(shuffle_input),
+		_SV_(four_32), _SV_(range_base), *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of _N_ transactions at once. */
+	*tr_lo = _M_GI_(i32gather_epi32, addr, tr, sizeof(trans[0]));
+
+	next_input = _M_I_(srli_epi32)(next_input, CHAR_BIT);
+
+	/* load high 32 bits of _N_ transactions at once. */
+	*tr_hi = _M_GI_(i32gather_epi32, addr, (tr + 1), sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to _N_ flows in parallel.
+ * next_input should contain one input byte for up to _N_ flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to _N_ transitions.
+ * tr_hi contains high 32 bits for up to _N_ transitions.
+ */
+static __rte_always_inline void
+_F_(first_trans)(const struct acl_flow_avx512 *flow, _T_simd next_input,
+	_T_mask msk, _T_simd *tr_lo, _T_simd *tr_hi)
+{
+	const int32_t *tr;
+	_T_simd addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _M_I_(set1_epi32)(UINT8_MAX);
+	root = _M_I_(set1_epi32)(flow->root_index);
+
+	addr = _M_SI_(and)(next_input, addr);
+	addr = _M_I_(add_epi32)(root, addr);
+
+	/* load lower 32 bits of _N_ transactions at once. */
+	*tr_lo = _M_MGI_(mask_i32gather_epi32)(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of _N_ transactions at once. */
+	*tr_hi = _M_MGI_(mask_i32gather_epi32)(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to _N_ flows in parallel.
+ * pdata - 8x2 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these _N_ flows.
+ */
+static inline _T_simd
+_F_(get_next_bytes)(const struct acl_flow_avx512 *flow, _T_simd pdata[2],
+	uint32_t msk, _T_simd *di, uint32_t bnum)
+{
+	const int32_t *div;
+	uint32_t m[2];
+	_T_simd one, zero, t, p[2];
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _M_I_(set1_epi32)(1);
+	zero = _M_SI_(xor)(one, one);
+
+	/* load data offsets for given indexes */
+	t = _M_MGI_(mask_i32gather_epi32)(zero, msk, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _M_I_(mask_add_epi32)(*di, msk, *di, one);
+
+	/*
+	 * unsigned expand 32-bit indexes to 64-bit
+	 * (for later pointer arithmetic), i.e:
+	 * for (i = 0; i != _N_; i++)
+	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
+	 */
+	p[0] = _M_I_(maskz_permutexvar_epi32)(_SC_(pmidx_msk), _SV_(pmidx[0]),
+			t);
+	p[1] = _M_I_(maskz_permutexvar_epi32)(_SC_(pmidx_msk), _SV_(pmidx[1]),
+			t);
+
+	p[0] = _M_I_(add_epi64)(p[0], pdata[0]);
+	p[1] = _M_I_(add_epi64)(p[1], pdata[1]);
+
+	/* load input byte(s), either one or four */
+
+	m[0] = msk & _SIMD_PTR_MSK_;
+	m[1] = msk >> _SIMD_PTR_NUM_;
+
+	return _F_(gather_bytes)(zero, p, m, bnum);
+}
+
+/*
+ * Start up to _N_ new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * idx - match indexes for given flows
+ * di - data indexes for these flows.
+ */
+static inline void
+_F_(start_flow)(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	_T_simd pdata[2], _T_simd *idx, _T_simd *di)
+{
+	uint32_t n, m[2], nm[2];
+	_T_simd ni, nd[2];
+
+	/* split mask into two - one for each pdata[] */
+	m[0] = msk & _SIMD_PTR_MSK_;
+	m[1] = msk >> _SIMD_PTR_NUM_;
+
+	/* calculate masks for new flows */
+	n = __builtin_popcount(m[0]);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	/* load input data pointers for new flows */
+	nd[0] = _M_I_(maskz_loadu_epi64)(nm[0],
+			flow->idata + flow->num_packets);
+	nd[1] = _M_I_(maskz_loadu_epi64)(nm[1],
+			flow->idata + flow->num_packets + n);
+
+	/* calculate match indexes of new flows */
+	ni = _M_I_(set1_epi32)(flow->num_packets);
+	ni = _M_I_(add_epi32)(ni, _SV_(idx_add));
+
+	/* merge new and existing flows data */
+	pdata[0] = _M_I_(mask_expand_epi64)(pdata[0], m[0], nd[0]);
+	pdata[1] = _M_I_(mask_expand_epi64)(pdata[1], m[1], nd[1]);
+
+	/* update match and data indexes */
+	*idx = _M_I_(mask_expand_epi32)(*idx, msk, ni);
+	*di = _M_I_(maskz_mov_epi32)(msk ^ _SIMD_MASK_MAX_, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Process found matches for up to _N_ flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexes for given flows
+ * tr_lo contains low 32 bits for up to _N_ transitions.
+ * tr_hi contains high 32 bits for up to _N_ transitions.
+ */
+static inline uint32_t
+_F_(match_process)(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk, _T_simd pdata[2], _T_simd *di, _T_simd *idx,
+	_T_simd *tr_lo, _T_simd *tr_hi)
+{
+	uint32_t n;
+	_T_simd res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _M_SI_(and)(tr_lo[0], _SV_(index_mask));
+
+	/* mask  matched transitions to nop */
+	tr_lo[0] = _M_I_(mask_mov_epi32)(tr_lo[0], rmsk[0], _SV_(trlo_idle));
+	tr_hi[0] = _M_I_(mask_mov_epi32)(tr_hi[0], rmsk[0], _SV_(trhi_idle));
+
+	/* save found match indexes */
+	_M_I_(mask_i32scatter_epi32)(flow->matches, rmsk[0], idx[0], res,
+			sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask(flow, fmsk, rmsk);
+	_F_(start_flow)(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+/*
+ * Test for matches up to (2 * _N_) flows at once,
+ * if matches exist - process them and start new flows.
+ */
+static inline void
+_F_(match_check_process)(struct acl_flow_avx512 *flow, uint32_t fm[2],
+	_T_simd pdata[4], _T_simd di[2], _T_simd idx[2], _T_simd inp[2],
+	_T_simd tr_lo[2], _T_simd tr_hi[2])
+{
+	uint32_t n[2];
+	uint32_t rm[2];
+
+	/* check for matches */
+	rm[0] = _M_I_(test_epi32_mask)(tr_lo[0], _SV_(match_mask));
+	rm[1] = _M_I_(test_epi32_mask)(tr_lo[1], _SV_(match_mask));
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = _F_(match_process)(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = _F_(match_process)(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = _F_(get_next_bytes)(flow, &pdata[0],
+					rm[0], &di[0], flow->first_load_sz);
+			_F_(first_trans)(flow, inp[0], rm[0], &tr_lo[0],
+					&tr_hi[0]);
+			rm[0] = _M_I_(test_epi32_mask)(tr_lo[0],
+					_SV_(match_mask));
+		}
+
+		if (n[1] != 0) {
+			inp[1] = _F_(get_next_bytes)(flow, &pdata[2],
+					rm[1], &di[1], flow->first_load_sz);
+			_F_(first_trans)(flow, inp[1], rm[1], &tr_lo[1],
+					&tr_hi[1]);
+			rm[1] = _M_I_(test_epi32_mask)(tr_lo[1],
+					_SV_(match_mask));
+		}
+	}
+}
+
+/*
+ * Perform search for up to (2 * _N_) flows in parallel.
+ * Use two sets of metadata, each serves _N_ flows max.
+ */
+static inline void
+_F_(search_trie)(struct acl_flow_avx512 *flow)
+{
+	uint32_t fm[2];
+	_T_simd di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	_F_(start_flow)(flow, _SIMD_MASK_BIT_, _SIMD_MASK_MAX_,
+			&pdata[0], &idx[0], &di[0]);
+	_F_(start_flow)(flow, _SIMD_MASK_BIT_, _SIMD_MASK_MAX_,
+			&pdata[2], &idx[1], &di[1]);
+
+	in[0] = _F_(get_next_bytes)(flow, &pdata[0], _SIMD_MASK_MAX_, &di[0],
+			flow->first_load_sz);
+	in[1] = _F_(get_next_bytes)(flow, &pdata[2], _SIMD_MASK_MAX_, &di[1],
+			flow->first_load_sz);
+
+	_F_(first_trans)(flow, in[0], _SIMD_MASK_MAX_, &tr_lo[0], &tr_hi[0]);
+	_F_(first_trans)(flow, in[1], _SIMD_MASK_MAX_, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = _SIMD_MASK_MAX_;
+	fm[1] = _SIMD_MASK_MAX_;
+
+	/* match check */
+	_F_(match_check_process)(flow, fm, pdata, di, idx, in, tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = _F_(get_next_bytes)(flow, &pdata[0], fm[0],
+				&di[0], sizeof(uint32_t));
+		in[1] = _F_(get_next_bytes)(flow, &pdata[2], fm[1],
+				&di[1], sizeof(uint32_t));
+
+		/* main 4B loop */
+
+		in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = _F_(trans)(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = _F_(trans)(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		_F_(match_check_process)(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline _T_simd
+_F_(resolve_match_idx)(_T_simd mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _M_I_(slli_epi32)(mi, match_log);
+}
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline _T_simd
+_F_(resolve_pri)(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], _T_mask msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	_T_mask m;
+	_T_simd cp, cr, np, nr, mch;
+
+	const _T_simd zero = _M_I_(set1_epi32)(0);
+
+	/* get match indexes */
+	mch = _M_I_(maskz_loadu_epi32)(msk, match);
+	mch = _F_(resolve_match_idx)(mch);
+
+	/* read result and priority values for first trie */
+	cr = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, res, sizeof(res[0]));
+	cp = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, pri, sizeof(pri[0]));
+
+	/*
+	 * read result and priority values for next tries and select one
+	 * with highest priority.
+	 */
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _M_I_(maskz_loadu_epi32)(msk, pm);
+		mch = _F_(resolve_match_idx)(mch);
+
+		nr = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, res,
+				sizeof(res[0]));
+		np = _M_MGI_(mask_i32gather_epi32)(zero, msk, mch, pri,
+				sizeof(pri[0]));
+
+		m = _M_I_(cmpgt_epi32_mask)(cp, np);
+		cr = _M_I_(mask_mov_epi32)(nr, m, cr);
+		cp = _M_I_(mask_mov_epi32)(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= _N_) matches for single category
+ */
+static inline void
+_F_(resolve_sc)(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	_T_mask msk;
+	_T_simd cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = _F_(resolve_pri)(res, pri, match, msk, nb_trie, nb_skip);
+	_M_I_(mask_storeu_epi32)(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+_F_(resolve_single_cat)(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t j, k, n;
+	const int32_t *res, *pri;
+	_T_simd cr[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~_SIMD_FLOW_MSK_); k += _SIMD_FLOW_NUM_) {
+
+		j = k + _SIMD_MASK_BIT_;
+
+		cr[0] = _F_(resolve_pri)(res, pri, match + k, _SIMD_MASK_MAX_,
+				nb_trie, nb_pkt);
+		cr[1] = _F_(resolve_pri)(res, pri, match + j, _SIMD_MASK_MAX_,
+				nb_trie, nb_pkt);
+
+		_M_SI_(storeu)((void *)(result + k), cr[0]);
+		_M_SI_(storeu)((void *)(result + j), cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > _SIMD_MASK_BIT_) {
+			_F_(resolve_sc)(result + k, res, pri, match + k,
+				_SIMD_MASK_BIT_, nb_trie, nb_pkt);
+			k += _SIMD_MASK_BIT_;
+			n -= _SIMD_MASK_BIT_;
+		}
+		_F_(resolve_sc)(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
index a39df8f3c0..da244bc257 100644
--- a/lib/librte_acl/acl_run_avx512x16.h
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -2,16 +2,57 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
-#define	MASK16_BIT	(sizeof(__mmask16) * CHAR_BIT)
+/*
+ * Defines required by "acl_run_avx512_common.h".
+ * Note that all of them have to be undefined by the end
+ * of this file, as "acl_run_avx512_common.h" can be included several
+ * times from different *.h files for the same *.c.
+ */
+
+/*
+ * This implementation uses 512-bit registers (zmm) and intrinsics.
+ * So our main SIMD type is 512-bit wide and each such variable can
+ * process sizeof(__m512i) / sizeof(uint32_t) == 16 entries in parallel.
+ */
+#define _T_simd		__m512i
+#define _T_mask		__mmask16
+
+/* Naming convention for static const variables. */
+#define _SC_(x)		zmm_##x
+#define _SV_(x)		(zmm_##x.z)
+
+/* Naming convention for internal functions. */
+#define _F_(x)		x##_avx512x16
+
+/*
+ * The same intrinsics have different syntax (depending on the bit width),
+ * so to overcome that a few macros need to be defined.
+ */
+
+/* Naming convention for generic epi (packed integers) type intrinsics. */
+#define _M_I_(x)	_mm512_##x
+
+/* Naming convention for si (whole SIMD integer) type intrinsics. */
+#define _M_SI_(x)	_mm512_##x##_si512
+
+/* Naming convention for masked gather type intrinsics. */
+#define _M_MGI_(x)	_mm512_##x
+
+/* Naming convention for gather type intrinsics. */
+#define _M_GI_(name, idx, base, scale)	_mm512_##name(idx, base, scale)
 
-#define NUM_AVX512X16X2	(2 * MASK16_BIT)
-#define MSK_AVX512X16X2	(NUM_AVX512X16X2 - 1)
+/* num/mask of transitions per SIMD regs */
+#define _SIMD_MASK_BIT_	(sizeof(_T_simd) / sizeof(uint32_t))
+#define _SIMD_MASK_MAX_	RTE_LEN2MASK(_SIMD_MASK_BIT_, uint32_t)
+
+#define _SIMD_FLOW_NUM_	(2 * _SIMD_MASK_BIT_)
+#define _SIMD_FLOW_MSK_	(_SIMD_FLOW_NUM_ - 1)
 
 /* num/mask of pointers per SIMD regs */
-#define ZMM_PTR_NUM	(sizeof(__m512i) / sizeof(uintptr_t))
-#define ZMM_PTR_MSK	RTE_LEN2MASK(ZMM_PTR_NUM, uint32_t)
+#define _SIMD_PTR_NUM_	(sizeof(_T_simd) / sizeof(uintptr_t))
+#define _SIMD_PTR_MSK_	RTE_LEN2MASK(_SIMD_PTR_NUM_, uint32_t)
 
-static const __rte_x86_zmm_t zmm_match_mask = {
+static const __rte_x86_zmm_t _SC_(match_mask) = {
 	.u32 = {
 		RTE_ACL_NODE_MATCH,
 		RTE_ACL_NODE_MATCH,
@@ -32,7 +73,7 @@ static const __rte_x86_zmm_t zmm_match_mask = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_index_mask = {
+static const __rte_x86_zmm_t _SC_(index_mask) = {
 	.u32 = {
 		RTE_ACL_NODE_INDEX,
 		RTE_ACL_NODE_INDEX,
@@ -53,7 +94,7 @@ static const __rte_x86_zmm_t zmm_index_mask = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_trlo_idle = {
+static const __rte_x86_zmm_t _SC_(trlo_idle) = {
 	.u32 = {
 		RTE_ACL_IDLE_NODE,
 		RTE_ACL_IDLE_NODE,
@@ -74,7 +115,7 @@ static const __rte_x86_zmm_t zmm_trlo_idle = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_trhi_idle = {
+static const __rte_x86_zmm_t _SC_(trhi_idle) = {
 	.u32 = {
 		0, 0, 0, 0,
 		0, 0, 0, 0,
@@ -83,7 +124,7 @@ static const __rte_x86_zmm_t zmm_trhi_idle = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_shuffle_input = {
+static const __rte_x86_zmm_t _SC_(shuffle_input) = {
 	.u32 = {
 		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
 		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
@@ -92,7 +133,7 @@ static const __rte_x86_zmm_t zmm_shuffle_input = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_four_32 = {
+static const __rte_x86_zmm_t _SC_(four_32) = {
 	.u32 = {
 		4, 4, 4, 4,
 		4, 4, 4, 4,
@@ -101,7 +142,7 @@ static const __rte_x86_zmm_t zmm_four_32 = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_idx_add = {
+static const __rte_x86_zmm_t _SC_(idx_add) = {
 	.u32 = {
 		0, 1, 2, 3,
 		4, 5, 6, 7,
@@ -110,7 +151,7 @@ static const __rte_x86_zmm_t zmm_idx_add = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_range_base = {
+static const __rte_x86_zmm_t _SC_(range_base) = {
 	.u32 = {
 		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
 		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
@@ -119,16 +160,16 @@ static const __rte_x86_zmm_t zmm_range_base = {
 	},
 };
 
-static const __rte_x86_zmm_t zmm_pminp = {
+static const __rte_x86_zmm_t _SC_(pminp) = {
 	.u32 = {
 		0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
 		0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
 	},
 };
 
-static const __mmask16 zmm_pmidx_msk = 0x5555;
+static const _T_mask _SC_(pmidx_msk) = 0x5555;
 
-static const __rte_x86_zmm_t zmm_pmidx[2] = {
+static const __rte_x86_zmm_t _SC_(pmidx[2]) = {
 	[0] = {
 		.u32 = {
 			0, 0, 1, 0, 2, 0, 3, 0,
@@ -148,7 +189,7 @@ static const __rte_x86_zmm_t zmm_pmidx[2] = {
  * gather load on a byte quantity. So we have to mimic it in SW,
  * by doing 8x1B scalar loads.
  */
-static inline ymm_t
+static inline __m256i
 _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 {
 	rte_ymm_t v;
@@ -156,7 +197,7 @@ _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 
 	static const uint32_t zero;
 
-	p.z = _mm512_mask_set1_epi64(pdata, mask ^ ZMM_PTR_MSK,
+	p.z = _mm512_mask_set1_epi64(pdata, mask ^ _SIMD_PTR_MSK_,
 		(uintptr_t)&zero);
 
 	v.u32[0] = *(uint8_t *)p.u64[0];
@@ -172,369 +213,29 @@ _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 }
 
 /*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes are not supposed to be encountered here.
- * For quad range nodes:
- * Calculate number of range boundaries that are less than the
- * input value. Range boundaries for each node are in signed 8 bit,
- * ordered from -128 to 127.
- * This is effectively a popcnt of bytes that are greater than the
- * input byte.
- * Single nodes are processed in the same ways as quad range nodes.
- */
-static __rte_always_inline __m512i
-calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input,
-	__m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi)
-{
-	__mmask64 qm;
-	__mmask16 dfa_msk;
-	__m512i addr, in, node_type, r, t;
-	__m512i dfa_ofs, quad_ofs;
-
-	t = _mm512_xor_si512(index_mask, index_mask);
-	in = _mm512_shuffle_epi8(next_input, shuffle_input);
-
-	/* Calc node type and node addr */
-	node_type = _mm512_andnot_si512(index_mask, tr_lo);
-	addr = _mm512_and_si512(index_mask, tr_lo);
-
-	/* mask for DFA type(0) nodes */
-	dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t);
-
-	/* DFA calculations. */
-	r = _mm512_srli_epi32(in, 30);
-	r = _mm512_add_epi8(r, range_base);
-	t = _mm512_srli_epi32(in, 24);
-	r = _mm512_shuffle_epi8(tr_hi, r);
-
-	dfa_ofs = _mm512_sub_epi32(t, r);
-
-	/* QUAD/SINGLE calculations. */
-	qm = _mm512_cmpgt_epi8_mask(in, tr_hi);
-	t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
-	t = _mm512_lzcnt_epi32(t);
-	t = _mm512_srli_epi32(t, 3);
-	quad_ofs = _mm512_sub_epi32(four_32, t);
-
-	/* blend DFA and QUAD/SINGLE. */
-	t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
-
-	/* calculate address for next transitions. */
-	addr = _mm512_add_epi32(addr, t);
-	return addr;
-}
-
-/*
- * Process 16 transitions in parallel.
- * tr_lo contains low 32 bits for 16 transition.
- * tr_hi contains high 32 bits for 16 transition.
- * next_input contains up to 4 input bytes for 16 flows.
+ * Gather 4/1 input bytes for up to 16 (2*8) locations in parallel.
  */
 static __rte_always_inline __m512i
-transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo,
-	__m512i *tr_hi)
-{
-	const int32_t *tr;
-	__m512i addr;
-
-	tr = (const int32_t *)(uintptr_t)trans;
-
-	/* Calculate the address (array index) for all 16 transitions. */
-	addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z,
-		zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi);
-
-	/* load lower 32 bits of 16 transactions at once. */
-	*tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0]));
-
-	next_input = _mm512_srli_epi32(next_input, CHAR_BIT);
-
-	/* load high 32 bits of 16 transactions at once. */
-	*tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0]));
-
-	return next_input;
-}
-
-/*
- * Execute first transition for up to 16 flows in parallel.
- * next_input should contain one input byte for up to 16 flows.
- * msk - mask of active flows.
- * tr_lo contains low 32 bits for up to 16 transitions.
- * tr_hi contains high 32 bits for up to 16 transitions.
- */
-static __rte_always_inline void
-first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input,
-	__mmask16 msk, __m512i *tr_lo, __m512i *tr_hi)
+_F_(gather_bytes)(__m512i zero, const __m512i p[2], const uint32_t m[2],
+	uint32_t bnum)
 {
-	const int32_t *tr;
-	__m512i addr, root;
-
-	tr = (const int32_t *)(uintptr_t)flow->trans;
-
-	addr = _mm512_set1_epi32(UINT8_MAX);
-	root = _mm512_set1_epi32(flow->root_index);
-
-	addr = _mm512_and_si512(next_input, addr);
-	addr = _mm512_add_epi32(root, addr);
-
-	/* load lower 32 bits of 16 transactions at once. */
-	*tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr,
-		sizeof(flow->trans[0]));
-
-	/* load high 32 bits of 16 transactions at once. */
-	*tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
-		sizeof(flow->trans[0]));
-}
-
-/*
- * Load and return next 4 input bytes for up to 16 flows in parallel.
- * pdata - 8x2 pointers to flow input data
- * mask - mask of active flows.
- * di - data indexes for these 16 flows.
- */
-static inline __m512i
-get_next_bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2],
-	uint32_t msk, __m512i *di, uint32_t bnum)
-{
-	const int32_t *div;
-	uint32_t m[2];
-	__m512i one, zero, t, p[2];
-	ymm_t inp[2];
-
-	div = (const int32_t *)flow->data_index;
-
-	one = _mm512_set1_epi32(1);
-	zero = _mm512_xor_si512(one, one);
-
-	/* load data offsets for given indexes */
-	t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
-
-	/* increment data indexes */
-	*di = _mm512_mask_add_epi32(*di, msk, *di, one);
-
-	/*
-	 * unsigned expand 32-bit indexes to 64-bit
-	 * (for later pointer arithmetic), i.e:
-	 * for (i = 0; i != 16; i++)
-	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
-	 */
-	p[0] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[0].z, t);
-	p[1] = _mm512_maskz_permutexvar_epi32(zmm_pmidx_msk, zmm_pmidx[1].z, t);
-
-	p[0] = _mm512_add_epi64(p[0], pdata[0]);
-	p[1] = _mm512_add_epi64(p[1], pdata[1]);
-
-	/* load input byte(s), either one or four */
-
-	m[0] = msk & ZMM_PTR_MSK;
-	m[1] = msk >> ZMM_PTR_NUM;
+	__m256i inp[2];
 
 	if (bnum == sizeof(uint8_t)) {
 		inp[0] = _m512_mask_gather_epi8x8(p[0], m[0]);
 		inp[1] = _m512_mask_gather_epi8x8(p[1], m[1]);
 	} else {
 		inp[0] = _mm512_mask_i64gather_epi32(
-				_mm512_castsi512_si256(zero), m[0], p[0],
-				NULL, sizeof(uint8_t));
+				_mm512_castsi512_si256(zero),
+				m[0], p[0], NULL, sizeof(uint8_t));
 		inp[1] = _mm512_mask_i64gather_epi32(
-				_mm512_castsi512_si256(zero), m[1], p[1],
-				NULL, sizeof(uint8_t));
+				_mm512_castsi512_si256(zero),
+				m[1], p[1], NULL, sizeof(uint8_t));
 	}
 
 	/* squeeze input into one 512-bit register */
 	return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]),
-			zmm_pminp.z, _mm512_castsi256_si512(inp[1]));
-}
-
-/*
- * Start up to 16 new flows.
- * num - number of flows to start
- * msk - mask of new flows.
- * pdata - pointers to flow input data
- * idx - match indexed for given flows
- * di - data indexes for these flows.
- */
-static inline void
-start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
-	__m512i pdata[2], __m512i *idx, __m512i *di)
-{
-	uint32_t n, m[2], nm[2];
-	__m512i ni, nd[2];
-
-	/* split mask into two - one for each pdata[] */
-	m[0] = msk & ZMM_PTR_MSK;
-	m[1] = msk >> ZMM_PTR_NUM;
-
-	/* calculate masks for new flows */
-	n = __builtin_popcount(m[0]);
-	nm[0] = (1 << n) - 1;
-	nm[1] = (1 << (num - n)) - 1;
-
-	/* load input data pointers for new flows */
-	nd[0] = _mm512_maskz_loadu_epi64(nm[0],
-		flow->idata + flow->num_packets);
-	nd[1] = _mm512_maskz_loadu_epi64(nm[1],
-		flow->idata + flow->num_packets + n);
-
-	/* calculate match indexes of new flows */
-	ni = _mm512_set1_epi32(flow->num_packets);
-	ni = _mm512_add_epi32(ni, zmm_idx_add.z);
-
-	/* merge new and existing flows data */
-	pdata[0] = _mm512_mask_expand_epi64(pdata[0], m[0], nd[0]);
-	pdata[1] = _mm512_mask_expand_epi64(pdata[1], m[1], nd[1]);
-
-	/* update match and data indexes */
-	*idx = _mm512_mask_expand_epi32(*idx, msk, ni);
-	*di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di);
-
-	flow->num_packets += num;
-}
-
-/*
- * Process found matches for up to 16 flows.
- * fmsk - mask of active flows
- * rmsk - mask of found matches
- * pdata - pointers to flow input data
- * di - data indexes for these flows
- * idx - match indexed for given flows
- * tr_lo contains low 32 bits for up to 8 transitions.
- * tr_hi contains high 32 bits for up to 8 transitions.
- */
-static inline uint32_t
-match_process_avx512x16(struct acl_flow_avx512 *flow, uint32_t *fmsk,
-	uint32_t *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx,
-	__m512i *tr_lo, __m512i *tr_hi)
-{
-	uint32_t n;
-	__m512i res;
-
-	if (rmsk[0] == 0)
-		return 0;
-
-	/* extract match indexes */
-	res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z);
-
-	/* mask  matched transitions to nop */
-	tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z);
-	tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z);
-
-	/* save found match indexes */
-	_mm512_mask_i32scatter_epi32(flow->matches, rmsk[0],
-		idx[0], res, sizeof(flow->matches[0]));
-
-	/* update masks and start new flows for matches */
-	n = update_flow_mask(flow, fmsk, rmsk);
-	start_flow16(flow, n, rmsk[0], pdata, idx, di);
-
-	return n;
-}
-
-/*
- * Test for matches ut to 32 (2x16) flows at once,
- * if matches exist - process them and start new flows.
- */
-static inline void
-match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
-	__m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2],
-	__m512i tr_lo[2], __m512i tr_hi[2])
-{
-	uint32_t n[2];
-	uint32_t rm[2];
-
-	/* check for matches */
-	rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z);
-	rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z);
-
-	/* till unprocessed matches exist */
-	while ((rm[0] | rm[1]) != 0) {
-
-		/* process matches and start new flows */
-		n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0],
-			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
-		n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2],
-			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
-
-		/* execute first transition for new flows, if any */
-
-		if (n[0] != 0) {
-			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
-				rm[0], &di[0], flow->first_load_sz);
-			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
-				&tr_hi[0]);
-			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
-				zmm_match_mask.z);
-		}
-
-		if (n[1] != 0) {
-			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
-				rm[1], &di[1], flow->first_load_sz);
-			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
-				&tr_hi[1]);
-			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
-				zmm_match_mask.z);
-		}
-	}
-}
-
-/*
- * Perform search for up to 32 flows in parallel.
- * Use two sets of metadata, each serves 16 flows max.
- * So in fact we perform search for 2x16 flows.
- */
-static inline void
-search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
-{
-	uint32_t fm[2];
-	__m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
-
-	/* first 1B load */
-	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]);
-	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
-
-	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
-			flow->first_load_sz);
-	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
-			flow->first_load_sz);
-
-	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
-	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
-
-	fm[0] = UINT16_MAX;
-	fm[1] = UINT16_MAX;
-
-	/* match check */
-	match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
-		tr_lo, tr_hi);
-
-	while ((fm[0] | fm[1]) != 0) {
-
-		/* load next 4B */
-
-		in[0] = get_next_bytes_avx512x16(flow, &pdata[0], fm[0],
-			&di[0], sizeof(uint32_t));
-		in[1] = get_next_bytes_avx512x16(flow, &pdata[2], fm[1],
-			&di[1], sizeof(uint32_t));
-
-		/* main 4B loop */
-
-		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		/* check for matches */
-		match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
-			tr_lo, tr_hi);
-	}
+			_SV_(pminp), _mm512_castsi256_si512(inp[1]));
 }
 
 /*
@@ -582,120 +283,12 @@ resolve_mcgt8_avx512x1(uint32_t result[],
 	}
 }
 
-/*
- * resolve match index to actual result/priority offset.
- */
-static inline __m512i
-resolve_match_idx_avx512x16(__m512i mi)
-{
-	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
-		1 << (match_log + 2));
-	return _mm512_slli_epi32(mi, match_log);
-}
-
-/*
- * Resolve multiple matches for the same flow based on priority.
- */
-static inline __m512i
-resolve_pri_avx512x16(const int32_t res[], const int32_t pri[],
-	const uint32_t match[], __mmask16 msk, uint32_t nb_trie,
-	uint32_t nb_skip)
-{
-	uint32_t i;
-	const uint32_t *pm;
-	__mmask16 m;
-	__m512i cp, cr, np, nr, mch;
-
-	const __m512i zero = _mm512_set1_epi32(0);
-
-	/* get match indexes */
-	mch = _mm512_maskz_loadu_epi32(msk, match);
-	mch = resolve_match_idx_avx512x16(mch);
-
-	/* read result and priority values for first trie */
-	cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
-	cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
-
-	/*
-	 * read result and priority values for next tries and select one
-	 * with highest priority.
-	 */
-	for (i = 1, pm = match + nb_skip; i != nb_trie;
-			i++, pm += nb_skip) {
-
-		mch = _mm512_maskz_loadu_epi32(msk, pm);
-		mch = resolve_match_idx_avx512x16(mch);
-
-		nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res,
-			sizeof(res[0]));
-		np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri,
-			sizeof(pri[0]));
-
-		m = _mm512_cmpgt_epi32_mask(cp, np);
-		cr = _mm512_mask_mov_epi32(nr, m, cr);
-		cp = _mm512_mask_mov_epi32(np, m, cp);
-	}
-
-	return cr;
-}
-
-/*
- * Resolve num (<= 16) matches for single category
- */
-static inline void
-resolve_sc_avx512x16(uint32_t result[], const int32_t res[],
-	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
-	uint32_t nb_trie, uint32_t nb_skip)
-{
-	__mmask16 msk;
-	__m512i cr;
-
-	msk = (1 << nb_pkt) - 1;
-	cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip);
-	_mm512_mask_storeu_epi32(result, msk, cr);
-}
+#include "acl_run_avx512_common.h"
 
 /*
- * Resolve matches for single category
+ * Perform search for up to (2 * 16) flows in parallel.
+ * Use two sets of metadata, each serving up to 16 flows.
  */
-static inline void
-resolve_sc_avx512x16x2(uint32_t result[],
-	const struct rte_acl_match_results pr[], const uint32_t match[],
-	uint32_t nb_pkt, uint32_t nb_trie)
-{
-	uint32_t j, k, n;
-	const int32_t *res, *pri;
-	__m512i cr[2];
-
-	res = (const int32_t *)pr->results;
-	pri = pr->priority;
-
-	for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) {
-
-		j = k + MASK16_BIT;
-
-		cr[0] = resolve_pri_avx512x16(res, pri, match + k, UINT16_MAX,
-				nb_trie, nb_pkt);
-		cr[1] = resolve_pri_avx512x16(res, pri, match + j, UINT16_MAX,
-				nb_trie, nb_pkt);
-
-		_mm512_storeu_si512(result + k, cr[0]);
-		_mm512_storeu_si512(result + j, cr[1]);
-	}
-
-	n = nb_pkt - k;
-	if (n != 0) {
-		if (n > MASK16_BIT) {
-			resolve_sc_avx512x16(result + k, res, pri, match + k,
-				MASK16_BIT, nb_trie, nb_pkt);
-			k += MASK16_BIT;
-			n -= MASK16_BIT;
-		}
-		resolve_sc_avx512x16(result + k, res, pri, match + k, n,
-				nb_trie, nb_pkt);
-	}
-}
-
 static inline int
 search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t total_packets, uint32_t categories)
@@ -711,7 +304,7 @@ search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
 
 		/* process the trie */
-		search_trie_avx512x16x2(&flow);
+		_F_(search_trie)(&flow);
 	}
 
 	/* resolve matches */
@@ -719,7 +312,7 @@ search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		(ctx->trans_table + ctx->match_index);
 
 	if (categories == 1)
-		resolve_sc_avx512x16x2(results, pr, match, total_packets,
+		_F_(resolve_single_cat)(results, pr, match, total_packets,
 			ctx->num_tries);
 	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
 		resolve_mcle8_avx512x1(results, pr, match, total_packets,
@@ -730,3 +323,19 @@ search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	return 0;
 }
+
+#undef _SIMD_PTR_MSK_
+#undef _SIMD_PTR_NUM_
+#undef _SIMD_FLOW_MSK_
+#undef _SIMD_FLOW_NUM_
+#undef _SIMD_MASK_MAX_
+#undef _SIMD_MASK_BIT_
+#undef _M_GI_
+#undef _M_MGI_
+#undef _M_SI_
+#undef _M_I_
+#undef _F_
+#undef _SV_
+#undef _SC_
+#undef _T_mask
+#undef _T_simd
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
index fedd79b9ae..61ac9d1b47 100644
--- a/lib/librte_acl/acl_run_avx512x8.h
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -2,16 +2,57 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
-#define MASK8_BIT	(sizeof(__mmask8) * CHAR_BIT)
+/*
+ * Defines required by "acl_run_avx512_common.h".
+ * Note that all of them have to be undefined by the end
+ * of this file, as "acl_run_avx512_common.h" can be included several
+ * times from different *.h files for the same *.c.
+ */
+
+/*
+ * This implementation uses 256-bit registers (ymm) and intrinsics.
+ * So our main SIMD type is 256 bits wide and each such variable can
+ * process sizeof(__m256i) / sizeof(uint32_t) == 8 entries in parallel.
+ */
+#define _T_simd		__m256i
+#define _T_mask		__mmask8
+
+/* Naming convention for static const variables. */
+#define _SC_(x)		ymm_##x
+#define _SV_(x)		(ymm_##x.y)
+
+/* Naming convention for internal functions. */
+#define _F_(x)		x##_avx512x8
+
+/*
+ * The same intrinsics have different syntax (depending on the bit-width),
+ * so to overcome that a few macros need to be defined.
+ */
+
+/* Naming convention for generic epi (packed integers) type intrinsics. */
+#define _M_I_(x)	_mm256_##x
+
+/* Naming convention for si (whole SIMD integer) type intrinsics. */
+#define _M_SI_(x)	_mm256_##x##_si256
 
-#define NUM_AVX512X8X2	(2 * MASK8_BIT)
-#define MSK_AVX512X8X2	(NUM_AVX512X8X2 - 1)
+/* Naming convention for masked gather type intrinsics. */
+#define _M_MGI_(x)	_mm256_m##x
+
+/* Naming convention for gather type intrinsics. */
+#define _M_GI_(name, idx, base, scale)	_mm256_##name(base, idx, scale)
+
+/* num/mask of transitions per SIMD regs */
+#define _SIMD_MASK_BIT_	(sizeof(_T_simd) / sizeof(uint32_t))
+#define _SIMD_MASK_MAX_	RTE_LEN2MASK(_SIMD_MASK_BIT_, uint32_t)
+
+#define _SIMD_FLOW_NUM_	(2 * _SIMD_MASK_BIT_)
+#define _SIMD_FLOW_MSK_	(_SIMD_FLOW_NUM_ - 1)
 
 /* num/mask of pointers per SIMD regs */
-#define YMM_PTR_NUM	(sizeof(__m256i) / sizeof(uintptr_t))
-#define YMM_PTR_MSK	RTE_LEN2MASK(YMM_PTR_NUM, uint32_t)
+#define _SIMD_PTR_NUM_	(sizeof(_T_simd) / sizeof(uintptr_t))
+#define _SIMD_PTR_MSK_	RTE_LEN2MASK(_SIMD_PTR_NUM_, uint32_t)
 
-static const rte_ymm_t ymm_match_mask = {
+static const rte_ymm_t _SC_(match_mask) = {
 	.u32 = {
 		RTE_ACL_NODE_MATCH,
 		RTE_ACL_NODE_MATCH,
@@ -24,7 +65,7 @@ static const rte_ymm_t ymm_match_mask = {
 	},
 };
 
-static const rte_ymm_t ymm_index_mask = {
+static const rte_ymm_t _SC_(index_mask) = {
 	.u32 = {
 		RTE_ACL_NODE_INDEX,
 		RTE_ACL_NODE_INDEX,
@@ -37,7 +78,7 @@ static const rte_ymm_t ymm_index_mask = {
 	},
 };
 
-static const rte_ymm_t ymm_trlo_idle = {
+static const rte_ymm_t _SC_(trlo_idle) = {
 	.u32 = {
 		RTE_ACL_IDLE_NODE,
 		RTE_ACL_IDLE_NODE,
@@ -50,51 +91,51 @@ static const rte_ymm_t ymm_trlo_idle = {
 	},
 };
 
-static const rte_ymm_t ymm_trhi_idle = {
+static const rte_ymm_t _SC_(trhi_idle) = {
 	.u32 = {
 		0, 0, 0, 0,
 		0, 0, 0, 0,
 	},
 };
 
-static const rte_ymm_t ymm_shuffle_input = {
+static const rte_ymm_t _SC_(shuffle_input) = {
 	.u32 = {
 		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
 		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
 	},
 };
 
-static const rte_ymm_t ymm_four_32 = {
+static const rte_ymm_t _SC_(four_32) = {
 	.u32 = {
 		4, 4, 4, 4,
 		4, 4, 4, 4,
 	},
 };
 
-static const rte_ymm_t ymm_idx_add = {
+static const rte_ymm_t _SC_(idx_add) = {
 	.u32 = {
 		0, 1, 2, 3,
 		4, 5, 6, 7,
 	},
 };
 
-static const rte_ymm_t ymm_range_base = {
+static const rte_ymm_t _SC_(range_base) = {
 	.u32 = {
 		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
 		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
 	},
 };
 
-static const rte_ymm_t ymm_pminp = {
+static const rte_ymm_t _SC_(pminp) = {
 	.u32 = {
 		0x00, 0x01, 0x02, 0x03,
 		0x08, 0x09, 0x0a, 0x0b,
 	},
 };
 
-static const __mmask16 ymm_pmidx_msk = 0x55;
+static const __mmask16 _SC_(pmidx_msk) = 0x55;
 
-static const rte_ymm_t ymm_pmidx[2] = {
+static const rte_ymm_t _SC_(pmidx[2]) = {
 	[0] = {
 		.u32 = {
 			0, 0, 1, 0, 2, 0, 3, 0,
@@ -120,7 +161,7 @@ _m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask)
 
 	static const uint32_t zero;
 
-	p.y = _mm256_mask_set1_epi64(pdata, mask ^ YMM_PTR_MSK,
+	p.y = _mm256_mask_set1_epi64(pdata, mask ^ _SIMD_PTR_MSK_,
 		(uintptr_t)&zero);
 
 	v.u32[0] = *(uint8_t *)p.u64[0];
@@ -132,483 +173,37 @@ _m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask)
 }
 
 /*
- * Calculate the address of the next transition for
- * all types of nodes. Note that only DFA nodes and range
- * nodes actually transition to another node. Match
- * nodes not supposed to be encountered here.
- * For quad range nodes:
- * Calculate number of range boundaries that are less than the
- * input value. Range boundaries for each node are in signed 8 bit,
- * ordered from -128 to 127.
- * This is effectively a popcnt of bytes that are greater than the
- * input byte.
- * Single nodes are processed in the same ways as quad range nodes.
- */
-static __rte_always_inline __m256i
-calc_addr8(__m256i index_mask, __m256i next_input, __m256i shuffle_input,
-	__m256i four_32, __m256i range_base, __m256i tr_lo, __m256i tr_hi)
-{
-	__mmask32 qm;
-	__mmask8 dfa_msk;
-	__m256i addr, in, node_type, r, t;
-	__m256i dfa_ofs, quad_ofs;
-
-	t = _mm256_xor_si256(index_mask, index_mask);
-	in = _mm256_shuffle_epi8(next_input, shuffle_input);
-
-	/* Calc node type and node addr */
-	node_type = _mm256_andnot_si256(index_mask, tr_lo);
-	addr = _mm256_and_si256(index_mask, tr_lo);
-
-	/* mask for DFA type(0) nodes */
-	dfa_msk = _mm256_cmpeq_epi32_mask(node_type, t);
-
-	/* DFA calculations. */
-	r = _mm256_srli_epi32(in, 30);
-	r = _mm256_add_epi8(r, range_base);
-	t = _mm256_srli_epi32(in, 24);
-	r = _mm256_shuffle_epi8(tr_hi, r);
-
-	dfa_ofs = _mm256_sub_epi32(t, r);
-
-	/* QUAD/SINGLE calculations. */
-	qm = _mm256_cmpgt_epi8_mask(in, tr_hi);
-	t = _mm256_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
-	t = _mm256_lzcnt_epi32(t);
-	t = _mm256_srli_epi32(t, 3);
-	quad_ofs = _mm256_sub_epi32(four_32, t);
-
-	/* blend DFA and QUAD/SINGLE. */
-	t = _mm256_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
-
-	/* calculate address for next transitions. */
-	addr = _mm256_add_epi32(addr, t);
-	return addr;
-}
-
-/*
- * Process 16 transitions in parallel.
- * tr_lo contains low 32 bits for 16 transition.
- * tr_hi contains high 32 bits for 16 transition.
- * next_input contains up to 4 input bytes for 16 flows.
+ * Gather 4/1 input bytes for up to 8 (2*4) locations in parallel.
  */
 static __rte_always_inline __m256i
-transition8(__m256i next_input, const uint64_t *trans, __m256i *tr_lo,
-	__m256i *tr_hi)
-{
-	const int32_t *tr;
-	__m256i addr;
-
-	tr = (const int32_t *)(uintptr_t)trans;
-
-	/* Calculate the address (array index) for all 8 transitions. */
-	addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
-		ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi);
-
-	/* load lower 32 bits of 16 transactions at once. */
-	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
-
-	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
-
-	/* load high 32 bits of 16 transactions at once. */
-	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
-
-	return next_input;
-}
-
-/*
- * Execute first transition for up to 16 flows in parallel.
- * next_input should contain one input byte for up to 16 flows.
- * msk - mask of active flows.
- * tr_lo contains low 32 bits for up to 16 transitions.
- * tr_hi contains high 32 bits for up to 16 transitions.
- */
-static __rte_always_inline void
-first_trans8(const struct acl_flow_avx512 *flow, __m256i next_input,
-	__mmask8 msk, __m256i *tr_lo, __m256i *tr_hi)
+_F_(gather_bytes)(__m256i zero, const __m256i p[2], const uint32_t m[2],
+	uint32_t bnum)
 {
-	const int32_t *tr;
-	__m256i addr, root;
-
-	tr = (const int32_t *)(uintptr_t)flow->trans;
-
-	addr = _mm256_set1_epi32(UINT8_MAX);
-	root = _mm256_set1_epi32(flow->root_index);
-
-	addr = _mm256_and_si256(next_input, addr);
-	addr = _mm256_add_epi32(root, addr);
-
-	/* load lower 32 bits of 16 transactions at once. */
-	*tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr,
-		sizeof(flow->trans[0]));
-
-	/* load high 32 bits of 16 transactions at once. */
-	*tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
-		sizeof(flow->trans[0]));
-}
-
-/*
- * Load and return next 4 input bytes for up to 16 flows in parallel.
- * pdata - 8x2 pointers to flow input data
- * mask - mask of active flows.
- * di - data indexes for these 16 flows.
- */
-static inline __m256i
-get_next_bytes_avx512x8(const struct acl_flow_avx512 *flow, __m256i pdata[2],
-	uint32_t msk, __m256i *di, uint32_t bnum)
-{
-	const int32_t *div;
-	uint32_t m[2];
-	__m256i one, zero, t, p[2];
 	__m128i inp[2];
 
-	div = (const int32_t *)flow->data_index;
-
-	one = _mm256_set1_epi32(1);
-	zero = _mm256_xor_si256(one, one);
-
-	/* load data offsets for given indexes */
-	t = _mm256_mmask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
-
-	/* increment data indexes */
-	*di = _mm256_mask_add_epi32(*di, msk, *di, one);
-
-	/*
-	 * unsigned expand 32-bit indexes to 64-bit
-	 * (for later pointer arithmetic), i.e:
-	 * for (i = 0; i != 16; i++)
-	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
-	 */
-	p[0] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[0].y, t);
-	p[1] = _mm256_maskz_permutexvar_epi32(ymm_pmidx_msk, ymm_pmidx[1].y, t);
-
-	p[0] = _mm256_add_epi64(p[0], pdata[0]);
-	p[1] = _mm256_add_epi64(p[1], pdata[1]);
-
-	/* load input byte(s), either one or four */
-
-	m[0] = msk & YMM_PTR_MSK;
-	m[1] = msk >> YMM_PTR_NUM;
-
 	if (bnum == sizeof(uint8_t)) {
 		inp[0] = _m256_mask_gather_epi8x4(p[0], m[0]);
 		inp[1] = _m256_mask_gather_epi8x4(p[1], m[1]);
 	} else {
 		inp[0] = _mm256_mmask_i64gather_epi32(
-				_mm256_castsi256_si128(zero), m[0], p[0],
-				NULL, sizeof(uint8_t));
+				_mm256_castsi256_si128(zero),
+				m[0], p[0], NULL, sizeof(uint8_t));
 		inp[1] = _mm256_mmask_i64gather_epi32(
-				_mm256_castsi256_si128(zero), m[1], p[1],
-				NULL, sizeof(uint8_t));
+				_mm256_castsi256_si128(zero),
+				m[1], p[1], NULL, sizeof(uint8_t));
 	}
 
-	/* squeeze input into one 512-bit register */
+	/* squeeze input into one 256-bit register */
 	return _mm256_permutex2var_epi32(_mm256_castsi128_si256(inp[0]),
-			ymm_pminp.y,  _mm256_castsi128_si256(inp[1]));
-}
-
-/*
- * Start up to 16 new flows.
- * num - number of flows to start
- * msk - mask of new flows.
- * pdata - pointers to flow input data
- * idx - match indexed for given flows
- * di - data indexes for these flows.
- */
-static inline void
-start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
-	__m256i pdata[2], __m256i *idx, __m256i *di)
-{
-	uint32_t n, m[2], nm[2];
-	__m256i ni, nd[2];
-
-	m[0] = msk & YMM_PTR_MSK;
-	m[1] = msk >> YMM_PTR_NUM;
-
-	n = __builtin_popcount(m[0]);
-	nm[0] = (1 << n) - 1;
-	nm[1] = (1 << (num - n)) - 1;
-
-	/* load input data pointers for new flows */
-	nd[0] = _mm256_maskz_loadu_epi64(nm[0],
-		flow->idata + flow->num_packets);
-	nd[1] = _mm256_maskz_loadu_epi64(nm[1],
-		flow->idata + flow->num_packets + n);
-
-	/* calculate match indexes of new flows */
-	ni = _mm256_set1_epi32(flow->num_packets);
-	ni = _mm256_add_epi32(ni, ymm_idx_add.y);
-
-	/* merge new and existing flows data */
-	pdata[0] = _mm256_mask_expand_epi64(pdata[0], m[0], nd[0]);
-	pdata[1] = _mm256_mask_expand_epi64(pdata[1], m[1], nd[1]);
-
-	/* update match and data indexes */
-	*idx = _mm256_mask_expand_epi32(*idx, msk, ni);
-	*di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di);
-
-	flow->num_packets += num;
-}
-
-/*
- * Process found matches for up to 16 flows.
- * fmsk - mask of active flows
- * rmsk - mask of found matches
- * pdata - pointers to flow input data
- * di - data indexes for these flows
- * idx - match indexed for given flows
- * tr_lo contains low 32 bits for up to 8 transitions.
- * tr_hi contains high 32 bits for up to 8 transitions.
- */
-static inline uint32_t
-match_process_avx512x8(struct acl_flow_avx512 *flow, uint32_t *fmsk,
-	uint32_t *rmsk, __m256i pdata[2], __m256i *di, __m256i *idx,
-	__m256i *tr_lo, __m256i *tr_hi)
-{
-	uint32_t n;
-	__m256i res;
-
-	if (rmsk[0] == 0)
-		return 0;
-
-	/* extract match indexes */
-	res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y);
-
-	/* mask  matched transitions to nop */
-	tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y);
-	tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y);
-
-	/* save found match indexes */
-	_mm256_mask_i32scatter_epi32(flow->matches, rmsk[0],
-		idx[0], res, sizeof(flow->matches[0]));
-
-	/* update masks and start new flows for matches */
-	n = update_flow_mask(flow, fmsk, rmsk);
-	start_flow8(flow, n, rmsk[0], pdata, idx, di);
-
-	return n;
-}
-
-/*
- * Test for matches ut to 32 (2x16) flows at once,
- * if matches exist - process them and start new flows.
- */
-static inline void
-match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
-	__m256i pdata[4], __m256i di[2], __m256i idx[2], __m256i inp[2],
-	__m256i tr_lo[2], __m256i tr_hi[2])
-{
-	uint32_t n[2];
-	uint32_t rm[2];
-
-	/* check for matches */
-	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
-	rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y);
-
-	/* till unprocessed matches exist */
-	while ((rm[0] | rm[1]) != 0) {
-
-		/* process matches and start new flows */
-		n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0],
-			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
-		n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[2],
-			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
-
-		/* execute first transition for new flows, if any */
-
-		if (n[0] != 0) {
-			inp[0] = get_next_bytes_avx512x8(flow, &pdata[0],
-				rm[0], &di[0], flow->first_load_sz);
-			first_trans8(flow, inp[0], rm[0], &tr_lo[0],
-				&tr_hi[0]);
-			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
-				ymm_match_mask.y);
-		}
-
-		if (n[1] != 0) {
-			inp[1] = get_next_bytes_avx512x8(flow, &pdata[2],
-				rm[1], &di[1], flow->first_load_sz);
-			first_trans8(flow, inp[1], rm[1], &tr_lo[1],
-				&tr_hi[1]);
-			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
-				ymm_match_mask.y);
-		}
-	}
-}
-
-/*
- * Perform search for up to 32 flows in parallel.
- * Use two sets of metadata, each serves 16 flows max.
- * So in fact we perform search for 2x16 flows.
- */
-static inline void
-search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
-{
-	uint32_t fm[2];
-	__m256i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
-
-	/* first 1B load */
-	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]);
-	start_flow8(flow, MASK8_BIT, UINT8_MAX, &pdata[2], &idx[1], &di[1]);
-
-	in[0] = get_next_bytes_avx512x8(flow, &pdata[0], UINT8_MAX, &di[0],
-			flow->first_load_sz);
-	in[1] = get_next_bytes_avx512x8(flow, &pdata[2], UINT8_MAX, &di[1],
-			flow->first_load_sz);
-
-	first_trans8(flow, in[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
-	first_trans8(flow, in[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
-
-	fm[0] = UINT8_MAX;
-	fm[1] = UINT8_MAX;
-
-	/* match check */
-	match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in,
-		tr_lo, tr_hi);
-
-	while ((fm[0] | fm[1]) != 0) {
-
-		/* load next 4B */
-
-		in[0] = get_next_bytes_avx512x8(flow, &pdata[0], fm[0],
-			&di[0], sizeof(uint32_t));
-		in[1] = get_next_bytes_avx512x8(flow, &pdata[2], fm[1],
-			&di[1], sizeof(uint32_t));
-
-		/* main 4B loop */
-
-		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		in[0] = transition8(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
-		in[1] = transition8(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
-
-		/* check for matches */
-		match_check_process_avx512x8x2(flow, fm, pdata, di, idx, in,
-			tr_lo, tr_hi);
-	}
-}
-
-/*
- * resolve match index to actual result/priority offset.
- */
-static inline __m256i
-resolve_match_idx_avx512x8(__m256i mi)
-{
-	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
-		1 << (match_log + 2));
-	return _mm256_slli_epi32(mi, match_log);
+			_SV_(pminp), _mm256_castsi128_si256(inp[1]));
 }
 
-/*
- * Resolve multiple matches for the same flow based on priority.
- */
-static inline __m256i
-resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
-	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
-	uint32_t nb_skip)
-{
-	uint32_t i;
-	const uint32_t *pm;
-	__mmask16 m;
-	__m256i cp, cr, np, nr, mch;
-
-	const __m256i zero = _mm256_set1_epi32(0);
-
-	/* get match indexes */
-	mch = _mm256_maskz_loadu_epi32(msk, match);
-	mch = resolve_match_idx_avx512x8(mch);
-
-	/* read result and priority values for first trie */
-	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
-	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
-
-	/*
-	 * read result and priority values for next tries and select one
-	 * with highest priority.
-	 */
-	for (i = 1, pm = match + nb_skip; i != nb_trie;
-			i++, pm += nb_skip) {
-
-		mch = _mm256_maskz_loadu_epi32(msk, pm);
-		mch = resolve_match_idx_avx512x8(mch);
-
-		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
-			sizeof(res[0]));
-		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
-			sizeof(pri[0]));
-
-		m = _mm256_cmpgt_epi32_mask(cp, np);
-		cr = _mm256_mask_mov_epi32(nr, m, cr);
-		cp = _mm256_mask_mov_epi32(np, m, cp);
-	}
-
-	return cr;
-}
-
-/*
- * Resolve num (<= 8) matches for single category
- */
-static inline void
-resolve_sc_avx512x8(uint32_t result[], const int32_t res[],
-	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
-	uint32_t nb_trie, uint32_t nb_skip)
-{
-	__mmask8 msk;
-	__m256i cr;
-
-	msk = (1 << nb_pkt) - 1;
-	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
-	_mm256_mask_storeu_epi32(result, msk, cr);
-}
+#include "acl_run_avx512_common.h"
 
 /*
- * Resolve matches for single category
+ * Perform search for up to (2 * 8) flows in parallel.
+ * Use two sets of metadata, each serving up to 8 flows.
  */
-static inline void
-resolve_sc_avx512x8x2(uint32_t result[],
-	const struct rte_acl_match_results pr[], const uint32_t match[],
-	uint32_t nb_pkt, uint32_t nb_trie)
-{
-	uint32_t j, k, n;
-	const int32_t *res, *pri;
-	__m256i cr[2];
-
-	res = (const int32_t *)pr->results;
-	pri = pr->priority;
-
-	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
-
-		j = k + MASK8_BIT;
-
-		cr[0] = resolve_pri_avx512x8(res, pri, match + k, UINT8_MAX,
-				nb_trie, nb_pkt);
-		cr[1] = resolve_pri_avx512x8(res, pri, match + j, UINT8_MAX,
-				nb_trie, nb_pkt);
-
-		_mm256_storeu_si256((void *)(result + k), cr[0]);
-		_mm256_storeu_si256((void *)(result + j), cr[1]);
-	}
-
-	n = nb_pkt - k;
-	if (n != 0) {
-		if (n > MASK8_BIT) {
-			resolve_sc_avx512x8(result + k, res, pri, match + k,
-				MASK8_BIT, nb_trie, nb_pkt);
-			k += MASK8_BIT;
-			n -= MASK8_BIT;
-		}
-		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
-				nb_trie, nb_pkt);
-	}
-}
-
 static inline int
 search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t total_packets, uint32_t categories)
@@ -624,7 +219,7 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
 
 		/* process the trie */
-		search_trie_avx512x8x2(&flow);
+		_F_(search_trie)(&flow);
 	}
 
 	/* resolve matches */
@@ -632,7 +227,7 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		(ctx->trans_table + ctx->match_index);
 
 	if (categories == 1)
-		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+		_F_(resolve_single_cat)(results, pr, match, total_packets,
 			ctx->num_tries);
 	else
 		resolve_mcle8_avx512x1(results, pr, match, total_packets,
@@ -640,3 +235,19 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 
 	return 0;
 }
+
+#undef _SIMD_PTR_MSK_
+#undef _SIMD_PTR_NUM_
+#undef _SIMD_FLOW_MSK_
+#undef _SIMD_FLOW_NUM_
+#undef _SIMD_MASK_MAX_
+#undef _SIMD_MASK_BIT_
+#undef _M_GI_
+#undef _M_MGI_
+#undef _M_SI_
+#undef _M_I_
+#undef _F_
+#undef _SV_
+#undef _SC_
+#undef _T_mask
+#undef _T_simd
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread
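
The deduplication approach in the patch above is essentially a poor-man's template: each per-width header defines a small macro vocabulary (_T_simd, _F_(), _SC_(), _M_SI_() and friends), includes "acl_run_avx512_common.h" — written once purely in terms of those macros — and then undefines everything so the other width can be instantiated from the same source in the same compilation unit. Below is a minimal, self-contained sketch of that idea; the file and function names (simd_common.h, and_mask) are illustrative stand-ins, not the actual ACL code, and it needs AVX2/AVX512F enabled at build time (e.g. -mavx2 -mavx512f):

/* simd_common.h - generic code, written once in terms of the macros.
 * Deliberately has no include guard: it is meant to be included
 * once per SIMD width. */
static inline _T_simd
_F_(and_mask)(_T_simd a, _T_simd b)
{
	return _M_SI_(and)(a, b);
}

/* user.c */
#include <immintrin.h>

#define _T_simd		__m256i
#define _F_(x)		x##_x8
#define _M_SI_(x)	_mm256_##x##_si256
#include "simd_common.h"	/* emits and_mask_x8() using 256-bit ops */
#undef _M_SI_
#undef _F_
#undef _T_simd

#define _T_simd		__m512i
#define _F_(x)		x##_x16
#define _M_SI_(x)	_mm512_##x##_si512
#include "simd_common.h"	/* emits and_mask_x16() using 512-bit ops */
#undef _M_SI_
#undef _F_
#undef _T_simd

One definition thus yields both and_mask_x8() and and_mask_x16(), which is what the patch does for the real search and resolve routines.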

* [dpdk-dev] [PATCH v4 13/14] test/acl: add AVX512 classify support
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (11 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 12/14] acl: deduplicate AVX512 code paths Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-14 10:26         ` David Marchand
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 14/14] app/acl: " Konstantin Ananyev
  2020-10-14 12:40       ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods David Marchand
  14 siblings, 1 reply; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add AVX512 classify to the test coverage.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test/test_acl.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 333b347579..5b32347954 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -278,8 +278,8 @@ test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 
 	/* set given classify alg, skip test if alg is not supported */
 	ret = rte_acl_set_ctx_classify(acx, alg);
-	if (ret == -ENOTSUP)
-		return 0;
+	if (ret != 0)
+		return (ret == -ENOTSUP) ? 0 : ret;
 
 	/**
 	 * these will run quite a few times, it's necessary to test code paths
@@ -341,6 +341,8 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 		RTE_ACL_CLASSIFY_AVX2,
 		RTE_ACL_CLASSIFY_NEON,
 		RTE_ACL_CLASSIFY_ALTIVEC,
+		RTE_ACL_CLASSIFY_AVX512X16,
+		RTE_ACL_CLASSIFY_AVX512X32,
 	};
 
 	/* swap all bytes in the data to network order */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread
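
The -ENOTSUP handling exercised by the test above is also the pattern an application would follow when opting into the new algorithms: try the preferred method and fall back if the running CPU or the build does not support it. A rough usage sketch, assuming an already built ACL context; the helper name setup_classify_alg() is hypothetical and error handling is trimmed:

#include <errno.h>
#include <rte_common.h>
#include <rte_acl.h>

/* Pick the widest supported classify method, falling back gracefully. */
static int
setup_classify_alg(struct rte_acl_ctx *acx)
{
	static const enum rte_acl_classify_alg pref[] = {
		RTE_ACL_CLASSIFY_AVX512X32,
		RTE_ACL_CLASSIFY_AVX512X16,
		RTE_ACL_CLASSIFY_AVX2,
		RTE_ACL_CLASSIFY_DEFAULT,
	};
	uint32_t i;
	int ret;

	for (i = 0; i != RTE_DIM(pref); i++) {
		ret = rte_acl_set_ctx_classify(acx, pref[i]);
		if (ret != -ENOTSUP)
			return ret;	/* 0 on success, other errors reported as-is */
	}
	return -ENOTSUP;
}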

* [dpdk-dev] [PATCH v4 14/14] app/acl: add AVX512 classify support
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (12 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 13/14] test/acl: add AVX512 classify support Konstantin Ananyev
@ 2020-10-06 15:03       ` Konstantin Ananyev
  2020-10-14 12:40       ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods David Marchand
  14 siblings, 0 replies; 70+ messages in thread
From: Konstantin Ananyev @ 2020-10-06 15:03 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add ability to use AVX512 classify method.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d9b65517cb..2a3a35a054 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -81,6 +81,14 @@ static const struct acl_alg acl_alg[] = {
 		.name = "altivec",
 		.alg = RTE_ACL_CLASSIFY_ALTIVEC,
 	},
+	{
+		.name = "avx512x16",
+		.alg = RTE_ACL_CLASSIFY_AVX512X16,
+	},
+	{
+		.name = "avx512x32",
+		.alg = RTE_ACL_CLASSIFY_AVX512X32,
+	},
 };
 
 static struct {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods
  2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                       ` (14 preceding siblings ...)
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
@ 2020-10-06 15:05     ` David Marchand
  2020-10-06 16:07       ` Ananyev, Konstantin
  15 siblings, 1 reply; 70+ messages in thread
From: David Marchand @ 2020-10-06 15:05 UTC (permalink / raw)
  To: Konstantin Ananyev
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Vladimir Medvedkin, Thomas Monjalon, Ray Kinsella,
	Bruce Richardson

On Mon, Oct 5, 2020 at 9:44 PM Konstantin Ananyev
<konstantin.ananyev@intel.com> wrote:
>
> These patch series introduce support of AVX512 specific classify
> implementation for ACL library.
> It adds two new algorithms:
>  - RTE_ACL_CLASSIFY_AVX512X16 - can process up to 16 flows in parallel.
>    It uses 256-bit width instructions/registers only
>    (to avoid frequency level change).
>    On my SKX box test-acl shows ~15-30% improvement
>    (depending on rule-set and input burst size)
>    when switching from AVX2 to AVX512X16 classify algorithms.
>  - RTE_ACL_CLASSIFY_AVX512X32 - can process up to 32 flows in parallel.
>    It uses 512-bit width instructions/registers and provides higher
>    performance then AVX512X16, but can cause frequency level change.
>    On my SKX box test-acl shows ~50-70% improvement
>    (depending on rule-set and input burst size)
>    when switching from AVX2 to AVX512X32 classify algorithms.
>    ICX and CLX testing showed similar level of speedup.
>
> Current AVX512 classify implementation is only supported on x86_64.
> Note that this series introduce a formal ABI incompatibility

The only API change I can see is in rte_acl_classify_alg() new error
code but I don't think we need an announcement for this.
As for ABI, we are breaking it in this release, so I see no pb.


> with previous versions of ACL library.
>
> v2 -> v3:
>   Fix checkpatch warnings
>   Split AVX512 algorithm into two and deduplicate common code

Patch 7 still references a RTE_MACHINE_CPUFLAG flag.
Can you rework now that those flags have been dropped?

Thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods
  2020-10-06 15:05     ` [dpdk-dev] [PATCH v3 " David Marchand
@ 2020-10-06 16:07       ` Ananyev, Konstantin
  2020-10-08 10:49         ` David Marchand
  2020-10-14  9:23         ` Kinsella, Ray
  0 siblings, 2 replies; 70+ messages in thread
From: Ananyev, Konstantin @ 2020-10-06 16:07 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Medvedkin, Vladimir, Thomas Monjalon, Ray Kinsella, Richardson,
	Bruce


> 
> On Mon, Oct 5, 2020 at 9:44 PM Konstantin Ananyev
> <konstantin.ananyev@intel.com> wrote:
> >
> > These patch series introduce support of AVX512 specific classify
> > implementation for ACL library.
> > It adds two new algorithms:
> >  - RTE_ACL_CLASSIFY_AVX512X16 - can process up to 16 flows in parallel.
> >    It uses 256-bit width instructions/registers only
> >    (to avoid frequency level change).
> >    On my SKX box test-acl shows ~15-30% improvement
> >    (depending on rule-set and input burst size)
> >    when switching from AVX2 to AVX512X16 classify algorithms.
> >  - RTE_ACL_CLASSIFY_AVX512X32 - can process up to 32 flows in parallel.
> >    It uses 512-bit width instructions/registers and provides higher
> >    performance then AVX512X16, but can cause frequency level change.
> >    On my SKX box test-acl shows ~50-70% improvement
> >    (depending on rule-set and input burst size)
> >    when switching from AVX2 to AVX512X32 classify algorithms.
> >    ICX and CLX testing showed similar level of speedup.
> >
> > Current AVX512 classify implementation is only supported on x86_64.
> > Note that this series introduce a formal ABI incompatibility
> 
> The only API change I can see is in rte_acl_classify_alg() new error
> code but I don't think we need an announcement for this.
> As for ABI, we are breaking it in this release, so I see no pb.

Cool, I just wanted to underline that patch #3:
https://patches.dpdk.org/patch/79786/
is a formal ABI breakage.

> 
> 
> > with previous versions of ACL library.
> >
> > v2 -> v3:
> >   Fix checkpatch warnings
> >   Split AVX512 algorithm into two and deduplicate common code
> 
> Patch 7 still references a RTE_MACHINE_CPUFLAG flag.
> Can you rework now that those flags have been dropped?
> 

Should be fixed in v4:
https://patches.dpdk.org/project/dpdk/list/?series=12721

One more thing to mention - this series has a dependency on Vladimir's patch:
https://patches.dpdk.org/patch/79310/ ("eal/x86: introduce AVX 512-bit type"),
so CI/travis would still report an error.

Thanks
Konstantin


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods
  2020-10-06 16:07       ` Ananyev, Konstantin
@ 2020-10-08 10:49         ` David Marchand
  2020-10-14  9:23         ` Kinsella, Ray
  1 sibling, 0 replies; 70+ messages in thread
From: David Marchand @ 2020-10-08 10:49 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Medvedkin, Vladimir, Thomas Monjalon, Ray Kinsella, Richardson,
	Bruce

On Tue, Oct 6, 2020 at 6:13 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
> > > v2 -> v3:
> > >   Fix checkpatch warnings
> > >   Split AVX512 algorithm into two and deduplicate common code
> >
> > Patch 7 still references a RTE_MACHINE_CPUFLAG flag.
> > Can you rework now that those flags have been dropped?
> >
>
> Should be fixed in v4:
> https://patches.dpdk.org/project/dpdk/list/?series=12721

Thanks.


>
> One more thing to mention - this series has a dependency on Vladimir's patch:
> https://patches.dpdk.org/patch/79310/ ("eal/x86: introduce AVX 512-bit type"),
> so CI/travis would still report an error.

Yes, I am looking at this patch and I think I will take it alone so
that we remove the dependency between those series.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [dpdk-stable] [PATCH v4 01/14] acl: fix x86 build when compiler doesn't support AVX2
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 01/14] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
@ 2020-10-08 13:42         ` David Marchand
  0 siblings, 0 replies; 70+ messages in thread
From: David Marchand @ 2020-10-08 13:42 UTC (permalink / raw)
  To: Konstantin Ananyev
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Vladimir Medvedkin, dpdk stable

On Tue, Oct 6, 2020 at 5:08 PM Konstantin Ananyev
<konstantin.ananyev@intel.com> wrote:
>
> Right now we define dummy version of rte_acl_classify_avx2()
> when both X86 and AVX2 are not detected, though it should be
> for non-AVX2 case only.
>
> Fixes: e53ce4e41379 ("acl: remove use of weak functions")
> Cc: stable@dpdk.org
>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_acl/rte_acl.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
> index 777ec4d340..715b023592 100644
> --- a/lib/librte_acl/rte_acl.c
> +++ b/lib/librte_acl/rte_acl.c
> @@ -16,7 +16,6 @@ static struct rte_tailq_elem rte_acl_tailq = {
>  };
>  EAL_REGISTER_TAILQ(rte_acl_tailq)
>
> -#ifndef RTE_ARCH_X86
>  #ifndef CC_AVX2_SUPPORT
>  /*
>   * If the compiler doesn't support AVX2 instructions,
> @@ -33,6 +32,7 @@ rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
>  }
>  #endif
>
> +#ifndef RTE_ARCH_X86
>  int
>  rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
>         __rte_unused const uint8_t **data,
> --
> 2.17.1
>

Reviewed-by: David Marchand <david.marchand@redhat.com>


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 02/14] doc: fix missing classify methods in ACL guide
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 02/14] doc: fix missing classify methods in ACL guide Konstantin Ananyev
@ 2020-10-08 13:42         ` David Marchand
  0 siblings, 0 replies; 70+ messages in thread
From: David Marchand @ 2020-10-08 13:42 UTC (permalink / raw)
  To: Konstantin Ananyev
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Vladimir Medvedkin, dpdk stable

On Tue, Oct 6, 2020 at 5:09 PM Konstantin Ananyev
<konstantin.ananyev@intel.com> wrote:
>
> Add brief description for missing ACL classify algorithms:
> RTE_ACL_CLASSIFY_NEON and RTE_ACL_CLASSIFY_ALTIVEC.
>
> Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8")
> Fixes: 1d73135f9f1c ("acl: add AltiVec for ppc64")
> Cc: stable@dpdk.org
>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  doc/guides/prog_guide/packet_classif_access_ctrl.rst | 6 ++++++
>  1 file changed, 6 insertions(+)
>
> diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
> index 0345512b9e..daf03e6d7a 100644
> --- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
> +++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
> @@ -373,6 +373,12 @@ There are several implementations of classify algorithm:
>
>  *   **RTE_ACL_CLASSIFY_AVX2**: vector implementation, can process up to 16 flows in parallel. Requires AVX2 support.
>
> +*   **RTE_ACL_CLASSIFY_NEON**: vector implementation, can process up to 8 flows
> +    in parallel. Requires NEON support.
> +
> +*   **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8
> +    flows in parallel. Requires ALTIVEC support.
> +
>  It is purely a runtime decision which method to choose, there is no build-time difference.
>  All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
>  At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation.
> --
> 2.17.1
>

Reviewed-by: David Marchand <david.marchand@redhat.com>


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 07/14] acl: add infrastructure to support AVX512 classify
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 07/14] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
@ 2020-10-13 19:17         ` David Marchand
  2020-10-13 22:26           ` Ananyev, Konstantin
  0 siblings, 1 reply; 70+ messages in thread
From: David Marchand @ 2020-10-13 19:17 UTC (permalink / raw)
  To: Konstantin Ananyev, Bruce Richardson
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Vladimir Medvedkin

On Tue, Oct 6, 2020 at 5:11 PM Konstantin Ananyev
<konstantin.ananyev@intel.com> wrote:
> diff --git a/config/x86/meson.build b/config/x86/meson.build
> index fea4d54035..724e69f4c4 100644
> --- a/config/x86/meson.build
> +++ b/config/x86/meson.build
> @@ -22,7 +22,8 @@ foreach f:base_flags
>  endforeach
>
>  optional_flags = ['AES', 'PCLMUL',
> -               'AVX', 'AVX2', 'AVX512F',
> +               'AVX', 'AVX2',
> +               'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW',
>                 'RDRND', 'RDSEED']

Rebasing on current main and resolving the conflict with the net crc patches.

I am for sorting this alphabetically as the current order seems chaotic.
The diff in this patch against origin/main would be:

-optional_flags = ['AES', 'PCLMUL',
-        'AVX', 'AVX2', 'AVX512F',
-        'RDRND', 'RDSEED',
-        'AVX512BW', 'AVX512DQ',
-        'AVX512VL', 'VPCLMULQDQ']
+optional_flags = [
+        'AES',
+        'AVX',
+        'AVX2',
+        'AVX512BW',
+        'AVX512CD',
+        'AVX512DQ',
+        'AVX512F',
+        'AVX512VL',
+        'PCLMUL',
+        'RDRND',
+        'RDSEED',
+        'VPCLMULQDQ',
+]

Objection?


>  foreach f:optional_flags
>         if cc.get_define('__@0@__'.format(f), args: machine_args) == '1'

Thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 07/14] acl: add infrastructure to support AVX512 classify
  2020-10-13 19:17         ` David Marchand
@ 2020-10-13 22:26           ` Ananyev, Konstantin
  0 siblings, 0 replies; 70+ messages in thread
From: Ananyev, Konstantin @ 2020-10-13 22:26 UTC (permalink / raw)
  To: David Marchand, Richardson, Bruce
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Medvedkin, Vladimir



> On Tue, Oct 6, 2020 at 5:11 PM Konstantin Ananyev
> <konstantin.ananyev@intel.com> wrote:
> > diff --git a/config/x86/meson.build b/config/x86/meson.build
> > index fea4d54035..724e69f4c4 100644
> > --- a/config/x86/meson.build
> > +++ b/config/x86/meson.build
> > @@ -22,7 +22,8 @@ foreach f:base_flags
> >  endforeach
> >
> >  optional_flags = ['AES', 'PCLMUL',
> > -               'AVX', 'AVX2', 'AVX512F',
> > +               'AVX', 'AVX2',
> > +               'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW',
> >                 'RDRND', 'RDSEED']
> 
> Rebasing on current main and resolving the conflict with the net crc patches.
> 
> I am for sorting this alphabetically as the current order seems chaotic.
> The diff in this patch against origin/main would be:
> 
> -optional_flags = ['AES', 'PCLMUL',
> -        'AVX', 'AVX2', 'AVX512F',
> -        'RDRND', 'RDSEED',
> -        'AVX512BW', 'AVX512DQ',
> -        'AVX512VL', 'VPCLMULQDQ']
> +optional_flags = [
> +        'AES',
> +        'AVX',
> +        'AVX2',
> +        'AVX512BW',
> +        'AVX512CD',
> +        'AVX512DQ',
> +        'AVX512F',
> +        'AVX512VL',
> +        'PCLMUL',
> +        'RDRND',
> +        'RDSEED',
> +        'VPCLMULQDQ',
> +]
> 
> Objection?

None 😊
Thanks
Konstantin

> 
> 
> >  foreach f:optional_flags
> >         if cc.get_define('__@0@__'.format(f), args: machine_args) == '1'
> 
> Thanks.
> 
> 
> --
> David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods
  2020-10-06 16:07       ` Ananyev, Konstantin
  2020-10-08 10:49         ` David Marchand
@ 2020-10-14  9:23         ` Kinsella, Ray
  1 sibling, 0 replies; 70+ messages in thread
From: Kinsella, Ray @ 2020-10-14  9:23 UTC (permalink / raw)
  To: Ananyev, Konstantin, David Marchand
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Medvedkin, Vladimir, Thomas Monjalon, Richardson, Bruce



On 06/10/2020 17:07, Ananyev, Konstantin wrote:
> 
>>
>> On Mon, Oct 5, 2020 at 9:44 PM Konstantin Ananyev
>> <konstantin.ananyev@intel.com> wrote:
>>>
>>> These patch series introduce support of AVX512 specific classify
>>> implementation for ACL library.
>>> It adds two new algorithms:
>>>  - RTE_ACL_CLASSIFY_AVX512X16 - can process up to 16 flows in parallel.
>>>    It uses 256-bit width instructions/registers only
>>>    (to avoid frequency level change).
>>>    On my SKX box test-acl shows ~15-30% improvement
>>>    (depending on rule-set and input burst size)
>>>    when switching from AVX2 to AVX512X16 classify algorithms.
>>>  - RTE_ACL_CLASSIFY_AVX512X32 - can process up to 32 flows in parallel.
>>>    It uses 512-bit width instructions/registers and provides higher
>>>    performance then AVX512X16, but can cause frequency level change.
>>>    On my SKX box test-acl shows ~50-70% improvement
>>>    (depending on rule-set and input burst size)
>>>    when switching from AVX2 to AVX512X32 classify algorithms.
>>>    ICX and CLX testing showed similar level of speedup.
>>>
>>> Current AVX512 classify implementation is only supported on x86_64.
>>> Note that this series introduce a formal ABI incompatibility
>>
>> The only API change I can see is in rte_acl_classify_alg() new error
>> code but I don't think we need an announcement for this.
>> As for ABI, we are breaking it in this release, so I see no pb.
> 
> Cool, I just wanted to underline that patch #3:
> https://patches.dpdk.org/patch/79786/
> is a formal ABI breakage.

As David said, this is an ABI breaking release - so there is no requirement to maintain compatibility. 

https://doc.dpdk.org/guides/contributing/abi_policy.html

However the following requirements remain:-

* The acknowledgment of the maintainer of the component is mandatory, or if no maintainer is available for the component, the tree/sub-tree maintainer for that component must acknowledge the ABI change instead.
* The acknowledgment of three members of the technical board, as delegates of the technical board acknowledging the need for the ABI change, is also mandatory.

I guess you are the maintainer in this case, so that requirement is satisfied. 

> 
>>
>>
>>> with previous versions of ACL library.
>>>
>>> v2 -> v3:
>>>   Fix checkpatch warnings
>>>   Split AVX512 algorithm into two and deduplicate common code
>>
>> Patch 7 still references a RTE_MACHINE_CPUFLAG flag.
>> Can you rework now that those flags have been dropped?
>>
> 
> Should be fixed in v4:
> https://patches.dpdk.org/project/dpdk/list/?series=12721
> 
> One more thing to mention - this series has a dependency on Vladimir's patch:
> https://patches.dpdk.org/patch/79310/ ("eal/x86: introduce AVX 512-bit type"),
> so CI/travis would still report an error.
> 
> Thanks
> Konstantin
> 

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 13/14] test/acl: add AVX512 classify support
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 13/14] test/acl: add AVX512 classify support Konstantin Ananyev
@ 2020-10-14 10:26         ` David Marchand
  2020-10-14 10:32           ` Ananyev, Konstantin
  0 siblings, 1 reply; 70+ messages in thread
From: David Marchand @ 2020-10-14 10:26 UTC (permalink / raw)
  To: Konstantin Ananyev
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Vladimir Medvedkin

On Tue, Oct 6, 2020 at 5:13 PM Konstantin Ananyev
<konstantin.ananyev@intel.com> wrote:
>
> Add AVX512 classify to the test coverage.
>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  app/test/test_acl.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/app/test/test_acl.c b/app/test/test_acl.c
> index 333b347579..5b32347954 100644
> --- a/app/test/test_acl.c
> +++ b/app/test/test_acl.c
> @@ -278,8 +278,8 @@ test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
>
>         /* set given classify alg, skip test if alg is not supported */
>         ret = rte_acl_set_ctx_classify(acx, alg);
> -       if (ret == -ENOTSUP)
> -               return 0;
> +       if (ret != 0)
> +               return (ret == -ENOTSUP) ? 0 : ret;

Does this really belong to this patch?
I would expect it in "test/acl: expand classify test coverage".


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 13/14] test/acl: add AVX512 classify support
  2020-10-14 10:26         ` David Marchand
@ 2020-10-14 10:32           ` Ananyev, Konstantin
  2020-10-14 10:35             ` David Marchand
  0 siblings, 1 reply; 70+ messages in thread
From: Ananyev, Konstantin @ 2020-10-14 10:32 UTC (permalink / raw)
  To: David Marchand
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Medvedkin, Vladimir



> 
> On Tue, Oct 6, 2020 at 5:13 PM Konstantin Ananyev
> <konstantin.ananyev@intel.com> wrote:
> >
> > Add AVX512 classify to the test coverage.
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > ---
> >  app/test/test_acl.c | 6 ++++--
> >  1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/app/test/test_acl.c b/app/test/test_acl.c
> > index 333b347579..5b32347954 100644
> > --- a/app/test/test_acl.c
> > +++ b/app/test/test_acl.c
> > @@ -278,8 +278,8 @@ test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
> >
> >         /* set given classify alg, skip test if alg is not supported */
> >         ret = rte_acl_set_ctx_classify(acx, alg);
> > -       if (ret == -ENOTSUP)
> > -               return 0;
> > +       if (ret != 0)
> > +               return (ret == -ENOTSUP) ? 0 : ret;
> 
> Does this really belong to this patch?
> I would expect it in "test/acl: expand classify test coverage".

Yes, you're right.
Do you want me to re-spin?
Konstantin


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 13/14] test/acl: add AVX512 classify support
  2020-10-14 10:32           ` Ananyev, Konstantin
@ 2020-10-14 10:35             ` David Marchand
  0 siblings, 0 replies; 70+ messages in thread
From: David Marchand @ 2020-10-14 10:35 UTC (permalink / raw)
  To: Ananyev, Konstantin
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Medvedkin, Vladimir

On Wed, Oct 14, 2020 at 12:32 PM Ananyev, Konstantin
<konstantin.ananyev@intel.com> wrote:
> > On Tue, Oct 6, 2020 at 5:13 PM Konstantin Ananyev
> > <konstantin.ananyev@intel.com> wrote:
> > >
> > > Add AVX512 classify to the test coverage.
> > >
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > ---
> > >  app/test/test_acl.c | 6 ++++--
> > >  1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/app/test/test_acl.c b/app/test/test_acl.c
> > > index 333b347579..5b32347954 100644
> > > --- a/app/test/test_acl.c
> > > +++ b/app/test/test_acl.c
> > > @@ -278,8 +278,8 @@ test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
> > >
> > >         /* set given classify alg, skip test if alg is not supported */
> > >         ret = rte_acl_set_ctx_classify(acx, alg);
> > > -       if (ret == -ENOTSUP)
> > > -               return 0;
> > > +       if (ret != 0)
> > > +               return (ret == -ENOTSUP) ? 0 : ret;
> >
> > Does this really belong to this patch?
> > I would expect it in "test/acl: expand classify test coverage".
>
> Yes, you're right.
> Do you want me to re-spin?

No need for this, thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods
  2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
                         ` (13 preceding siblings ...)
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 14/14] app/acl: " Konstantin Ananyev
@ 2020-10-14 12:40       ` David Marchand
  14 siblings, 0 replies; 70+ messages in thread
From: David Marchand @ 2020-10-14 12:40 UTC (permalink / raw)
  To: Konstantin Ananyev
  Cc: dev, Jerin Jacob Kollanukkaran,
	Ruifeng Wang (Arm Technology China),
	Vladimir Medvedkin

On Tue, Oct 6, 2020 at 5:08 PM Konstantin Ananyev
<konstantin.ananyev@intel.com> wrote:
>
> These patch series introduce support of AVX512 specific classify
> implementation for ACL library.
> It adds two new algorithms:
>  - RTE_ACL_CLASSIFY_AVX512X16 - can process up to 16 flows in parallel.
>    It uses 256-bit width instructions/registers only
>    (to avoid frequency level change).
>    On my SKX box test-acl shows ~15-30% improvement
>    (depending on rule-set and input burst size)
>    when switching from AVX2 to AVX512X16 classify algorithms.
>  - RTE_ACL_CLASSIFY_AVX512X32 - can process up to 32 flows in parallel.
>    It uses 512-bit width instructions/registers and provides higher
>    performance then AVX512X16, but can cause frequency level change.
>    On my SKX box test-acl shows ~50-70% improvement
>    (depending on rule-set and input burst size)
>    when switching from AVX2 to AVX512X32 classify algorithms.
>    ICX and CLX testing showed similar level of speedup.

Series applied, thanks.


-- 
David Marchand


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 12/14] acl: deduplicate AVX512 code paths
  2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 12/14] acl: deduplicate AVX512 code paths Konstantin Ananyev
@ 2020-10-16 15:56         ` Ferruh Yigit
  2020-10-16 16:20           ` Thomas Monjalon
  0 siblings, 1 reply; 70+ messages in thread
From: Ferruh Yigit @ 2020-10-16 15:56 UTC (permalink / raw)
  To: Konstantin Ananyev, dev
  Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Thomas Monjalon,
	David Marchand

On 10/6/2020 4:03 PM, Konstantin Ananyev wrote:
> Current rte_acl_classify_avx512x32() and rte_acl_classify_avx512x16()
> code paths are very similar. The only differences are due to
> 256/512-bit register/intrinsics naming conventions.
> So to deduplicate the code:
>    - Move common code into “acl_run_avx512_common.h”
>    - Use macros to hide difference in naming conventions
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
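
As a rough illustration of the deduplication technique described above (the
macro and helper names below are invented for illustration, not the ones in
acl_run_avx512_common.h): the shared code is written once against generic
names, and each width-specific header defines the mapping before including
the common file.

#include <immintrin.h>

/* 256-bit flavour: defined before including the common header */
#define _T_simd		__m256i
#define _T_mask		__mmask8
#define _SV_(op)	_mm256_ ## op
/* the 512-bit flavour would instead map to __m512i, __mmask16
 * and _mm512_ ## op
 */

/* common code uses only the generic names, so the same body
 * compiles for both register widths
 */
static inline _T_simd
acl_add_indices(_T_simd idx, _T_simd ofs)
{
	return _SV_(add_epi32)(idx, ofs);
}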

<...>

> @@ -120,7 +161,7 @@ _m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask)
>   
>   	static const uint32_t zero;

icc complains about this, although it is static [1].
Would it be acceptable to initialize the variable explicitly to '0'?


[1]
In file included from ../lib/librte_acl/acl_run_avx512.c(110):
../lib/librte_acl/acl_run_avx512x8.h(162): warning #300: const variable "zero" 
requires an initializer
         static const uint32_t zero;
                                  ^

In file included from ../lib/librte_acl/acl_run_avx512.c(137):
../lib/librte_acl/acl_run_avx512x16.h(198): warning #300: const variable "zero" 
requires an initializer
         static const uint32_t zero;
                                  ^
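
For reference, the two forms under discussion are semantically identical in
C: objects with static storage duration are zero-initialized even without an
initializer, so spelling out the '= 0' would only silence the icc diagnostic.

/* current form: implicitly zero-initialized, icc emits warning #300 */
static const uint32_t zero;

/* suggested form: same value, explicit initializer keeps icc quiet */
static const uint32_t zero = 0;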

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [dpdk-dev] [PATCH v4 12/14] acl: deduplicate AVX512 code paths
  2020-10-16 15:56         ` Ferruh Yigit
@ 2020-10-16 16:20           ` Thomas Monjalon
  0 siblings, 0 replies; 70+ messages in thread
From: Thomas Monjalon @ 2020-10-16 16:20 UTC (permalink / raw)
  To: Ferruh Yigit
  Cc: Konstantin Ananyev, dev, jerinj, ruifeng.wang,
	vladimir.medvedkin, David Marchand, olivier.matz

16/10/2020 17:56, Ferruh Yigit:
> On 10/6/2020 4:03 PM, Konstantin Ananyev wrote:
> > @@ -120,7 +161,7 @@ _m256_mask_gather_epi8x4(__m256i pdata, __mmask8 mask)
> >   
> >   	static const uint32_t zero;
> 
> icc complains about this, although it is static [1].
> Would it be acceptable to initialize variable explicitly to '0'?
> 
> [1]
> In file included from ../lib/librte_acl/acl_run_avx512.c(110):
> ../lib/librte_acl/acl_run_avx512x8.h(162): warning #300: const variable "zero" 
> requires an initializer
>          static const uint32_t zero;

icc should be fixed in my opinion

It's already hard to keep the code warning-free for gcc and clang;
I don't think we should bother with warnings from exotic compilers.
It is not a blocking issue, right?



^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2020-10-16 16:55 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 2/7] app/acl: few small improvements Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 3/7] acl: remove of unused enum value Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 4/7] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 5/7] app/acl: add AVX512 classify support Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 6/7] acl: introduce AVX512 classify implementation Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 7/7] acl: enhance " Konstantin Ananyev
2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 02/12] doc: fix mixing classify methods in ACL guide Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value Konstantin Ananyev
2020-09-27  3:27     ` Ruifeng Wang
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 04/12] acl: remove library constructor Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 05/12] app/acl: few small improvements Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 06/12] test/acl: expand classify test coverage Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
2020-09-16  9:11     ` Bruce Richardson
2020-09-16  9:36       ` Medvedkin, Vladimir
2020-09-16  9:49         ` Bruce Richardson
2020-09-16 10:06           ` Ananyev, Konstantin
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 08/12] acl: introduce AVX512 classify implementation Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 09/12] acl: enhance " Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 10/12] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 11/12] test/acl: add AVX512 classify support Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 12/12] app/acl: " Konstantin Ananyev
2020-10-05 18:45   ` [dpdk-dev] [PATCH v3 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 01/14] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 02/14] doc: fix missing classify methods in ACL guide Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 03/14] acl: remove of unused enum value Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 04/14] acl: remove library constructor Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 05/14] app/acl: few small improvements Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 06/14] test/acl: expand classify test coverage Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 07/14] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 08/14] acl: introduce 256-bit width AVX512 classify implementation Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 09/14] acl: update default classify algorithm selection Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 10/14] acl: introduce 512-bit width AVX512 classify implementation Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 11/14] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 12/14] acl: deduplicate AVX512 code paths Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 13/14] test/acl: add AVX512 classify support Konstantin Ananyev
2020-10-05 18:45     ` [dpdk-dev] [PATCH v3 14/14] app/acl: " Konstantin Ananyev
2020-10-06 15:03     ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods Konstantin Ananyev
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 01/14] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
2020-10-08 13:42         ` [dpdk-dev] [dpdk-stable] " David Marchand
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 02/14] doc: fix missing classify methods in ACL guide Konstantin Ananyev
2020-10-08 13:42         ` David Marchand
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 03/14] acl: remove of unused enum value Konstantin Ananyev
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 04/14] acl: remove library constructor Konstantin Ananyev
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 05/14] app/acl: few small improvements Konstantin Ananyev
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 06/14] test/acl: expand classify test coverage Konstantin Ananyev
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 07/14] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
2020-10-13 19:17         ` David Marchand
2020-10-13 22:26           ` Ananyev, Konstantin
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 08/14] acl: introduce 256-bit width AVX512 classify implementation Konstantin Ananyev
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 09/14] acl: update default classify algorithm selection Konstantin Ananyev
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 10/14] acl: introduce 512-bit width AVX512 classify implementation Konstantin Ananyev
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 11/14] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 12/14] acl: deduplicate AVX512 code paths Konstantin Ananyev
2020-10-16 15:56         ` Ferruh Yigit
2020-10-16 16:20           ` Thomas Monjalon
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 13/14] test/acl: add AVX512 classify support Konstantin Ananyev
2020-10-14 10:26         ` David Marchand
2020-10-14 10:32           ` Ananyev, Konstantin
2020-10-14 10:35             ` David Marchand
2020-10-06 15:03       ` [dpdk-dev] [PATCH v4 14/14] app/acl: " Konstantin Ananyev
2020-10-14 12:40       ` [dpdk-dev] [PATCH v4 00/14] acl: introduce AVX512 classify methods David Marchand
2020-10-06 15:05     ` [dpdk-dev] [PATCH v3 " David Marchand
2020-10-06 16:07       ` Ananyev, Konstantin
2020-10-08 10:49         ` David Marchand
2020-10-14  9:23         ` Kinsella, Ray
