DPDK patches and discussions
* [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method
@ 2020-08-07 16:28 Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

This patch series introduces support for an AVX512-specific classify
implementation in the ACL library.
It contains two code paths:
the first uses mostly 256-bit instructions/registers and can
process up to 16 flows in parallel;
the second uses 512-bit instructions/registers in most places and
can process up to 32 flows in parallel.
The code-path selection is done internally, based on the input burst
size, and is totally opaque to the user.
On my SKX box, test-acl shows a ~20-65% improvement
(depending on the rule-set and input burst size)
when switching from the AVX2 to the AVX512 classify algorithm.
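
For reference, a minimal sketch of how an application would opt in to
the new method through the existing public API once the series is
applied (error handling trimmed; RTE_ACL_CLASSIFY_AVX512 is added in
patch 4/7):

#include <rte_acl.h>

static int
classify_with_avx512(struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num)
{
	/* request the AVX512 method; fails if it is not available */
	int rc = rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX512);

	if (rc != 0)
		return rc;

	/* the 256-bit vs 512-bit path is then picked from 'num' */
	return rte_acl_classify(ctx, data, results, num, 1);
}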

Note that this change introduces a formal ABI incompatibility
with previous versions of the ACL library.

TODO list:
- Deduplicate 8/16 code paths
- Update default algorithm selection
- Update docs

This patch series depends on
https://patches.dpdk.org/patch/70429/
being applied first.

Konstantin Ananyev (7):
  acl: fix x86 build when compiler doesn't support AVX2
  app/acl: few small improvements
  acl: remove unused enum value
  acl: add infrastructure to support AVX512 classify
  app/acl: add AVX512 classify support
  acl: introduce AVX512 classify implementation
  acl: enhance AVX512 classify implementation

 app/test-acl/main.c                |  19 +-
 config/x86/meson.build             |   3 +-
 lib/librte_acl/Makefile            |  26 ++
 lib/librte_acl/acl.h               |   4 +
 lib/librte_acl/acl_run_avx512.c    | 140 +++++++
 lib/librte_acl/acl_run_avx512x16.h | 635 +++++++++++++++++++++++++++++
 lib/librte_acl/acl_run_avx512x8.h  | 614 ++++++++++++++++++++++++++++
 lib/librte_acl/meson.build         |  39 ++
 lib/librte_acl/rte_acl.c           |  19 +-
 lib/librte_acl/rte_acl.h           |   2 +-
 10 files changed, 1493 insertions(+), 8 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

-- 
2.17.1



* [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 2/7] app/acl: few small improvements Konstantin Ananyev
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Right now the dummy version of rte_acl_classify_avx2() is defined
only when neither x86 nor AVX2 support is detected, though it should
be defined whenever AVX2 support is absent.
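
The resulting guard layout in rte_acl.c (a condensed sketch of the
state after this patch; the diff below shows the actual change):

#ifndef CC_AVX2_SUPPORT
/* compiler cannot emit AVX2: dummy reports -ENOTSUP, even on x86 */
int
rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
	__rte_unused const uint8_t **data, __rte_unused uint32_t *results,
	__rte_unused uint32_t num, __rte_unused uint32_t categories)
{
	return -ENOTSUP;
}
#endif

#ifndef RTE_ARCH_X86
/* non-x86 target: the SSE dummy stays under the arch guard */
int
rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
	__rte_unused const uint8_t **data, __rte_unused uint32_t *results,
	__rte_unused uint32_t num, __rte_unused uint32_t categories)
{
	return -ENOTSUP;
}
#endif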

Fixes: e53ce4e41379 ("acl: remove use of weak functions")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 777ec4d34..715b02359 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,7 +16,6 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
-#ifndef RTE_ARCH_X86
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -33,6 +32,7 @@ rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
 }
 #endif
 
+#ifndef RTE_ARCH_X86
 int
 rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
 	__rte_unused const uint8_t **data,
-- 
2.17.1



* [dpdk-dev] [PATCH 20.11 2/7] app/acl: few small improvements
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 3/7] acl: remove unused enum value Konstantin Ananyev
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

- enhance output to print extra stats
- use rte_rdtsc_precise() for cycle measurements
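
In isolation, the measurement pattern the patch switches to (a sketch
of the relevant lines; rte_rdtsc_precise() issues a barrier before
reading the TSC, which gives more stable numbers):

	uint64_t start, tm;
	long double sec;

	start = rte_rdtsc_precise();
	/* ... run the classify iterations ... */
	tm = rte_rdtsc_precise() - start;

	/* convert cycles to seconds using the TSC frequency */
	sec = (long double)tm / rte_get_timer_hz();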

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index 0a5dfb621..d9b65517c 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -862,9 +862,10 @@ search_ip5tuples(__rte_unused void *arg)
 {
 	uint64_t pkt, start, tm;
 	uint32_t i, lcore;
+	long double st;
 
 	lcore = rte_lcore_id();
-	start = rte_rdtsc();
+	start = rte_rdtsc_precise();
 	pkt = 0;
 
 	for (i = 0; i != config.iter_num; i++) {
@@ -872,12 +873,16 @@ search_ip5tuples(__rte_unused void *arg)
 			config.trace_step, config.alg.name);
 	}
 
-	tm = rte_rdtsc() - start;
+	tm = rte_rdtsc_precise() - start;
+
+	st = (long double)tm / rte_get_timer_hz();
 	dump_verbose(DUMP_NONE, stdout,
 		"%s  @lcore %u: %" PRIu32 " iterations, %" PRIu64 " pkts, %"
-		PRIu32 " categories, %" PRIu64 " cycles, %#Lf cycles/pkt\n",
-		__func__, lcore, i, pkt, config.run_categories,
-		tm, (pkt == 0) ? 0 : (long double)tm / pkt);
+		PRIu32 " categories, %" PRIu64 " cycles (%.2Lf sec), "
+		"%.2Lf cycles/pkt, %.2Lf pkt/sec\n",
+		__func__, lcore, i, pkt,
+		config.run_categories, tm, st,
+		(pkt == 0) ? 0 : (long double)tm / pkt, pkt / st);
 
 	return 0;
 }
-- 
2.17.1



* [dpdk-dev] [PATCH 20.11 3/7] acl: remove unused enum value
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 2/7] app/acl: few small improvements Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 4/7] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Remove the unused enum value RTE_ACL_CLASSIFY_NUM.
This value is not used inside DPDK, while it prevents adding
new classify algorithms without causing an ABI breakage.

Note that this change introduces a formal ABI incompatibility
with previous versions of the ACL library.
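
For illustration, a hypothetical application-side check (alg_is_known()
is not a DPDK function) that bakes the sentinel's numeric value into
the binary; appending a new algorithm before the sentinel would change
RTE_ACL_CLASSIFY_NUM and hence what such a check accepts, which is why
the value set of this enum is part of the ABI:

	static int
	alg_is_known(enum rte_acl_classify_alg alg)
	{
		/* old binaries keep the old value of the sentinel */
		return alg < RTE_ACL_CLASSIFY_NUM;
	}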

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.h | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index aa22e70c6..b814423a6 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,7 +241,6 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
-	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
 };
 
 /**
-- 
2.17.1



* [dpdk-dev] [PATCH 20.11 4/7] acl: add infrastructure to support AVX512 classify
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (2 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 3/7] acl: remove unused enum value Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 5/7] app/acl: add AVX512 classify support Konstantin Ananyev
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add the necessary changes to support the new AVX512-specific ACL
classify algorithm:
 - changes in meson.build and Makefile to check that the build tools
   (compiler, assembler, etc.) properly support AVX512.
 - a dummy rte_acl_classify_avx512() for targets where the AVX512
   implementation cannot be properly supported.
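
Relatedly, a sketch of the runtime check that an application (or the
library's default-selection logic, still on the cover-letter TODO
list) could perform before picking the AVX512 method; the flag names
come from rte_cpuflags.h:

	#include <rte_cpuflags.h>

	/* all four ISA extensions used by the new code paths */
	static int
	avx512_acl_supported(void)
	{
		return rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) &&
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) &&
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW);
	}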

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 config/x86/meson.build          |  3 ++-
 lib/librte_acl/Makefile         | 26 ++++++++++++++++++++++
 lib/librte_acl/acl.h            |  4 ++++
 lib/librte_acl/acl_run_avx512.c | 17 ++++++++++++++
 lib/librte_acl/meson.build      | 39 +++++++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c        | 17 ++++++++++++++
 lib/librte_acl/rte_acl.h        |  1 +
 7 files changed, 106 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c

diff --git a/config/x86/meson.build b/config/x86/meson.build
index 6ec020ef6..c5626e914 100644
--- a/config/x86/meson.build
+++ b/config/x86/meson.build
@@ -23,7 +23,8 @@ foreach f:base_flags
 endforeach
 
 optional_flags = ['AES', 'PCLMUL',
-		'AVX', 'AVX2', 'AVX512F',
+		'AVX', 'AVX2',
+		'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW',
 		'RDRND', 'RDSEED']
 foreach f:optional_flags
 	if cc.get_define('__@0@__'.format(f), args: machine_args) == '1'
diff --git a/lib/librte_acl/Makefile b/lib/librte_acl/Makefile
index f4332b044..8bd469c2b 100644
--- a/lib/librte_acl/Makefile
+++ b/lib/librte_acl/Makefile
@@ -58,6 +58,32 @@ ifeq ($(CC_AVX2_SUPPORT), 1)
 	CFLAGS_rte_acl.o += -DCC_AVX2_SUPPORT
 endif
 
+# compile AVX512 version if:
+# we are building 64-bit binary AND binutils can generate proper code
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+
+	BINUTIL_OK=$(shell AS=as; \
+		$(RTE_SDK)/buildtools/binutils-avx512-check.sh && \
+		echo 1)
+	ifeq ($(BINUTIL_OK), 1)
+
+		# If the compiler supports AVX512 instructions,
+		# then add support for AVX512 classify method.
+
+		CC_AVX512_FLAGS=$(shell $(CC) \
+		-mavx512f -mavx512vl -mavx512cd -mavx512bw \
+		-dM -E - </dev/null 2>&1 | grep AVX512 | wc -l)
+		ifeq ($(CC_AVX512_FLAGS), 4)
+			CFLAGS_acl_run_avx512.o += -mavx512f
+			CFLAGS_acl_run_avx512.o += -mavx512vl
+			CFLAGS_acl_run_avx512.o += -mavx512cd
+			CFLAGS_acl_run_avx512.o += -mavx512bw
+			SRCS-$(CONFIG_RTE_LIBRTE_ACL) += acl_run_avx512.c
+			CFLAGS_rte_acl.o += -DCC_AVX512_SUPPORT
+		endif
+	endif
+endif
+
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include := rte_acl_osdep.h
 SYMLINK-$(CONFIG_RTE_LIBRTE_ACL)-include += rte_acl.h
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 39d45a0c2..2022cf253 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -201,6 +201,10 @@ int
 rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
+int
+rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
 int
 rte_acl_classify_neon(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
new file mode 100644
index 000000000..67274989d
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "acl_run_sse.h"
+
+int
+rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (num >= MAX_SEARCHES_SSE8)
+		return search_sse_8(ctx, data, results, num, categories);
+	if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+
+	return rte_acl_classify_scalar(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/meson.build b/lib/librte_acl/meson.build
index d1e2c184c..b2fd61cad 100644
--- a/lib/librte_acl/meson.build
+++ b/lib/librte_acl/meson.build
@@ -27,6 +27,45 @@ if dpdk_conf.has('RTE_ARCH_X86')
 		cflags += '-DCC_AVX2_SUPPORT'
 	endif
 
+	# compile AVX512 version if:
+	# we are building 64-bit binary AND binutils can generate proper code
+
+	if dpdk_conf.has('RTE_ARCH_X86_64') and binutils_ok.returncode() == 0
+
+		# compile AVX512 version if either:
+		# a. we have AVX512 supported in minimum instruction set
+		#    baseline
+		# b. it's not minimum instruction set, but supported by
+		#    compiler
+		#
+		# in former case, just add avx512 C file to files list
+		# in latter case, compile c file to static lib, using correct
+		# compiler flags, and then have the .o file from static lib
+		# linked into main lib.
+
+		if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512VL') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512CD') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512BW')
+
+			sources += files('acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+
+		elif cc.has_multi_arguments('-mavx512f', '-mavx512vl',
+					'-mavx512cd', '-mavx512bw')
+
+			avx512_tmplib = static_library('avx512_tmp',
+				'acl_run_avx512.c',
+				dependencies: static_rte_eal,
+				c_args: cflags +
+					['-mavx512f', '-mavx512vl',
+					 '-mavx512cd', '-mavx512bw'])
+			objs += avx512_tmplib.extract_objects(
+					'acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+		endif
+	endif
+
 elif dpdk_conf.has('RTE_ARCH_ARM') or dpdk_conf.has('RTE_ARCH_ARM64')
 	cflags += '-flax-vector-conversions'
 	sources += files('acl_run_neon.c')
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 715b02359..71b4afb08 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,6 +16,22 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
+#ifndef CC_AVX512_SUPPORT
+/*
+ * If the compiler doesn't support AVX512 instructions,
+ * then the dummy one would be used instead for AVX512 classify method.
+ */
+int
+rte_acl_classify_avx512(__rte_unused const struct rte_acl_ctx *ctx,
+	__rte_unused const uint8_t **data,
+	__rte_unused uint32_t *results,
+	__rte_unused uint32_t num,
+	__rte_unused uint32_t categories)
+{
+	return -ENOTSUP;
+}
+#endif
+
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -77,6 +93,7 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
 	[RTE_ACL_CLASSIFY_NEON] = rte_acl_classify_neon,
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
+	[RTE_ACL_CLASSIFY_AVX512] = rte_acl_classify_avx512,
 };
 
 /* by default, use always available scalar code path. */
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index b814423a6..6f39042fc 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,6 +241,7 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
+	RTE_ACL_CLASSIFY_AVX512 = 6,    /**< requires AVX512 support. */
 };
 
 /**
-- 
2.17.1



* [dpdk-dev] [PATCH 20.11 5/7] app/acl: add AVX512 classify support
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (3 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 4/7] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 6/7] acl: introduce AVX512 classify implementation Konstantin Ananyev
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add the ability to use the AVX512 classify method.
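
For reference, a sketch of how such a table entry gets consumed when
mapping a user-supplied algorithm name to an enum value
(parse_alg_name() is illustrative, not the app's actual helper):

	static int
	parse_alg_name(const char *name, enum rte_acl_classify_alg *alg)
	{
		uint32_t i;

		/* linear scan of the acl_alg[] table shown below */
		for (i = 0; i != RTE_DIM(acl_alg); i++) {
			if (strcmp(name, acl_alg[i].name) == 0) {
				*alg = acl_alg[i].alg;
				return 0;
			}
		}
		return -EINVAL;
	}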

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d9b65517c..19b714335 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -81,6 +81,10 @@ static const struct acl_alg acl_alg[] = {
 		.name = "altivec",
 		.alg = RTE_ACL_CLASSIFY_ALTIVEC,
 	},
+	{
+		.name = "avx512",
+		.alg = RTE_ACL_CLASSIFY_AVX512,
+	},
 };
 
 static struct {
-- 
2.17.1



* [dpdk-dev] [PATCH 20.11 6/7] acl: introduce AVX512 classify implementation
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (4 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 5/7] app/acl: add AVX512 classify support Konstantin Ananyev
@ 2020-08-07 16:28 ` Konstantin Ananyev
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 7/7] acl: enhance " Konstantin Ananyev
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
  7 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add search_avx512x8x2(), which mostly uses 256-bit
registers/instructions and is able to process up to 16 flows in
parallel.
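
For orientation, a scalar sketch of the single-category resolution
that resolve_sc_avx512x8x2() below vectorizes: per packet, walk the
match indexes of all tries (stored trie-major, nb_trie * nb_pkt
entries) and keep the highest-priority one. Illustrative code, not
part of the patch; field names follow the library's
struct rte_acl_match_results:

	static void
	resolve_single_cat_scalar(uint32_t result[],
		const struct rte_acl_match_results pr[],
		const uint32_t match[], uint32_t nb_pkt, uint32_t nb_trie)
	{
		uint32_t i, k, m, best;

		for (k = 0; k != nb_pkt; k++) {
			best = match[k];
			for (i = 1; i != nb_trie; i++) {
				m = match[i * nb_pkt + k];
				/* on a tie the later trie wins, as in
				 * the vector code */
				if (pr[m].priority[0] >= pr[best].priority[0])
					best = m;
			}
			result[k] = pr[best].results[0];
		}
	}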

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl_run_avx512.c   | 120 ++++++
 lib/librte_acl/acl_run_avx512x8.h | 614 ++++++++++++++++++++++++++++++
 2 files changed, 734 insertions(+)
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 67274989d..8ee996679 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -4,10 +4,130 @@
 
 #include "acl_run_sse.h"
 
+/* sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results) */
+static const uint32_t match_log = 5;
+
+struct acl_flow_avx512 {
+	uint32_t num_packets;       /* number of packets processed */
+	uint32_t total_packets;     /* max number of packets to process */
+	uint32_t root_index;        /* current root index */
+	const uint64_t *trans;      /* transition table */
+	const uint32_t *data_index; /* input data indexes */
+	const uint8_t **idata;      /* input data */
+	uint32_t *matches;          /* match indexes */
+};
+
+static inline void
+acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
+	uint32_t trie, const uint8_t *data[], uint32_t *matches,
+	uint32_t total_packets)
+{
+	flow->num_packets = 0;
+	flow->total_packets = total_packets;
+	flow->root_index = ctx->trie[trie].root_index;
+	flow->trans = ctx->trans_table;
+	flow->data_index = ctx->trie[trie].data_index;
+	flow->idata = data;
+	flow->matches = matches;
+}
+
+/*
+ * Resolve matches for multiple categories (LE 8, use 128b instructions/regs)
+ */
+static inline void
+resolve_mcle8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, j, k, mi, mn;
+	__mmask8 msk;
+	xmm_t cp, cr, np, nr;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		for (j = 0; j != nb_cat; j += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			cr = _mm_loadu_si128((const xmm_t *)(res + mi + j));
+			cp = _mm_loadu_si128((const xmm_t *)(pri + mi + j));
+
+			for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+				mn = j + (pm[k] << match_log);
+
+				nr = _mm_loadu_si128((const xmm_t *)(res + mn));
+				np = _mm_loadu_si128((const xmm_t *)(pri + mn));
+
+				msk = _mm_cmpgt_epi32_mask(cp, np);
+				cr = _mm_mask_mov_epi32(nr, msk, cr);
+				cp = _mm_mask_mov_epi32(np, msk, cp);
+			}
+
+			_mm_storeu_si128((xmm_t *)(result + j), cr);
+		}
+	}
+}
+
+/*
+ * Resolve matches for multiple categories (GT 8, use 512b instructions/regs)
+ */
+static inline void
+resolve_mcgt8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, k, mi;
+	__mmask16 cm, sm;
+	__m512i cp, cr, np, nr;
+
+	const uint32_t match_log = 5;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	cm = (1 << nb_cat) - 1;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		cr = _mm512_maskz_loadu_epi32(cm, res + mi);
+		cp = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mi = pm[k] << match_log;
+
+			nr = _mm512_maskz_loadu_epi32(cm, res + mi);
+			np = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+			sm = _mm512_cmpgt_epi32_mask(cp, np);
+			cr = _mm512_mask_mov_epi32(nr, sm, cr);
+			cp = _mm512_mask_mov_epi32(np, sm, cp);
+		}
+
+		_mm512_mask_storeu_epi32(result, cm, cr);
+	}
+}
+
+#include "acl_run_avx512x8.h"
+
 int
 rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	if (num >= MAX_SEARCHES_AVX16)
+		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
 		return search_sse_8(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE4)
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
new file mode 100644
index 000000000..63b1d872f
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -0,0 +1,614 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define NUM_AVX512X8X2	(2 * CHAR_BIT)
+#define MSK_AVX512X8X2	(NUM_AVX512X8X2 - 1)
+
+static const rte_ymm_t ymm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_ymm_t ymm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_ymm_t ymm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+	},
+};
+
+static const rte_ymm_t ymm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const rte_ymm_t ymm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const rte_ymm_t ymm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const rte_ymm_t ymm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+	},
+};
+
+static const rte_ymm_t ymm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same ways as quad range nodes.
+ */
+static __rte_always_inline ymm_t
+calc_addr8(ymm_t index_mask, ymm_t next_input, ymm_t shuffle_input,
+	ymm_t four_32, ymm_t range_base, ymm_t tr_lo, ymm_t tr_hi)
+{
+	ymm_t addr, in, node_type, r, t;
+	ymm_t dfa_msk, dfa_ofs, quad_ofs;
+
+	t = _mm256_xor_si256(index_mask, index_mask);
+	in = _mm256_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm256_andnot_si256(index_mask, tr_lo);
+	addr = _mm256_and_si256(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm256_cmpeq_epi32(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm256_srli_epi32(in, 30);
+	r = _mm256_add_epi8(r, range_base);
+	t = _mm256_srli_epi32(in, 24);
+	r = _mm256_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm256_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	t = _mm256_cmpgt_epi8(in, tr_hi);
+	t = _mm256_lzcnt_epi32(t);
+	t = _mm256_srli_epi32(t, 3);
+	quad_ofs = _mm256_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm256_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
+
+	/* calculate address for next transitions. */
+	addr = _mm256_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 8 transitions in parallel.
+ * tr_lo contains low 32 bits for 8 transitions.
+ * tr_hi contains high 32 bits for 8 transitions.
+ * next_input contains up to 4 input bytes for 8 flows.
+ */
+static __rte_always_inline ymm_t
+transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 8 transitions. */
+	addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
+		ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 8 transitions at once. */
+	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
+
+	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 8 transitions at once. */
+	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 8 flows in parallel.
+ * next_input should contain one input byte for up to 8 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static __rte_always_inline void
+first_trans8(const struct acl_flow_avx512 *flow, ymm_t next_input,
+	__mmask8 msk, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm256_set1_epi32(UINT8_MAX);
+	root = _mm256_set1_epi32(flow->root_index);
+
+	addr = _mm256_and_si256(next_input, addr);
+	addr = _mm256_add_epi32(root, addr);
+
+	/* load lower 32 bits of 8 transitions at once. */
+	*tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 8 transitions at once. */
+	*tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to 8 flows in parallel.
+ * pdata - 8 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these 8 flows.
+ */
+static inline ymm_t
+get_next_4bytes_avx512x8(const struct acl_flow_avx512 *flow, __m512i pdata,
+	__mmask8 mask, ymm_t *di)
+{
+	const int32_t *div;
+	ymm_t one, zero;
+	ymm_t inp, t;
+	__m512i p;
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm256_set1_epi32(1);
+	zero = _mm256_xor_si256(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm256_mmask_i32gather_epi32(zero, mask, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm256_mask_add_epi32(*di, mask, *di, one);
+
+	p = _mm512_cvtepu32_epi64(t);
+	p = _mm512_add_epi64(p, pdata);
+
+	/* load input bytes */
+	inp = _mm512_mask_i64gather_epi32(zero, mask, p, NULL, sizeof(uint8_t));
+	return inp;
+}
+
+/*
+ * Start up to 8 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i *pdata, ymm_t *idx, ymm_t *di)
+{
+	uint32_t nm;
+	ymm_t ni;
+	__m512i nd;
+
+	/* load input data pointers for new flows */
+	nm = (1 << num) - 1;
+	nd = _mm512_maskz_loadu_epi64(nm, flow->idata + flow->num_packets);
+
+	/* calculate match indexes of new flows */
+	ni = _mm256_set1_epi32(flow->num_packets);
+	ni = _mm256_add_epi32(ni, ymm_idx_add.y);
+
+	/* merge new and existing flows data */
+	*pdata = _mm512_mask_expand_epi64(*pdata, msk, nd);
+	*idx = _mm256_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Update flow and result masks based on the number of unprocessed flows.
+ */
+static inline uint32_t
+update_flow_mask8(const struct acl_flow_avx512 *flow, __mmask8 *fmsk,
+	__mmask8 *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
+/*
+ * Process found matches for up to 8 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexes for given flows
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static inline uint32_t
+match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
+	__mmask8 *rmsk,	__m512i *pdata, ymm_t *di, ymm_t *idx,
+	ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	uint32_t n;
+	ymm_t res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y);
+
+	/* mask matched transitions to nop */
+	tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y);
+	tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y);
+
+	/* save found match indexes */
+	_mm256_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask8(flow, fmsk, rmsk);
+	start_flow8(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+
+static inline void
+match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, __mmask8 fm[2],
+	__m512i pdata[2], ymm_t di[2], ymm_t idx[2], ymm_t inp[2],
+	ymm_t tr_lo[2], ymm_t tr_hi[2])
+{
+	uint32_t n[2];
+	__mmask8 rm[2];
+
+	/* check for matches */
+	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
+	rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[1],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_4bytes_avx512x8(flow, pdata[0], rm[0],
+				&di[0]);
+			first_trans8(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]);
+
+			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
+				ymm_match_mask.y);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_4bytes_avx512x8(flow, pdata[1], rm[1],
+				&di[1]);
+			first_trans8(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]);
+
+			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
+				ymm_match_mask.y);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 16 flows in parallel.
+ * Use two sets of metadata, each serves 8 flows max.
+ * So in fact we perform search for 2x8 flows.
+ */
+static inline void
+search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
+{
+	__mmask8 fm[2];
+	__m512i pdata[2];
+	ymm_t di[2], idx[2], inp[2], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[1], &idx[1], &di[1]);
+
+	inp[0] = get_next_4bytes_avx512x8(flow, pdata[0], UINT8_MAX, &di[0]);
+	inp[1] = get_next_4bytes_avx512x8(flow, pdata[1], UINT8_MAX, &di[1]);
+
+	first_trans8(flow, inp[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans8(flow, inp[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT8_MAX;
+	fm[1] = UINT8_MAX;
+
+	/* match check */
+	match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		inp[0] = get_next_4bytes_avx512x8(flow, pdata[0], fm[0],
+			&di[0]);
+		inp[1] = get_next_4bytes_avx512x8(flow, pdata[1], fm[1],
+			&di[1]);
+
+		/* main 4B loop */
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline ymm_t
+resolve_match_idx_avx512x8(ymm_t mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm256_slli_epi32(mi, match_log);
+}
+
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline ymm_t
+resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask8 m;
+	ymm_t cp, cr, np, nr, mch;
+
+	const ymm_t zero = _mm256_set1_epi32(0);
+
+	mch = _mm256_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x8(mch);
+
+	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm256_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x8(mch);
+
+		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm256_cmpgt_epi32_mask(cp, np);
+		cr = _mm256_mask_mov_epi32(nr, m, cr);
+		cp = _mm256_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 8) matches for single category
+ */
+static inline void
+resolve_sc_avx512x8(uint32_t result[], const int32_t res[], const int32_t pri[],
+	const uint32_t match[], uint32_t nb_pkt, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	__mmask8 msk;
+	ymm_t cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
+	_mm256_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x8x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t i, j, k, n;
+	const uint32_t *pm;
+	const int32_t *res, *pri;
+	__mmask8 m[2];
+	ymm_t cp[2], cr[2], np[2], nr[2], mch[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
+
+		j = k + CHAR_BIT;
+
+		/* load match indexes for first trie */
+		mch[0] = _mm256_loadu_si256((const ymm_t *)(match + k));
+		mch[1] = _mm256_loadu_si256((const ymm_t *)(match + j));
+
+		mch[0] = resolve_match_idx_avx512x8(mch[0]);
+		mch[1] = resolve_match_idx_avx512x8(mch[1]);
+
+		/* load matches and their priorities for first trie */
+
+		cr[0] = _mm256_i32gather_epi32(res, mch[0], sizeof(res[0]));
+		cr[1] = _mm256_i32gather_epi32(res, mch[1], sizeof(res[0]));
+
+		cp[0] = _mm256_i32gather_epi32(pri, mch[0], sizeof(pri[0]));
+		cp[1] = _mm256_i32gather_epi32(pri, mch[1], sizeof(pri[0]));
+
+		/* select match with highest priority */
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mch[0] = _mm256_loadu_si256((const ymm_t *)(pm + k));
+			mch[1] = _mm256_loadu_si256((const ymm_t *)(pm + j));
+
+			mch[0] = resolve_match_idx_avx512x8(mch[0]);
+			mch[1] = resolve_match_idx_avx512x8(mch[1]);
+
+			nr[0] = _mm256_i32gather_epi32(res, mch[0],
+				sizeof(res[0]));
+			nr[1] = _mm256_i32gather_epi32(res, mch[1],
+				sizeof(res[0]));
+
+			np[0] = _mm256_i32gather_epi32(pri, mch[0],
+				sizeof(pri[0]));
+			np[1] = _mm256_i32gather_epi32(pri, mch[1],
+				sizeof(pri[0]));
+
+			m[0] = _mm256_cmpgt_epi32_mask(cp[0], np[0]);
+			m[1] = _mm256_cmpgt_epi32_mask(cp[1], np[1]);
+
+			cr[0] = _mm256_mask_mov_epi32(nr[0], m[0], cr[0]);
+			cr[1] = _mm256_mask_mov_epi32(nr[1], m[1], cr[1]);
+
+			cp[0] = _mm256_mask_mov_epi32(np[0], m[0], cp[0]);
+			cp[1] = _mm256_mask_mov_epi32(np[1], m[1], cp[1]);
+		}
+
+		_mm256_storeu_si256((ymm_t *)(result + k), cr[0]);
+		_mm256_storeu_si256((ymm_t *)(result + j), cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > CHAR_BIT) {
+			resolve_sc_avx512x8(result + k, res, pri, match + k,
+				CHAR_BIT, nb_trie, nb_pkt);
+			k += CHAR_BIT;
+			n -= CHAR_BIT;
+		}
+		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x8x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
-- 
2.17.1



* [dpdk-dev] [PATCH 20.11 7/7] acl: enhance AVX512 classify implementation
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (5 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 6/7] acl: introduce AVX512 classify implementation Konstantin Ananyev
@ 2020-08-07 16:28 ` " Konstantin Ananyev
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
  7 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-08-07 16:28 UTC
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add search_avx512x16x2(), which mostly uses 512-bit
registers/instructions and is able to process up to 32 flows in
parallel.
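
As a sanity check on the register layout used below (a standalone
sketch, not library code): a zmm register holds sixteen 32-bit
transitions but only eight 64-bit input-data pointers, which is why
each group of 16 flows carries a pair of pointer registers and the
search keeps a pdata[4] array for its two groups:

	#include <assert.h>
	#include <stdint.h>
	#include <immintrin.h>

	/* 16 lanes of 32-bit transition state per register */
	static_assert(sizeof(__m512i) / sizeof(uint32_t) == 16, "16 lanes");
	/* only 8 lanes of 64-bit pointers: 2 registers per 16 flows */
	static_assert(sizeof(__m512i) / sizeof(uint64_t) == 8, "8 pointers");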

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

This patch depends on
https://patches.dpdk.org/patch/70429/
being applied first.

 lib/librte_acl/acl_run_avx512.c    |   3 +
 lib/librte_acl/acl_run_avx512x16.h | 635 +++++++++++++++++++++++++++++
 2 files changed, 638 insertions(+)
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h

diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 8ee996679..332e359fb 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -121,11 +121,14 @@ resolve_mcgt8_avx512x1(uint32_t result[],
 }
 
 #include "acl_run_avx512x8.h"
+#include "acl_run_avx512x16.h"
 
 int
 rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	if (num >= 2 * MAX_SEARCHES_AVX16)
+		return search_avx512x16x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_AVX16)
 		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
new file mode 100644
index 000000000..53216bda3
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -0,0 +1,635 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define	MASK16_BIT	(sizeof(__mmask16) * CHAR_BIT)
+
+#define NUM_AVX512X16X2	(2 * MASK16_BIT)
+#define MSK_AVX512X16X2	(NUM_AVX512X16X2 - 1)
+
+static const __rte_x86_zmm_t zmm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+		RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+		8, 9, 10, 11,
+		12, 13, 14, 15,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same ways as quad range nodes.
+ */
+static __rte_always_inline __m512i
+calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input,
+	__m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi)
+{
+	__mmask64 qm;
+	__mmask16 dfa_msk;
+	__m512i addr, in, node_type, r, t;
+	__m512i dfa_ofs, quad_ofs;
+
+	t = _mm512_xor_si512(index_mask, index_mask);
+	in = _mm512_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm512_andnot_si512(index_mask, tr_lo);
+	addr = _mm512_and_si512(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm512_srli_epi32(in, 30);
+	r = _mm512_add_epi8(r, range_base);
+	t = _mm512_srli_epi32(in, 24);
+	r = _mm512_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm512_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _mm512_cmpgt_epi8_mask(in, tr_hi);
+	t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
+	t = _mm512_lzcnt_epi32(t);
+	t = _mm512_srli_epi32(t, 3);
+	quad_ofs = _mm512_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _mm512_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 16 transitions in parallel.
+ * tr_lo contains low 32 bits for 16 transitions.
+ * tr_hi contains high 32 bits for 16 transitions.
+ * next_input contains up to 4 input bytes for 16 flows.
+ */
+static __rte_always_inline __m512i
+transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo,
+	__m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 16 transitions. */
+	addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z,
+		zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 16 transitions at once. */
+	*tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0]));
+
+	next_input = _mm512_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 8 transactions at once. */
+	*tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0]));
+
+	return next_input;
+}
+
+static __rte_always_inline void
+first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input,
+	__mmask16 msk, __m512i *tr_lo, __m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm512_set1_epi32(UINT8_MAX);
+	root = _mm512_set1_epi32(flow->root_index);
+
+	addr = _mm512_and_si512(next_input, addr);
+	addr = _mm512_add_epi32(root, addr);
+
+	/* load lower 32 bits of 16 transitions at once. */
+	*tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 16 transitions at once. */
+	*tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+static inline __m512i
+get_next_4bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2],
+	uint32_t msk, __m512i *di)
+{
+	const int32_t *div;
+	__m512i one, zero, t, p[2];
+	ymm_t inp[2];
+
+	static const __rte_x86_zmm_t zmm_pminp = {
+		.u32 = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
+		},
+	};
+
+	const __mmask16 pmidx_msk = 0x5555;
+
+	static const __rte_x86_zmm_t zmm_pmidx[2] = {
+		[0] = {
+			.u32 = {
+				0, 0, 1, 0, 2, 0, 3, 0,
+				4, 0, 5, 0, 6, 0, 7, 0,
+			},
+		},
+		[1] = {
+			.u32 = {
+				8, 0, 9, 0, 10, 0, 11, 0,
+				12, 0, 13, 0, 14, 0, 15, 0,
+			},
+		},
+	};
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm512_set1_epi32(1);
+	zero = _mm512_xor_si512(one, one);
+
+	t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
+
+	*di = _mm512_mask_add_epi32(*di, msk, *di, one);
+
+	p[0] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[0].z, t);
+	p[1] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[1].z, t);
+
+	p[0] = _mm512_add_epi64(p[0], pdata[0]);
+	p[1] = _mm512_add_epi64(p[1], pdata[1]);
+
+	inp[0] = _mm512_mask_i64gather_epi32(_mm512_castsi512_si256(zero),
+		(msk & UINT8_MAX), p[0], NULL, sizeof(uint8_t));
+	inp[1] = _mm512_mask_i64gather_epi32(_mm512_castsi512_si256(zero),
+		(msk >> CHAR_BIT), p[1], NULL, sizeof(uint8_t));
+
+	return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]),
+			zmm_pminp.z, _mm512_castsi256_si512(inp[1]));
+}
+
+static inline void
+start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i pdata[2], __m512i *idx, __m512i *di)
+{
+	uint32_t n, nm[2];
+	__m512i ni, nd[2];
+
+	n = __builtin_popcount(msk & UINT8_MAX);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	nd[0] = _mm512_maskz_loadu_epi64(nm[0],
+		flow->idata + flow->num_packets);
+	nd[1] = _mm512_maskz_loadu_epi64(nm[1],
+		flow->idata + flow->num_packets + n);
+
+	ni = _mm512_set1_epi32(flow->num_packets);
+	ni = _mm512_add_epi32(ni, zmm_idx_add.z);
+
+	pdata[0] = _mm512_mask_expand_epi64(pdata[0], (msk & UINT8_MAX), nd[0]);
+	pdata[1] = _mm512_mask_expand_epi64(pdata[1], (msk >> CHAR_BIT), nd[1]);
+
+	*idx = _mm512_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+static inline uint32_t
+update_flow_mask16(const struct acl_flow_avx512 *flow, __mmask16 *fmsk,
+	__mmask16 *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
+static inline uint32_t
+match_process_avx512x16(struct acl_flow_avx512 *flow, __mmask16 *fmsk,
+	__mmask16 *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx,
+	__m512i *tr_lo, __m512i *tr_hi)
+{
+	uint32_t n;
+	__m512i res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z);
+
+	/* mask matched transitions to nop */
+	tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z);
+	tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z);
+
+	/* save found match indexes */
+	_mm512_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask16(flow, fmsk, rmsk);
+	start_flow16(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+static inline void
+match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, __mmask16 fm[2],
+	__m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2],
+	__m512i tr_lo[2], __m512i tr_hi[2])
+{
+	uint32_t n[2];
+	__mmask16 rm[2];
+
+	/* check for matches */
+	rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z);
+	rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z);
+
+	while ((rm[0] | rm[1]) != 0) {
+
+		n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		if (n[0] != 0) {
+			inp[0] = get_next_4bytes_avx512x16(flow, &pdata[0],
+				rm[0], &di[0]);
+			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
+				&tr_hi[0]);
+			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
+				zmm_match_mask.z);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_4bytes_avx512x16(flow, &pdata[2],
+				rm[1], &di[1]);
+			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
+				&tr_hi[1]);
+			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
+				zmm_match_mask.z);
+		}
+	}
+}
+
+static inline void
+search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
+{
+	__mmask16 fm[2];
+	__m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
+
+	in[0] = get_next_4bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0]);
+	in[1] = get_next_4bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1]);
+
+	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT16_MAX;
+	fm[1] = UINT16_MAX;
+
+	/* match check */
+	match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = get_next_4bytes_avx512x16(flow, &pdata[0], fm[0],
+			&di[0]);
+		in[1] = get_next_4bytes_avx512x16(flow, &pdata[2], fm[1],
+			&di[1]);
+
+		/* main 4B loop */
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+static inline __m512i
+resolve_match_idx_avx512x16(__m512i mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm512_slli_epi32(mi, match_log);
+}
+
+static inline __m512i
+resolve_pri_avx512x16(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask16 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask16 m;
+	__m512i cp, cr, np, nr, mch;
+
+	const __m512i zero = _mm512_set1_epi32(0);
+
+	mch = _mm512_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x16(mch);
+
+	cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm512_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x16(mch);
+
+		nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm512_cmpgt_epi32_mask(cp, np);
+		cr = _mm512_mask_mov_epi32(nr, m, cr);
+		cp = _mm512_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 16) matches for single category
+ */
+static inline void
+resolve_sc_avx512x16(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	__mmask16 msk;
+	__m512i cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip);
+	_mm512_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x16x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t i, j, k, n;
+	const uint32_t *pm;
+	const int32_t *res, *pri;
+	__mmask16 m[2];
+	__m512i cp[2], cr[2], np[2], nr[2], mch[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) {
+
+		j = k + MASK16_BIT;
+
+		/* load match indexes for first trie */
+		mch[0] = _mm512_loadu_si512(match + k);
+		mch[1] = _mm512_loadu_si512(match + j);
+
+		mch[0] = resolve_match_idx_avx512x16(mch[0]);
+		mch[1] = resolve_match_idx_avx512x16(mch[1]);
+
+		/* load matches and their priorities for first trie */
+
+		cr[0] = _mm512_i32gather_epi32(mch[0], res, sizeof(res[0]));
+		cr[1] = _mm512_i32gather_epi32(mch[1], res, sizeof(res[0]));
+
+		cp[0] = _mm512_i32gather_epi32(mch[0], pri, sizeof(pri[0]));
+		cp[1] = _mm512_i32gather_epi32(mch[1], pri, sizeof(pri[0]));
+
+		/* select match with highest priority */
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mch[0] = _mm512_loadu_si512(pm + k);
+			mch[1] = _mm512_loadu_si512(pm + j);
+
+			mch[0] = resolve_match_idx_avx512x16(mch[0]);
+			mch[1] = resolve_match_idx_avx512x16(mch[1]);
+
+			nr[0] = _mm512_i32gather_epi32(mch[0], res,
+				sizeof(res[0]));
+			nr[1] = _mm512_i32gather_epi32(mch[1], res,
+				sizeof(res[0]));
+
+			np[0] = _mm512_i32gather_epi32(mch[0], pri,
+				sizeof(pri[0]));
+			np[1] = _mm512_i32gather_epi32(mch[1], pri,
+				sizeof(pri[0]));
+
+			m[0] = _mm512_cmpgt_epi32_mask(cp[0], np[0]);
+			m[1] = _mm512_cmpgt_epi32_mask(cp[1], np[1]);
+
+			cr[0] = _mm512_mask_mov_epi32(nr[0], m[0], cr[0]);
+			cr[1] = _mm512_mask_mov_epi32(nr[1], m[1], cr[1]);
+
+			cp[0] = _mm512_mask_mov_epi32(np[0], m[0], cp[0]);
+			cp[1] = _mm512_mask_mov_epi32(np[1], m[1], cp[1]);
+		}
+
+		_mm512_storeu_si512(result + k, cr[0]);
+		_mm512_storeu_si512(result + j, cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > MASK16_BIT) {
+			resolve_sc_avx512x16(result + k, res, pri, match + k,
+				MASK16_BIT, nb_trie, nb_pkt);
+			k += MASK16_BIT;
+			n -= MASK16_BIT;
+		}
+		resolve_sc_avx512x16(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x16x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x16x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
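
For readers decoding the intrinsics above: per packet, the resolve stage
simply keeps, across all tries, the match with the highest priority.
A scalar sketch of the same logic (illustrative only, not part of the
patch; it assumes the match[]/res[]/pri[] layout used above):

static inline uint32_t
resolve_pri_scalar(const int32_t res[], const int32_t pri[],
	const uint32_t match[], uint32_t pkt, uint32_t nb_trie,
	uint32_t nb_skip)
{
	uint32_t i, mi;
	uint32_t cr;
	int32_t cp;

	/* match index -> offset into results/priority arrays */
	mi = match[pkt] << match_log;
	cr = res[mi];
	cp = pri[mi];

	for (i = 1; i != nb_trie; i++) {
		mi = match[i * nb_skip + pkt] << match_log;
		/*
		 * current match survives only on strictly greater
		 * priority, mirroring _mm512_cmpgt_epi32_mask() above
		 */
		if (!(cp > pri[mi])) {
			cr = res[mi];
			cp = pri[mi];
		}
	}
	return cr;
}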
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method
  2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
                   ` (6 preceding siblings ...)
  2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 7/7] acl: enhance " Konstantin Ananyev
@ 2020-09-15 16:50 ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
                     ` (11 more replies)
  7 siblings, 12 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

This patch series introduces support for an AVX512-specific classify
implementation in the ACL library.
It contains two code-paths:
the first uses mostly 256-bit instructions/registers and can
process up to 16 flows in parallel;
the second uses 512-bit instructions/registers in the majority of
places and can process up to 32 flows in parallel.
This runtime code-path selection is done internally based
on the input burst size and is totally opaque to the user.
On my SKX box test-acl shows ~20-65% improvement
(depending on rule-set and input burst size)
when switching from AVX2 to AVX512 classify algorithms.
ICX and CLX testing showed a similar level of speedup: up to ~50-60%.
The current AVX512 classify implementation is only supported on x86_64.
Note that this series introduces a formal ABI incompatibility
with previous versions of the ACL library.
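
As a rough sketch of that selection (illustrative only - the burst
thresholds here are placeholder names, the real dispatch in the
patches below may differ in details):

int
rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
	uint32_t *results, uint32_t num, uint32_t categories)
{
	if (num >= BURST_THRESHOLD_512)	/* large burst: 512-bit path */
		return search_avx512x16x2(ctx, data, results, num,
			categories);
	if (num >= BURST_THRESHOLD_256)	/* medium burst: 256-bit path */
		return search_avx512x8x2(ctx, data, results, num, categories);
	/* small burst: not worth the wide-SIMD setup cost */
	return rte_acl_classify_scalar(ctx, data, results, num, categories);
}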

v1 -> v2:
  Deduplicated 8/16 code paths as much as possible
  Updated default algorithm selection
    Removed library constructor to make it easier integrate with
    https://patches.dpdk.org/project/dpdk/list/?series=11831
  Updated docs

These patch series depends on:
https://patches.dpdk.org/patch/73922/mbox/
to be applied first.

Konstantin Ananyev (12):
  acl: fix x86 build when compiler doesn't support AVX2
  doc: fix mixing classify methods in ACL guide
  acl: remove of unused enum value
  acl: remove library constructor
  app/acl: few small improvements
  test/acl: expand classify test coverage
  acl: add infrastructure to support AVX512 classify
  acl: introduce AVX512 classify implementation
  acl: enhance AVX512 classify implementation
  acl: for AVX512 classify use 4B load whenever possible
  test/acl: add AVX512 classify support
  app/acl: add AVX512 classify support

 app/test-acl/main.c                           |  19 +-
 app/test/test_acl.c                           | 104 ++--
 config/x86/meson.build                        |   3 +-
 .../prog_guide/packet_classif_access_ctrl.rst |  15 +
 doc/guides/rel_notes/deprecation.rst          |   4 -
 doc/guides/rel_notes/release_20_11.rst        |   9 +
 lib/librte_acl/acl.h                          |  12 +
 lib/librte_acl/acl_bld.c                      |  34 ++
 lib/librte_acl/acl_gen.c                      |   2 +-
 lib/librte_acl/acl_run_avx512.c               | 331 +++++++++++
 lib/librte_acl/acl_run_avx512x16.h            | 526 ++++++++++++++++++
 lib/librte_acl/acl_run_avx512x8.h             | 439 +++++++++++++++
 lib/librte_acl/meson.build                    |  39 ++
 lib/librte_acl/rte_acl.c                      | 198 +++++--
 lib/librte_acl/rte_acl.h                      |   3 +-
 15 files changed, 1638 insertions(+), 100 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 02/12] doc: fix mixing classify methods in ACL guide Konstantin Ananyev
                     ` (10 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Right now we define a dummy version of rte_acl_classify_avx2()
when both X86 and AVX2 are not detected, though it should be
defined for the non-AVX2 case only.
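
After the fix the guards look roughly like this (a sketch of the
resulting layout; see the diff below):

#ifndef CC_AVX2_SUPPORT
/* dummy rte_acl_classify_avx2() returning -ENOTSUP,
 * needed on any arch built without AVX2 support
 */
#endif

#ifndef RTE_ARCH_X86
/* dummy rte_acl_classify_sse() returning -ENOTSUP */
#endif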

Fixes: e53ce4e41379 ("acl: remove use of weak functions")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 777ec4d34..715b02359 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,7 +16,6 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
-#ifndef RTE_ARCH_X86
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -33,6 +32,7 @@ rte_acl_classify_avx2(__rte_unused const struct rte_acl_ctx *ctx,
 }
 #endif
 
+#ifndef RTE_ARCH_X86
 int
 rte_acl_classify_sse(__rte_unused const struct rte_acl_ctx *ctx,
 	__rte_unused const uint8_t **data,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 02/12] doc: fix mixing classify methods in ACL guide
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value Konstantin Ananyev
                     ` (9 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev, stable

Add brief description for missing ACL classify algorithms:
RTE_ACL_CLASSIFY_NEON and RTE_ACL_CLASSIFY_ALTIVEC.

Fixes: 34fa6c27c156 ("acl: add NEON optimization for ARMv8")
Fixes: 1d73135f9f1c ("acl: add AltiVec for ppc64")
Cc: stable@dpdk.org

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 doc/guides/prog_guide/packet_classif_access_ctrl.rst | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index 0345512b9..daf03e6d7 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -373,6 +373,12 @@ There are several implementations of classify algorithm:
 
 *   **RTE_ACL_CLASSIFY_AVX2**: vector implementation, can process up to 16 flows in parallel. Requires AVX2 support.
 
+*   **RTE_ACL_CLASSIFY_NEON**: vector implementation, can process up to 8 flows
+    in parallel. Requires NEON support.
+
+*   **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8
+    flows in parallel. Requires ALTIVEC support.
+
 It is purely a runtime decision which method to choose, there is no build-time difference.
 All implementations operates over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup ACL library determines the highest available classify method for the given platform and sets it as default one. Though the user has an ability to override the default classifier function for a given ACL context or perform particular search using non-default classify method. In that case it is user responsibility to make sure that given platform supports selected classify implementation.
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 02/12] doc: fix mixing classify methods in ACL guide Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-27  3:27     ` Ruifeng Wang
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 04/12] acl: remove library constructor Konstantin Ananyev
                     ` (8 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Remove the unused enum value (RTE_ACL_CLASSIFY_NUM).
This enum value is not used inside DPDK, while it prevents
adding new classify algorithms without causing an ABI breakage.

Note that this change introduces a formal ABI incompatibility
with previous versions of the ACL library.
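
To illustrate why a trailing counter value is an ABI hazard (example
only, not part of the patch):

/* before: applications may have compiled in RTE_ACL_CLASSIFY_NUM == 6,
 * e.g. to size a table indexed by algorithm.
 */
enum rte_acl_classify_alg {
	/* ... values 0..4 ... */
	RTE_ACL_CLASSIFY_ALTIVEC = 5,
	RTE_ACL_CLASSIFY_NUM	/* == 6 */
};

/* after appending a new algorithm, the counter silently becomes 7,
 * so old binaries and the new library disagree on its value.
 */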

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 doc/guides/rel_notes/deprecation.rst   | 4 ----
 doc/guides/rel_notes/release_20_11.rst | 4 ++++
 lib/librte_acl/rte_acl.h               | 1 -
 3 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
index 52168f775..3279a01ef 100644
--- a/doc/guides/rel_notes/deprecation.rst
+++ b/doc/guides/rel_notes/deprecation.rst
@@ -288,10 +288,6 @@ Deprecation Notices
   - https://patches.dpdk.org/patch/71457/
   - https://patches.dpdk.org/patch/71456/
 
-* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value will be removed.
-  This enum value is not used inside DPDK, while it prevents to add new
-  classify algorithms without causing an ABI breakage.
-
 * sched: To allow more traffic classes, flexible mapping of pipe queues to
   traffic classes, and subport level configuration of pipes and queues
   changes will be made to macros, data structures and API functions defined
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index b729bdf20..a9a1b0305 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -97,6 +97,10 @@ API Changes
   and the function ``rte_rawdev_queue_conf_get()``
   from ``void`` to ``int`` allowing the return of error codes from drivers.
 
+* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value has been removed.
+  This enum value was not used inside DPDK, while it prevented adding new
+  classify algorithms without causing an ABI breakage.
+
 
 ABI Changes
 -----------
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index aa22e70c6..b814423a6 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,7 +241,6 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
-	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
 };
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 04/12] acl: remove library constructor
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (2 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 05/12] app/acl: few small improvements Konstantin Ananyev
                     ` (7 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Right now the ACL library determines the best possible (default) classify
method on a given platform with a special constructor function rte_acl_init().
This patch makes the following changes:
 - Move selection of the default classify method into a separate private
   function and call it for each ACL context creation (rte_acl_create()).
 - Remove the library constructor function.
 - Make rte_acl_set_ctx_classify() check that the requested algorithm
   is supported on the given platform.

The purpose of these changes is to improve and simplify the algorithm
selection process and to prepare the ACL library for integration with the
"add max SIMD bitwidth to EAL" patch-set
(https://patches.dpdk.org/project/dpdk/list/?series=11831)
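
A usage sketch of the resulting behaviour (illustrative only; param is
an already-filled rte_acl_param, error handling abbreviated):

struct rte_acl_ctx *ctx;
int rc;

/* rte_acl_create() now picks the best supported method by itself */
ctx = rte_acl_create(&param);

/* an explicit request is validated against platform and build support */
rc = rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX2);
if (rc == -ENOTSUP)
	/* AVX2 unavailable here - fall back to the best supported one */
	rc = rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_DEFAULT);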

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/rte_acl.c | 166 ++++++++++++++++++++++++++++++---------
 lib/librte_acl/rte_acl.h |   1 +
 2 files changed, 132 insertions(+), 35 deletions(-)

diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index 715b02359..fbcf45fdc 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -79,57 +79,153 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
 };
 
-/* by default, use always available scalar code path. */
-static enum rte_acl_classify_alg rte_acl_default_classify =
-	RTE_ACL_CLASSIFY_SCALAR;
+/*
+ * Helper function for acl_check_alg.
+ * Check support for ARM specific classify methods.
+ */
+static int
+acl_check_alg_arm(enum rte_acl_classify_alg alg)
+{
+	if (alg == RTE_ACL_CLASSIFY_NEON) {
+#if defined(RTE_ARCH_ARM64)
+		return 0;
+#elif defined(RTE_ARCH_ARM)
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON))
+			return 0;
+		return -ENOTSUP;
+#else
+		return -ENOTSUP;
+#endif
+	}
+
+	return -EINVAL;
+}
 
-static void
-rte_acl_set_default_classify(enum rte_acl_classify_alg alg)
+/*
+ * Helper function for acl_check_alg.
+ * Check support for PPC specific classify methods.
+ */
+static int
+acl_check_alg_ppc(enum rte_acl_classify_alg alg)
 {
-	rte_acl_default_classify = alg;
+	if (alg == RTE_ACL_CLASSIFY_ALTIVEC) {
+#if defined(RTE_ARCH_PPC_64)
+		return 0;
+#else
+		return -ENOTSUP;
+#endif
+	}
+
+	return -EINVAL;
 }
 
-extern int
-rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+/*
+ * Helper function for acl_check_alg.
+ * Check support for x86 specific classify methods.
+ */
+static int
+acl_check_alg_x86(enum rte_acl_classify_alg alg)
 {
-	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
-		return -EINVAL;
+	if (alg == RTE_ACL_CLASSIFY_AVX2) {
+#ifdef CC_AVX2_SUPPORT
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
 
-	ctx->alg = alg;
-	return 0;
+	if (alg == RTE_ACL_CLASSIFY_SSE) {
+#ifdef RTE_ARCH_X86
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
+
+	return -EINVAL;
 }
 
 /*
- * Select highest available classify method as default one.
- * Note that CLASSIFY_AVX2 should be set as a default only
- * if both conditions are met:
- * at build time compiler supports AVX2 and target cpu supports AVX2.
+ * Check if input alg is supported by given platform/binary.
+ * Note that both conditions should be met:
+ * - at build time the compiler supports the ISA used by the given method
+ * - at run time the target cpu supports the necessary ISA.
  */
-RTE_INIT(rte_acl_init)
+static int
+acl_check_alg(enum rte_acl_classify_alg alg)
 {
-	enum rte_acl_classify_alg alg = RTE_ACL_CLASSIFY_DEFAULT;
+	switch (alg) {
+	case RTE_ACL_CLASSIFY_NEON:
+		return acl_check_alg_arm(alg);
+	case RTE_ACL_CLASSIFY_ALTIVEC:
+		return acl_check_alg_ppc(alg);
+	case RTE_ACL_CLASSIFY_AVX2:
+	case RTE_ACL_CLASSIFY_SSE:
+		return acl_check_alg_x86(alg);
+	/* scalar method is supported on all platforms */
+	case RTE_ACL_CLASSIFY_SCALAR:
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
 
-#if defined(RTE_ARCH_ARM64)
-	alg =  RTE_ACL_CLASSIFY_NEON;
-#elif defined(RTE_ARCH_ARM)
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_NEON))
-		alg =  RTE_ACL_CLASSIFY_NEON;
+/*
+ * Get preferred alg for given platform.
+ */
+static enum rte_acl_classify_alg
+acl_get_best_alg(void)
+{
+	/*
+	 * array of supported methods for each platform.
+	 * Note that order is important - from most to least preferable.
+	 */
+	static const enum rte_acl_classify_alg alg[] = {
+#if defined(RTE_ARCH_ARM)
+		RTE_ACL_CLASSIFY_NEON,
 #elif defined(RTE_ARCH_PPC_64)
-	alg = RTE_ACL_CLASSIFY_ALTIVEC;
-#else
-#ifdef CC_AVX2_SUPPORT
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
-		alg = RTE_ACL_CLASSIFY_AVX2;
-	else if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
-#else
-	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE4_1))
+		RTE_ACL_CLASSIFY_ALTIVEC,
+#elif defined(RTE_ARCH_X86)
+		RTE_ACL_CLASSIFY_AVX2,
+		RTE_ACL_CLASSIFY_SSE,
 #endif
-		alg = RTE_ACL_CLASSIFY_SSE;
+		RTE_ACL_CLASSIFY_SCALAR,
+	};
 
-#endif
-	rte_acl_set_default_classify(alg);
+	uint32_t i;
+
+	/* find best possible alg */
+	for (i = 0; i != RTE_DIM(alg) && acl_check_alg(alg[i]) != 0; i++)
+		;
+
+	/* we always have to find something suitable */
+	RTE_VERIFY(i != RTE_DIM(alg));
+	return alg[i];
+}
+
+extern int
+rte_acl_set_ctx_classify(struct rte_acl_ctx *ctx, enum rte_acl_classify_alg alg)
+{
+	int32_t rc;
+
+	/* formal parameters check */
+	if (ctx == NULL || (uint32_t)alg >= RTE_DIM(classify_fns))
+		return -EINVAL;
+
+	/* user asked us to select the *best* one */
+	if (alg == RTE_ACL_CLASSIFY_DEFAULT)
+		alg = acl_get_best_alg();
+
+	/* check that given alg is supported */
+	rc = acl_check_alg(alg);
+	if (rc != 0)
+		return rc;
+
+	ctx->alg = alg;
+	return 0;
 }
 
+
 int
 rte_acl_classify_alg(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories,
@@ -262,7 +358,7 @@ rte_acl_create(const struct rte_acl_param *param)
 		ctx->max_rules = param->max_rule_num;
 		ctx->rule_sz = param->rule_size;
 		ctx->socket_id = param->socket_id;
-		ctx->alg = rte_acl_default_classify;
+		ctx->alg = acl_get_best_alg();
 		strlcpy(ctx->name, param->name, sizeof(ctx->name));
 
 		te->data = (void *) ctx;
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index b814423a6..3999f15de 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -329,6 +329,7 @@ rte_acl_classify_alg(const struct rte_acl_ctx *ctx,
  *   existing algorithm, and that it could be run on the given CPU.
  * @return
  *   - -EINVAL if the parameters are invalid.
+ *   - -ENOTSUP requested algorithm is not supported by given platform.
  *   - Zero if operation completed successfully.
  */
 extern int
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 05/12] app/acl: few small improvements
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (3 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 04/12] acl: remove library constructor Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 06/12] test/acl: expand classify test coverage Konstantin Ananyev
                     ` (6 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

- enhance output to print extra stats
- use rte_rdtsc_precise() for cycle measurements
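
The added figures are derived as follows (a sketch, mirroring the diff
below):

uint64_t tm = rte_rdtsc_precise() - start;	/* elapsed TSC cycles */
long double sec = (long double)tm / rte_get_timer_hz();
long double cyc_per_pkt = (pkt == 0) ? 0 : (long double)tm / pkt;
long double pkt_per_sec = pkt / sec;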

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index 0a5dfb621..d9b65517c 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -862,9 +862,10 @@ search_ip5tuples(__rte_unused void *arg)
 {
 	uint64_t pkt, start, tm;
 	uint32_t i, lcore;
+	long double st;
 
 	lcore = rte_lcore_id();
-	start = rte_rdtsc();
+	start = rte_rdtsc_precise();
 	pkt = 0;
 
 	for (i = 0; i != config.iter_num; i++) {
@@ -872,12 +873,16 @@ search_ip5tuples(__rte_unused void *arg)
 			config.trace_step, config.alg.name);
 	}
 
-	tm = rte_rdtsc() - start;
+	tm = rte_rdtsc_precise() - start;
+
+	st = (long double)tm / rte_get_timer_hz();
 	dump_verbose(DUMP_NONE, stdout,
 		"%s  @lcore %u: %" PRIu32 " iterations, %" PRIu64 " pkts, %"
-		PRIu32 " categories, %" PRIu64 " cycles, %#Lf cycles/pkt\n",
-		__func__, lcore, i, pkt, config.run_categories,
-		tm, (pkt == 0) ? 0 : (long double)tm / pkt);
+		PRIu32 " categories, %" PRIu64 " cycles (%.2Lf sec), "
+		"%.2Lf cycles/pkt, %.2Lf pkt/sec\n",
+		__func__, lcore, i, pkt,
+		config.run_categories, tm, st,
+		(pkt == 0) ? 0 : (long double)tm / pkt, pkt / st);
 
 	return 0;
 }
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 06/12] test/acl: expand classify test coverage
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (4 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 05/12] app/acl: few small improvements Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
                     ` (5 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Make the classify test run for all supported methods.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test/test_acl.c | 103 ++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 52 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 316bf4d06..333b34757 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -266,22 +266,20 @@ rte_acl_ipv4vlan_build(struct rte_acl_ctx *ctx,
 }
 
 /*
- * Test scalar and SSE ACL lookup.
+ * Test ACL lookup (selected alg).
  */
 static int
-test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
-	size_t dim)
+test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
+	const uint8_t *data[], size_t dim, enum rte_acl_classify_alg alg)
 {
-	int ret, i;
-	uint32_t result, count;
+	int32_t ret;
+	uint32_t i, result, count;
 	uint32_t results[dim * RTE_ACL_MAX_CATEGORIES];
-	const uint8_t *data[dim];
-	/* swap all bytes in the data to network order */
-	bswap_test_data(test_data, dim, 1);
 
-	/* store pointers to test data */
-	for (i = 0; i < (int) dim; i++)
-		data[i] = (uint8_t *)&test_data[i];
+	/* set given classify alg, skip test if alg is not supported */
+	ret = rte_acl_set_ctx_classify(acx, alg);
+	if (ret == -ENOTSUP)
+		return 0;
 
 	/**
 	 * these will run quite a few times, it's necessary to test code paths
@@ -291,12 +289,13 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 		ret = rte_acl_classify(acx, data, results,
 				count, RTE_ACL_MAX_CATEGORIES);
 		if (ret != 0) {
-			printf("Line %i: SSE classify failed!\n", __LINE__);
-			goto err;
+			printf("Line %i: classify(alg=%d) failed!\n",
+				__LINE__, alg);
+			return ret;
 		}
 
 		/* check if we allow everything we should allow */
-		for (i = 0; i < (int) count; i++) {
+		for (i = 0; i < count; i++) {
 			result =
 				results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW];
 			if (result != test_data[i].allow) {
@@ -304,63 +303,63 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, test_data[i].allow,
 					result);
-				ret = -EINVAL;
-				goto err;
+				return -EINVAL;
 			}
 		}
 
 		/* check if we deny everything we should deny */
-		for (i = 0; i < (int) count; i++) {
+		for (i = 0; i < count; i++) {
 			result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY];
 			if (result != test_data[i].deny) {
 				printf("Line %i: Error in deny results at %i "
 					"(expected %"PRIu32" got %"PRIu32")!\n",
 					__LINE__, i, test_data[i].deny,
 					result);
-				ret = -EINVAL;
-				goto err;
+				return -EINVAL;
 			}
 		}
 	}
 
-	/* make a quick check for scalar */
-	ret = rte_acl_classify_alg(acx, data, results,
-			dim, RTE_ACL_MAX_CATEGORIES,
-			RTE_ACL_CLASSIFY_SCALAR);
-	if (ret != 0) {
-		printf("Line %i: scalar classify failed!\n", __LINE__);
-		goto err;
-	}
+	/* restore default classify alg */
+	return rte_acl_set_ctx_classify(acx, RTE_ACL_CLASSIFY_DEFAULT);
+}
 
-	/* check if we allow everything we should allow */
-	for (i = 0; i < (int) dim; i++) {
-		result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_ALLOW];
-		if (result != test_data[i].allow) {
-			printf("Line %i: Error in allow results at %i "
-					"(expected %"PRIu32" got %"PRIu32")!\n",
-					__LINE__, i, test_data[i].allow,
-					result);
-			ret = -EINVAL;
-			goto err;
-		}
-	}
+/*
+ * Test ACL lookup (all possible methods).
+ */
+static int
+test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
+	size_t dim)
+{
+	int32_t ret;
+	uint32_t i;
+	const uint8_t *data[dim];
 
-	/* check if we deny everything we should deny */
-	for (i = 0; i < (int) dim; i++) {
-		result = results[i * RTE_ACL_MAX_CATEGORIES + ACL_DENY];
-		if (result != test_data[i].deny) {
-			printf("Line %i: Error in deny results at %i "
-					"(expected %"PRIu32" got %"PRIu32")!\n",
-					__LINE__, i, test_data[i].deny,
-					result);
-			ret = -EINVAL;
-			goto err;
-		}
-	}
+	static const enum rte_acl_classify_alg alg[] = {
+		RTE_ACL_CLASSIFY_SCALAR,
+		RTE_ACL_CLASSIFY_SSE,
+		RTE_ACL_CLASSIFY_AVX2,
+		RTE_ACL_CLASSIFY_NEON,
+		RTE_ACL_CLASSIFY_ALTIVEC,
+	};
+
+	/* swap all bytes in the data to network order */
+	bswap_test_data(test_data, dim, 1);
+
+	/* store pointers to test data */
+	for (i = 0; i < dim; i++)
+		data[i] = (uint8_t *)&test_data[i];
 
 	ret = 0;
+	for (i = 0; i != RTE_DIM(alg); i++) {
+		ret = test_classify_alg(acx, test_data, data, dim, alg[i]);
+		if (ret < 0) {
+			printf("Line %i: %s() for alg=%d failed, errno=%d\n",
+				__LINE__, __func__, alg[i], -ret);
+			break;
+		}
+	}
 
-err:
 	/* swap data back to cpu order so that next time tests don't fail */
 	bswap_test_data(test_data, dim, 0);
 	return ret;
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (5 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 06/12] test/acl: expand classify test coverage Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-16  9:11     ` Bruce Richardson
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 08/12] acl: introduce AVX512 classify implementation Konstantin Ananyev
                     ` (4 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add necessary changes to support the new AVX512-specific ACL classify
algorithm:
 - changes in meson.build to check that the build tools
   (compiler, assembler, etc.) properly support AVX512.
 - run-time checks to make sure the target platform does support AVX512.
 - a dummy rte_acl_classify_avx512() for targets where the AVX512
   implementation cannot be properly supported.
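
Taken together, the build-time and run-time gates reduce to the
following pattern (a minimal sketch; acl_avx512_supported() is a
hypothetical name - in the patch this logic lives in
acl_check_alg_x86(), see the diff below):

static int
acl_avx512_supported(void)
{
#ifdef CC_AVX512_SUPPORT	/* set only if the toolchain can emit AVX512 */
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) &&
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) &&
			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW))
		return 1;	/* both compiler and CPU are capable */
#endif
	return 0;		/* missing either gate -> -ENOTSUP path */
}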

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 config/x86/meson.build          |  3 ++-
 lib/librte_acl/acl.h            |  4 ++++
 lib/librte_acl/acl_run_avx512.c | 17 ++++++++++++++
 lib/librte_acl/meson.build      | 39 +++++++++++++++++++++++++++++++++
 lib/librte_acl/rte_acl.c        | 29 ++++++++++++++++++++++++
 lib/librte_acl/rte_acl.h        |  1 +
 6 files changed, 92 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512.c

diff --git a/config/x86/meson.build b/config/x86/meson.build
index 6ec020ef6..c5626e914 100644
--- a/config/x86/meson.build
+++ b/config/x86/meson.build
@@ -23,7 +23,8 @@ foreach f:base_flags
 endforeach
 
 optional_flags = ['AES', 'PCLMUL',
-		'AVX', 'AVX2', 'AVX512F',
+		'AVX', 'AVX2',
+		'AVX512F', 'AVX512VL', 'AVX512CD', 'AVX512BW',
 		'RDRND', 'RDSEED']
 foreach f:optional_flags
 	if cc.get_define('__@0@__'.format(f), args: machine_args) == '1'
diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 39d45a0c2..2022cf253 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -201,6 +201,10 @@ int
 rte_acl_classify_avx2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
 
+int
+rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories);
+
 int
 rte_acl_classify_neon(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories);
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
new file mode 100644
index 000000000..67274989d
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -0,0 +1,17 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#include "acl_run_sse.h"
+
+int
+rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t num, uint32_t categories)
+{
+	if (num >= MAX_SEARCHES_SSE8)
+		return search_sse_8(ctx, data, results, num, categories);
+	if (num >= MAX_SEARCHES_SSE4)
+		return search_sse_4(ctx, data, results, num, categories);
+
+	return rte_acl_classify_scalar(ctx, data, results, num, categories);
+}
diff --git a/lib/librte_acl/meson.build b/lib/librte_acl/meson.build
index d1e2c184c..b2fd61cad 100644
--- a/lib/librte_acl/meson.build
+++ b/lib/librte_acl/meson.build
@@ -27,6 +27,45 @@ if dpdk_conf.has('RTE_ARCH_X86')
 		cflags += '-DCC_AVX2_SUPPORT'
 	endif
 
+	# compile AVX512 version if:
+	# we are building 64-bit binary AND binutils can generate proper code
+
+	if dpdk_conf.has('RTE_ARCH_X86_64') and binutils_ok.returncode() == 0
+
+		# compile AVX512 version if either:
+		# a. we have AVX512 supported in minimum instruction set
+		#    baseline
+		# b. it's not minimum instruction set, but supported by
+		#    compiler
+		#
+		# in former case, just add avx512 C file to files list
+		# in latter case, compile c file to static lib, using correct
+		# compiler flags, and then have the .o file from static lib
+		# linked into main lib.
+
+		if dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512F') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512VL') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512CD') and \
+			dpdk_conf.has('RTE_MACHINE_CPUFLAG_AVX512BW')
+
+			sources += files('acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+
+		elif cc.has_multi_arguments('-mavx512f', '-mavx512vl',
+					'-mavx512cd', '-mavx512bw')
+
+			avx512_tmplib = static_library('avx512_tmp',
+				'acl_run_avx512.c',
+				dependencies: static_rte_eal,
+				c_args: cflags +
+					['-mavx512f', '-mavx512vl',
+					 '-mavx512cd', '-mavx512bw'])
+			objs += avx512_tmplib.extract_objects(
+					'acl_run_avx512.c')
+			cflags += '-DCC_AVX512_SUPPORT'
+		endif
+	endif
+
 elif dpdk_conf.has('RTE_ARCH_ARM') or dpdk_conf.has('RTE_ARCH_ARM64')
 	cflags += '-flax-vector-conversions'
 	sources += files('acl_run_neon.c')
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index fbcf45fdc..fdcb7a798 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -16,6 +16,22 @@ static struct rte_tailq_elem rte_acl_tailq = {
 };
 EAL_REGISTER_TAILQ(rte_acl_tailq)
 
+#ifndef CC_AVX512_SUPPORT
+/*
+ * If the compiler doesn't support AVX512 instructions,
+ * then the dummy one would be used instead for AVX512 classify method.
+ */
+int
+rte_acl_classify_avx512(__rte_unused const struct rte_acl_ctx *ctx,
+	__rte_unused const uint8_t **data,
+	__rte_unused uint32_t *results,
+	__rte_unused uint32_t num,
+	__rte_unused uint32_t categories)
+{
+	return -ENOTSUP;
+}
+#endif
+
 #ifndef CC_AVX2_SUPPORT
 /*
  * If the compiler doesn't support AVX2 instructions,
@@ -77,6 +93,7 @@ static const rte_acl_classify_t classify_fns[] = {
 	[RTE_ACL_CLASSIFY_AVX2] = rte_acl_classify_avx2,
 	[RTE_ACL_CLASSIFY_NEON] = rte_acl_classify_neon,
 	[RTE_ACL_CLASSIFY_ALTIVEC] = rte_acl_classify_altivec,
+	[RTE_ACL_CLASSIFY_AVX512] = rte_acl_classify_avx512,
 };
 
 /*
@@ -126,6 +143,17 @@ acl_check_alg_ppc(enum rte_acl_classify_alg alg)
 static int
 acl_check_alg_x86(enum rte_acl_classify_alg alg)
 {
+	if (alg == RTE_ACL_CLASSIFY_AVX512) {
+#ifdef CC_AVX512_SUPPORT
+		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512VL) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512CD) &&
+			rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512BW))
+			return 0;
+#endif
+		return -ENOTSUP;
+	}
+
 	if (alg == RTE_ACL_CLASSIFY_AVX2) {
 #ifdef CC_AVX2_SUPPORT
 		if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
@@ -159,6 +187,7 @@ acl_check_alg(enum rte_acl_classify_alg alg)
 		return acl_check_alg_arm(alg);
 	case RTE_ACL_CLASSIFY_ALTIVEC:
 		return acl_check_alg_ppc(alg);
+	case RTE_ACL_CLASSIFY_AVX512:
 	case RTE_ACL_CLASSIFY_AVX2:
 	case RTE_ACL_CLASSIFY_SSE:
 		return acl_check_alg_x86(alg);
diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
index 3999f15de..d243a1c84 100644
--- a/lib/librte_acl/rte_acl.h
+++ b/lib/librte_acl/rte_acl.h
@@ -241,6 +241,7 @@ enum rte_acl_classify_alg {
 	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
 	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
 	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
+	RTE_ACL_CLASSIFY_AVX512 = 6,    /**< requires AVX512 support. */
 };
 
 /**
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 08/12] acl: introduce AVX512 classify implementation
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (6 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 09/12] acl: enhance " Konstantin Ananyev
                     ` (3 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Introduce a classify implementation that uses AVX512-specific ISA.
The current approach uses a mix of 256-bit and 512-bit registers/instructions
and is able to process up to 16 flows in parallel.
Note that for now only a 64-bit version of rte_acl_classify_avx512()
is available.
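
The 256/512-bit mix follows directly from the lane widths needed for 8
parallel flows (a sketch of the per-flow state, matching the types used
in the code below):

__m512i pdata;	/* 8 flows x 64-bit input-data pointers   = 512 bits */
__m256i di;	/* 8 flows x 32-bit data indexes           = 256 bits */
__m256i tr_lo;	/* 8 flows x low 32 bits of a transition  = 256 bits */
__m256i tr_hi;	/* 8 flows x high 32 bits of a transition = 256 bits */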

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h              |   7 +
 lib/librte_acl/acl_gen.c          |   2 +-
 lib/librte_acl/acl_run_avx512.c   | 145 +++++++
 lib/librte_acl/acl_run_avx512x8.h | 620 ++++++++++++++++++++++++++++++
 4 files changed, 773 insertions(+), 1 deletion(-)
 create mode 100644 lib/librte_acl/acl_run_avx512x8.h

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 2022cf253..3f0719f33 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -76,6 +76,13 @@ struct rte_acl_bitset {
  * input_byte - ((uint8_t *)&transition)[4 + input_byte / 64].
  */
 
+/*
+ * Each ACL RT contains an idle nomatch node:
+ * a SINGLE node at predefined position (RTE_ACL_DFA_SIZE)
+ * that points to itself.
+ */
+#define RTE_ACL_IDLE_NODE	(RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE)
+
 /*
  * Structure of a node is a set of ptrs and each ptr has a bit map
  * of values associated with this transition.
diff --git a/lib/librte_acl/acl_gen.c b/lib/librte_acl/acl_gen.c
index f1b9d12f1..e759a2ca1 100644
--- a/lib/librte_acl/acl_gen.c
+++ b/lib/librte_acl/acl_gen.c
@@ -496,7 +496,7 @@ rte_acl_gen(struct rte_acl_ctx *ctx, struct rte_acl_trie *trie,
 	 * highest index, that points to itself)
 	 */
 
-	node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_DFA_SIZE | RTE_ACL_NODE_SINGLE;
+	node_array[RTE_ACL_DFA_SIZE] = RTE_ACL_IDLE_NODE;
 
 	for (n = 0; n < RTE_ACL_DFA_SIZE; n++)
 		node_array[n] = no_match;
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 67274989d..353a3c004 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -4,10 +4,155 @@
 
 #include "acl_run_sse.h"
 
+/* sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results) */
+static const uint32_t match_log = 5;
+
+struct acl_flow_avx512 {
+	uint32_t num_packets;       /* number of packets processed */
+	uint32_t total_packets;     /* max number of packets to process */
+	uint32_t root_index;        /* current root index */
+	const uint64_t *trans;      /* transition table */
+	const uint32_t *data_index; /* input data indexes */
+	const uint8_t **idata;      /* input data */
+	uint32_t *matches;          /* match indexes */
+};
+
+static inline void
+acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
+	uint32_t trie, const uint8_t *data[], uint32_t *matches,
+	uint32_t total_packets)
+{
+	flow->num_packets = 0;
+	flow->total_packets = total_packets;
+	flow->root_index = ctx->trie[trie].root_index;
+	flow->trans = ctx->trans_table;
+	flow->data_index = ctx->trie[trie].data_index;
+	flow->idata = data;
+	flow->matches = matches;
+}
+
+/*
+ * Resolve matches for multiple categories (LE 8, use 128b instructions/regs)
+ */
+static inline void
+resolve_mcle8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, j, k, mi, mn;
+	__mmask8 msk;
+	xmm_t cp, cr, np, nr;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		for (j = 0; j != nb_cat; j += RTE_ACL_RESULTS_MULTIPLIER) {
+
+			cr = _mm_loadu_si128((const xmm_t *)(res + mi + j));
+			cp = _mm_loadu_si128((const xmm_t *)(pri + mi + j));
+
+			for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+				mn = j + (pm[k] << match_log);
+
+				nr = _mm_loadu_si128((const xmm_t *)(res + mn));
+				np = _mm_loadu_si128((const xmm_t *)(pri + mn));
+
+				msk = _mm_cmpgt_epi32_mask(cp, np);
+				cr = _mm_mask_mov_epi32(nr, msk, cr);
+				cp = _mm_mask_mov_epi32(np, msk, cp);
+			}
+
+			_mm_storeu_si128((xmm_t *)(result + j), cr);
+		}
+	}
+}
+
+/*
+ * Resolve matches for multiple categories (GT 8, use 512b instructions/regs)
+ */
+static inline void
+resolve_mcgt8_avx512x1(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_cat, uint32_t nb_trie)
+{
+	const int32_t *pri;
+	const uint32_t *pm, *res;
+	uint32_t i, k, mi;
+	__mmask16 cm, sm;
+	__m512i cp, cr, np, nr;
+
+	const uint32_t match_log = 5;
+
+	res = pr->results;
+	pri = pr->priority;
+
+	cm = (1 << nb_cat) - 1;
+
+	for (k = 0; k != nb_pkt; k++, result += nb_cat) {
+
+		mi = match[k] << match_log;
+
+		cr = _mm512_maskz_loadu_epi32(cm, res + mi);
+		cp = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mi = pm[k] << match_log;
+
+			nr = _mm512_maskz_loadu_epi32(cm, res + mi);
+			np = _mm512_maskz_loadu_epi32(cm, pri + mi);
+
+			sm = _mm512_cmpgt_epi32_mask(cp, np);
+			cr = _mm512_mask_mov_epi32(nr, sm, cr);
+			cp = _mm512_mask_mov_epi32(np, sm, cp);
+		}
+
+		_mm512_mask_storeu_epi32(result, cm, cr);
+	}
+}
+
+static inline ymm_t
+_m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
+{
+	__m512i t;
+	rte_ymm_t v;
+	__rte_x86_zmm_t p;
+
+	static const uint32_t zero;
+
+	t = _mm512_set1_epi64((uintptr_t)&zero);
+	p.z = _mm512_mask_mov_epi64(t, mask, pdata);
+
+	v.u32[0] = *(uint8_t *)p.u64[0];
+	v.u32[1] = *(uint8_t *)p.u64[1];
+	v.u32[2] = *(uint8_t *)p.u64[2];
+	v.u32[3] = *(uint8_t *)p.u64[3];
+	v.u32[4] = *(uint8_t *)p.u64[4];
+	v.u32[5] = *(uint8_t *)p.u64[5];
+	v.u32[6] = *(uint8_t *)p.u64[6];
+	v.u32[7] = *(uint8_t *)p.u64[7];
+
+	return v.y;
+}
+
+
+#include "acl_run_avx512x8.h"
+
 int
 rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	if (num >= MAX_SEARCHES_AVX16)
+		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
 		return search_sse_8(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE4)
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
new file mode 100644
index 000000000..66fc26b26
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -0,0 +1,620 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+#define NUM_AVX512X8X2	(2 * CHAR_BIT)
+#define MSK_AVX512X8X2	(NUM_AVX512X8X2 - 1)
+
+static const rte_ymm_t ymm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const rte_ymm_t ymm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const rte_ymm_t ymm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+	},
+};
+
+static const rte_ymm_t ymm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const rte_ymm_t ymm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const rte_ymm_t ymm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const rte_ymm_t ymm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+	},
+};
+
+static const rte_ymm_t ymm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of bytes that are greater than the
+ * input byte.
+ * Single nodes are processed in the same ways as quad range nodes.
+ */
+static __rte_always_inline ymm_t
+calc_addr8(ymm_t index_mask, ymm_t next_input, ymm_t shuffle_input,
+	ymm_t four_32, ymm_t range_base, ymm_t tr_lo, ymm_t tr_hi)
+{
+	ymm_t addr, in, node_type, r, t;
+	ymm_t dfa_msk, dfa_ofs, quad_ofs;
+
+	t = _mm256_xor_si256(index_mask, index_mask);
+	in = _mm256_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm256_andnot_si256(index_mask, tr_lo);
+	addr = _mm256_and_si256(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm256_cmpeq_epi32(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm256_srli_epi32(in, 30);
+	r = _mm256_add_epi8(r, range_base);
+	t = _mm256_srli_epi32(in, 24);
+	r = _mm256_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm256_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	t = _mm256_cmpgt_epi8(in, tr_hi);
+	t = _mm256_lzcnt_epi32(t);
+	t = _mm256_srli_epi32(t, 3);
+	quad_ofs = _mm256_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm256_blendv_epi8(quad_ofs, dfa_ofs, dfa_msk);
+
+	/* calculate address for next transitions. */
+	addr = _mm256_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 8 transitions in parallel.
+ * tr_lo contains low 32 bits for 8 transitions.
+ * tr_hi contains high 32 bits for 8 transitions.
+ * next_input contains up to 4 input bytes for 8 flows.
+ */
+static __rte_always_inline ymm_t
+transition8(ymm_t next_input, const uint64_t *trans, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 8 transitions. */
+	addr = calc_addr8(ymm_index_mask.y, next_input, ymm_shuffle_input.y,
+		ymm_four_32.y, ymm_range_base.y, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 8 transitions at once. */
+	*tr_lo = _mm256_i32gather_epi32(tr, addr, sizeof(trans[0]));
+
+	next_input = _mm256_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 8 transitions at once. */
+	*tr_hi = _mm256_i32gather_epi32(tr + 1, addr, sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 8 flows in parallel.
+ * next_input should contain one input byte for up to 8 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static __rte_always_inline void
+first_trans8(const struct acl_flow_avx512 *flow, ymm_t next_input,
+	__mmask8 msk, ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	const int32_t *tr;
+	ymm_t addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm256_set1_epi32(UINT8_MAX);
+	root = _mm256_set1_epi32(flow->root_index);
+
+	addr = _mm256_and_si256(next_input, addr);
+	addr = _mm256_add_epi32(root, addr);
+
+	/* load lower 32 bits of 8 transitions at once. */
+	*tr_lo = _mm256_mmask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 8 transitions at once. */
+	*tr_hi = _mm256_mmask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return next 4 input bytes for up to 8 flows in parallel.
+ * pdata - 8 pointers to flow input data
+ * mask - mask of active flows.
+ * di - data indexes for these 8 flows.
+ */
+static inline ymm_t
+get_next_bytes_avx512x8(const struct acl_flow_avx512 *flow, __m512i pdata,
+	__mmask8 mask, ymm_t *di, uint32_t bnum)
+{
+	const int32_t *div;
+	ymm_t one, zero;
+	ymm_t inp, t;
+	__m512i p;
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm256_set1_epi32(1);
+	zero = _mm256_xor_si256(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm256_mmask_i32gather_epi32(zero, mask, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm256_mask_add_epi32(*di, mask, *di, one);
+
+	p = _mm512_cvtepu32_epi64(t);
+	p = _mm512_add_epi64(p, pdata);
+
+	/* load input byte(s), either one or four */
+	if (bnum == sizeof(uint8_t))
+		inp = _m512_mask_gather_epi8x8(p, mask);
+	else
+		inp = _mm512_mask_i64gather_epi32(zero, mask, p, NULL,
+			sizeof(uint8_t));
+	return inp;
+}
+
+/*
+ * Start up to 8 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i *pdata, ymm_t *idx, ymm_t *di)
+{
+	uint32_t nm;
+	ymm_t ni;
+	__m512i nd;
+
+	/* load input data pointers for new flows */
+	nm = (1 << num) - 1;
+	nd = _mm512_maskz_loadu_epi64(nm, flow->idata + flow->num_packets);
+
+	/* calculate match indexes of new flows */
+	ni = _mm256_set1_epi32(flow->num_packets);
+	ni = _mm256_add_epi32(ni, ymm_idx_add.y);
+
+	/* merge new and existing flows data */
+	*pdata = _mm512_mask_expand_epi64(*pdata, msk, nd);
+	*idx = _mm256_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm256_maskz_mov_epi32(msk ^ UINT8_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Update flow and result masks based on the number of unprocessed flows.
+ */
+static inline uint32_t
+update_flow_mask8(const struct acl_flow_avx512 *flow, __mmask8 *fmsk,
+	__mmask8 *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
+/*
+ * Process found matches for up to 8 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexed for given flows
+ * tr_lo contains low 32 bits for up to 8 transitions.
+ * tr_hi contains high 32 bits for up to 8 transitions.
+ */
+static inline uint32_t
+match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
+	__mmask8 *rmsk,	__m512i *pdata, ymm_t *di, ymm_t *idx,
+	ymm_t *tr_lo, ymm_t *tr_hi)
+{
+	uint32_t n;
+	ymm_t res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm256_and_si256(tr_lo[0], ymm_index_mask.y);
+
+	/* mask matched transitions to idle (nop) */
+	tr_lo[0] = _mm256_mask_mov_epi32(tr_lo[0], rmsk[0], ymm_trlo_idle.y);
+	tr_hi[0] = _mm256_mask_mov_epi32(tr_hi[0], rmsk[0], ymm_trhi_idle.y);
+
+	/* save found match indexes */
+	_mm256_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask8(flow, fmsk, rmsk);
+	start_flow8(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+
+static inline void
+match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, __mmask8 fm[2],
+	__m512i pdata[2], ymm_t di[2], ymm_t idx[2], ymm_t inp[2],
+	ymm_t tr_lo[2], ymm_t tr_hi[2])
+{
+	uint32_t n[2];
+	__mmask8 rm[2];
+
+	/* check for matches */
+	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
+	rm[1] = _mm256_test_epi32_mask(tr_lo[1], ymm_match_mask.y);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x8(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x8(flow, &fm[1], &rm[1], &pdata[1],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_bytes_avx512x8(flow, pdata[0], rm[0],
+				&di[0], sizeof(uint8_t));
+			first_trans8(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]);
+
+			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
+				ymm_match_mask.y);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_bytes_avx512x8(flow, pdata[1], rm[1],
+				&di[1], sizeof(uint8_t));
+			first_trans8(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]);
+
+			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
+				ymm_match_mask.y);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 16 flows in parallel.
+ * Use two sets of metadata, each serving up to 8 flows.
+ * So in fact we perform search for 2x8 flows.
+ */
+static inline void
+search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
+{
+	__mmask8 fm[2];
+	__m512i pdata[2];
+	ymm_t di[2], idx[2], inp[2], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[1], &idx[1], &di[1]);
+
+	inp[0] = get_next_bytes_avx512x8(flow, pdata[0], UINT8_MAX, &di[0],
+		sizeof(uint8_t));
+	inp[1] = get_next_bytes_avx512x8(flow, pdata[1], UINT8_MAX, &di[1],
+		sizeof(uint8_t));
+
+	first_trans8(flow, inp[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans8(flow, inp[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT8_MAX;
+	fm[1] = UINT8_MAX;
+
+	/* match check */
+	match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		inp[0] = get_next_bytes_avx512x8(flow, pdata[0], fm[0],
+			&di[0], sizeof(uint32_t));
+		inp[1] = get_next_bytes_avx512x8(flow, pdata[1], fm[1],
+			&di[1], sizeof(uint32_t));
+
+		/* main 4B loop: 4 transitions per iteration, one input byte each */
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		inp[0] = transition8(inp[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		inp[1] = transition8(inp[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x8x2(flow, fm, pdata, di, idx, inp,
+			tr_lo, tr_hi);
+	}
+}
+
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline ymm_t
+resolve_match_idx_avx512x8(ymm_t mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm256_slli_epi32(mi, match_log);
+}
+
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline ymm_t
+resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask8 m;
+	ymm_t cp, cr, np, nr, mch;
+
+	const ymm_t zero = _mm256_set1_epi32(0);
+
+	mch = _mm256_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x8(mch);
+
+	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm256_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x8(mch);
+
+		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm256_cmpgt_epi32_mask(cp, np);
+		cr = _mm256_mask_mov_epi32(nr, m, cr);
+		cp = _mm256_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 8) matches for single category
+ */
+static inline void
+resolve_sc_avx512x8(uint32_t result[], const int32_t res[], const int32_t pri[],
+	const uint32_t match[], uint32_t nb_pkt, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	__mmask8 msk;
+	ymm_t cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
+	_mm256_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x8x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t i, j, k, n;
+	const uint32_t *pm;
+	const int32_t *res, *pri;
+	__mmask8 m[2];
+	ymm_t cp[2], cr[2], np[2], nr[2], mch[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
+
+		j = k + CHAR_BIT;
+
+		/* load match indexes for first trie */
+		mch[0] = _mm256_loadu_si256((const ymm_t *)(match + k));
+		mch[1] = _mm256_loadu_si256((const ymm_t *)(match + j));
+
+		mch[0] = resolve_match_idx_avx512x8(mch[0]);
+		mch[1] = resolve_match_idx_avx512x8(mch[1]);
+
+		/* load matches and their priorities for first trie */
+
+		cr[0] = _mm256_i32gather_epi32(res, mch[0], sizeof(res[0]));
+		cr[1] = _mm256_i32gather_epi32(res, mch[1], sizeof(res[0]));
+
+		cp[0] = _mm256_i32gather_epi32(pri, mch[0], sizeof(pri[0]));
+		cp[1] = _mm256_i32gather_epi32(pri, mch[1], sizeof(pri[0]));
+
+		/* select match with highest priority */
+		for (i = 1, pm = match + nb_pkt; i != nb_trie;
+				i++, pm += nb_pkt) {
+
+			mch[0] = _mm256_loadu_si256((const ymm_t *)(pm + k));
+			mch[1] = _mm256_loadu_si256((const ymm_t *)(pm + j));
+
+			mch[0] = resolve_match_idx_avx512x8(mch[0]);
+			mch[1] = resolve_match_idx_avx512x8(mch[1]);
+
+			nr[0] = _mm256_i32gather_epi32(res, mch[0],
+				sizeof(res[0]));
+			nr[1] = _mm256_i32gather_epi32(res, mch[1],
+				sizeof(res[0]));
+
+			np[0] = _mm256_i32gather_epi32(pri, mch[0],
+				sizeof(pri[0]));
+			np[1] = _mm256_i32gather_epi32(pri, mch[1],
+				sizeof(pri[0]));
+
+			m[0] = _mm256_cmpgt_epi32_mask(cp[0], np[0]);
+			m[1] = _mm256_cmpgt_epi32_mask(cp[1], np[1]);
+
+			cr[0] = _mm256_mask_mov_epi32(nr[0], m[0], cr[0]);
+			cr[1] = _mm256_mask_mov_epi32(nr[1], m[1], cr[1]);
+
+			cp[0] = _mm256_mask_mov_epi32(np[0], m[0], cp[0]);
+			cp[1] = _mm256_mask_mov_epi32(np[1], m[1], cp[1]);
+		}
+
+		_mm256_storeu_si256((ymm_t *)(result + k), cr[0]);
+		_mm256_storeu_si256((ymm_t *)(result + j), cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > CHAR_BIT) {
+			resolve_sc_avx512x8(result + k, res, pri, match + k,
+				CHAR_BIT, nb_trie, nb_pkt);
+			k += CHAR_BIT;
+			n -= CHAR_BIT;
+		}
+		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
+
+static inline int
+search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x8x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread
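
To make the SIMD flow above easier to follow, here is a minimal scalar
sketch of what resolve_pri_avx512x8() computes for a single flow. It is
illustrative only: resolve_pri_scalar() is a hypothetical helper, and
match_log is the shift constant defined in acl_run_avx512.c.

static uint32_t
resolve_pri_scalar(const int32_t res[], const int32_t pri[],
	const uint32_t match[], uint32_t nb_trie, uint32_t nb_skip)
{
	uint32_t i, ofs;
	int32_t cr, cp;

	/* match index -> element offset into results/priority arrays */
	ofs = match[0] << match_log;
	cr = res[ofs];
	cp = pri[ofs];

	/* scan remaining tries, keep the highest-priority result;
	 * cp <= np mirrors the cmpgt/mask_mov pair in the vector code */
	for (i = 1; i != nb_trie; i++) {
		ofs = match[i * nb_skip] << match_log;
		if (cp <= pri[ofs]) {
			cr = res[ofs];
			cp = pri[ofs];
		}
	}
	return (uint32_t)cr;
}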

* [dpdk-dev] [PATCH v2 09/12] acl: enhance AVX512 classify implementation
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (7 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 08/12] acl: introduce AVX512 classify implementation Konstantin Ananyev
@ 2020-09-15 16:50   ` " Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 10/12] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
                     ` (2 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add search_avx512x16x2(), which mostly uses 512-bit
registers/instructions and can process up to 32 flows in
parallel. That allows further speedup of rte_acl_classify_avx512()
for bursts of 32+ requests.
Run-time code-path selection is done internally, based
on input burst size, and is totally opaque to the user.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

This patch depends on:
https://patches.dpdk.org/patch/73922/mbox/
to be applied first.
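
As a usage sketch (not part of this patch): since AVX512 is not the
default algorithm, an application opts in per ACL context via the
existing API. enable_avx512_classify() below is a hypothetical helper.

#include <rte_acl.h>

static int
enable_avx512_classify(struct rte_acl_ctx *ctx)
{
	/* returns -ENOTSUP if the platform/build lacks AVX512 support,
	 * in which case the previously set algorithm stays in effect */
	return rte_acl_set_ctx_classify(ctx, RTE_ACL_CLASSIFY_AVX512);
}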

 .../prog_guide/packet_classif_access_ctrl.rst |   9 +
 doc/guides/rel_notes/release_20_11.rst        |   5 +
 lib/librte_acl/acl_run_avx512.c               | 162 ++++++
 lib/librte_acl/acl_run_avx512x16.h            | 526 ++++++++++++++++++
 lib/librte_acl/acl_run_avx512x8.h             | 195 +------
 5 files changed, 709 insertions(+), 188 deletions(-)
 create mode 100644 lib/librte_acl/acl_run_avx512x16.h

diff --git a/doc/guides/prog_guide/packet_classif_access_ctrl.rst b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
index daf03e6d7..f6c64fbd9 100644
--- a/doc/guides/prog_guide/packet_classif_access_ctrl.rst
+++ b/doc/guides/prog_guide/packet_classif_access_ctrl.rst
@@ -379,10 +379,19 @@ There are several implementations of classify algorithm:
 *   **RTE_ACL_CLASSIFY_ALTIVEC**: vector implementation, can process up to 8
     flows in parallel. Requires ALTIVEC support.
 
+*   **RTE_ACL_CLASSIFY_AVX512**: vector implementation, can process up to 32
+    flows in parallel. Requires AVX512 support.
+
 It is purely a runtime decision which method to choose; there is no build-time difference.
 All implementations operate over the same internal RT structures and use similar principles. The main difference is that vector implementations can manually exploit IA SIMD instructions and process several input data flows in parallel.
 At startup the ACL library determines the highest available classify method for the given platform and sets it as the default one. However, the user has the ability to override the default classifier function for a given ACL context, or to perform a particular search using a non-default classify method. In that case it is the user's responsibility to make sure that the given platform supports the selected classify implementation.
 
+.. note::
+
+     Right now ``RTE_ACL_CLASSIFY_AVX512`` is not selected by default
+     (due to possible CPU frequency level changes), but it can be selected
+     at runtime by apps through the use of the ACL API: ``rte_acl_set_ctx_classify``.
+
 Application Programming Interface (API) Usage
 ---------------------------------------------
 
diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
index a9a1b0305..acdd12ef9 100644
--- a/doc/guides/rel_notes/release_20_11.rst
+++ b/doc/guides/rel_notes/release_20_11.rst
@@ -55,6 +55,11 @@ New Features
      Also, make sure to start the actual text at the margin.
      =======================================================
 
+* **Add new AVX512 specific classify algorithm for ACL library.**
+
+  Added new ``RTE_ACL_CLASSIFY_AVX512`` vector implementation,
+  which can process up to 32 flows in parallel. Requires AVX512 support.
+
 
 Removed Items
 -------------
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 353a3c004..60762b7d6 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -4,6 +4,11 @@
 
 #include "acl_run_sse.h"
 
+#define	MASK16_BIT	(sizeof(__mmask16) * CHAR_BIT)
+
+#define NUM_AVX512X16X2	(2 * MASK16_BIT)
+#define MSK_AVX512X16X2	(NUM_AVX512X16X2 - 1)
+
 /*sizeof(uint32_t) << match_log == sizeof(struct rte_acl_match_results)*/
 static const uint32_t match_log = 5;
 
@@ -31,6 +36,36 @@ acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
 	flow->matches = matches;
 }
 
+/*
+ * Update flow and result masks based on the number of unprocessed flows.
+ */
+static inline uint32_t
+update_flow_mask(const struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk)
+{
+	uint32_t i, j, k, m, n;
+
+	fmsk[0] ^= rmsk[0];
+	m = rmsk[0];
+
+	k = __builtin_popcount(m);
+	n = flow->total_packets - flow->num_packets;
+
+	if (n < k) {
+		/* reduce mask */
+		for (i = k - n; i != 0; i--) {
+			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
+			m ^= 1 << j;
+		}
+	} else
+		n = k;
+
+	rmsk[0] = m;
+	fmsk[0] |= rmsk[0];
+
+	return n;
+}
+
 /*
  * Resolve matches for multiple categories (LE 8, use 128b instructions/regs)
  */
@@ -144,13 +179,140 @@ _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 	return v.y;
 }
 
+/*
+ * resolve match index to actual result/priority offset.
+ */
+static inline __m512i
+resolve_match_idx_avx512x16(__m512i mi)
+{
+	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
+		1 << (match_log + 2));
+	return _mm512_slli_epi32(mi, match_log);
+}
+
+/*
+ * Resolve multiple matches for the same flow based on priority.
+ */
+static inline __m512i
+resolve_pri_avx512x16(const int32_t res[], const int32_t pri[],
+	const uint32_t match[], __mmask16 msk, uint32_t nb_trie,
+	uint32_t nb_skip)
+{
+	uint32_t i;
+	const uint32_t *pm;
+	__mmask16 m;
+	__m512i cp, cr, np, nr, mch;
+
+	const __m512i zero = _mm512_set1_epi32(0);
+
+	/* get match indexes */
+	mch = _mm512_maskz_loadu_epi32(msk, match);
+	mch = resolve_match_idx_avx512x16(mch);
+
+	/* read result and priority values for first trie */
+	cr = _mm512_mask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
+	cp = _mm512_mask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
+
+	/*
+	 * read result and priority values for next tries and select one
+	 * with highest priority.
+	 */
+	for (i = 1, pm = match + nb_skip; i != nb_trie;
+			i++, pm += nb_skip) {
+
+		mch = _mm512_maskz_loadu_epi32(msk, pm);
+		mch = resolve_match_idx_avx512x16(mch);
+
+		nr = _mm512_mask_i32gather_epi32(zero, msk, mch, res,
+			sizeof(res[0]));
+		np = _mm512_mask_i32gather_epi32(zero, msk, mch, pri,
+			sizeof(pri[0]));
+
+		m = _mm512_cmpgt_epi32_mask(cp, np);
+		cr = _mm512_mask_mov_epi32(nr, m, cr);
+		cp = _mm512_mask_mov_epi32(np, m, cp);
+	}
+
+	return cr;
+}
+
+/*
+ * Resolve num (<= 16) matches for single category
+ */
+static inline void
+resolve_sc_avx512x16(uint32_t result[], const int32_t res[],
+	const int32_t pri[], const uint32_t match[], uint32_t nb_pkt,
+	uint32_t nb_trie, uint32_t nb_skip)
+{
+	__mmask16 msk;
+	__m512i cr;
+
+	msk = (1 << nb_pkt) - 1;
+	cr = resolve_pri_avx512x16(res, pri, match, msk, nb_trie, nb_skip);
+	_mm512_mask_storeu_epi32(result, msk, cr);
+}
+
+/*
+ * Resolve matches for single category
+ */
+static inline void
+resolve_sc_avx512x16x2(uint32_t result[],
+	const struct rte_acl_match_results pr[], const uint32_t match[],
+	uint32_t nb_pkt, uint32_t nb_trie)
+{
+	uint32_t j, k, n;
+	const int32_t *res, *pri;
+	__m512i cr[2];
+
+	res = (const int32_t *)pr->results;
+	pri = pr->priority;
+
+	for (k = 0; k != (nb_pkt & ~MSK_AVX512X16X2); k += NUM_AVX512X16X2) {
+
+		j = k + MASK16_BIT;
+
+		cr[0] = resolve_pri_avx512x16(res, pri, match + k, UINT16_MAX,
+				nb_trie, nb_pkt);
+		cr[1] = resolve_pri_avx512x16(res, pri, match + j, UINT16_MAX,
+				nb_trie, nb_pkt);
+
+		_mm512_storeu_si512(result + k, cr[0]);
+		_mm512_storeu_si512(result + j, cr[1]);
+	}
+
+	n = nb_pkt - k;
+	if (n != 0) {
+		if (n > MASK16_BIT) {
+			resolve_sc_avx512x16(result + k, res, pri, match + k,
+				MASK16_BIT, nb_trie, nb_pkt);
+			k += MASK16_BIT;
+			n -= MASK16_BIT;
+		}
+		resolve_sc_avx512x16(result + k, res, pri, match + k, n,
+				nb_trie, nb_pkt);
+	}
+}
 
 #include "acl_run_avx512x8.h"
+#include "acl_run_avx512x16.h"
 
 int
 rte_acl_classify_avx512(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t num, uint32_t categories)
 {
+	const uint32_t max_iter = MAX_SEARCHES_AVX16 * MAX_SEARCHES_AVX16;
+
+	/* split huge lookups (gt 256) into a series of fixed-size ones */
+	while (num > max_iter) {
+		search_avx512x16x2(ctx, data, results, max_iter, categories);
+		data += max_iter;
+		results += max_iter * categories;
+		num -= max_iter;
+	}
+
+	/* select classify method based on number of remaining requests */
+	if (num >= 2 * MAX_SEARCHES_AVX16)
+		return search_avx512x16x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_AVX16)
 		return search_avx512x8x2(ctx, data, results, num, categories);
 	if (num >= MAX_SEARCHES_SSE8)
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
new file mode 100644
index 000000000..45b0b4db6
--- /dev/null
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -0,0 +1,526 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2020 Intel Corporation
+ */
+
+static const __rte_x86_zmm_t zmm_match_mask = {
+	.u32 = {
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+		RTE_ACL_NODE_MATCH,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_index_mask = {
+	.u32 = {
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+		RTE_ACL_NODE_INDEX,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trlo_idle = {
+	.u32 = {
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+		RTE_ACL_IDLE_NODE,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_trhi_idle = {
+	.u32 = {
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+		0, 0, 0, 0,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_shuffle_input = {
+	.u32 = {
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+		0x00000000, 0x04040404, 0x08080808, 0x0c0c0c0c,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_four_32 = {
+	.u32 = {
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+		4, 4, 4, 4,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_idx_add = {
+	.u32 = {
+		0, 1, 2, 3,
+		4, 5, 6, 7,
+		8, 9, 10, 11,
+		12, 13, 14, 15,
+	},
+};
+
+static const __rte_x86_zmm_t zmm_range_base = {
+	.u32 = {
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+		0xffffff00, 0xffffff04, 0xffffff08, 0xffffff0c,
+	},
+};
+
+/*
+ * Calculate the address of the next transition for
+ * all types of nodes. Note that only DFA nodes and range
+ * nodes actually transition to another node. Match
+ * nodes are not supposed to be encountered here.
+ * For quad range nodes:
+ * Calculate the number of range boundaries that are less than the
+ * input value. Range boundaries for each node are in signed 8 bit,
+ * ordered from -128 to 127.
+ * This is effectively a popcnt of boundary bytes that are less than
+ * the input byte.
+ * Single nodes are processed in the same way as quad range nodes.
+ */
+static __rte_always_inline __m512i
+calc_addr16(__m512i index_mask, __m512i next_input, __m512i shuffle_input,
+	__m512i four_32, __m512i range_base, __m512i tr_lo, __m512i tr_hi)
+{
+	__mmask64 qm;
+	__mmask16 dfa_msk;
+	__m512i addr, in, node_type, r, t;
+	__m512i dfa_ofs, quad_ofs;
+
+	t = _mm512_xor_si512(index_mask, index_mask);
+	in = _mm512_shuffle_epi8(next_input, shuffle_input);
+
+	/* Calc node type and node addr */
+	node_type = _mm512_andnot_si512(index_mask, tr_lo);
+	addr = _mm512_and_si512(index_mask, tr_lo);
+
+	/* mask for DFA type(0) nodes */
+	dfa_msk = _mm512_cmpeq_epi32_mask(node_type, t);
+
+	/* DFA calculations. */
+	r = _mm512_srli_epi32(in, 30);
+	r = _mm512_add_epi8(r, range_base);
+	t = _mm512_srli_epi32(in, 24);
+	r = _mm512_shuffle_epi8(tr_hi, r);
+
+	dfa_ofs = _mm512_sub_epi32(t, r);
+
+	/* QUAD/SINGLE calculations. */
+	qm = _mm512_cmpgt_epi8_mask(in, tr_hi);
+	t = _mm512_maskz_set1_epi8(qm, (uint8_t)UINT8_MAX);
+	t = _mm512_lzcnt_epi32(t);
+	t = _mm512_srli_epi32(t, 3);
+	quad_ofs = _mm512_sub_epi32(four_32, t);
+
+	/* blend DFA and QUAD/SINGLE. */
+	t = _mm512_mask_mov_epi32(quad_ofs, dfa_msk, dfa_ofs);
+
+	/* calculate address for next transitions. */
+	addr = _mm512_add_epi32(addr, t);
+	return addr;
+}
+
+/*
+ * Process 16 transitions in parallel.
+ * tr_lo contains low 32 bits for 16 transitions.
+ * tr_hi contains high 32 bits for 16 transitions.
+ * next_input contains up to 4 input bytes for 16 flows.
+ */
+static __rte_always_inline __m512i
+transition16(__m512i next_input, const uint64_t *trans, __m512i *tr_lo,
+	__m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr;
+
+	tr = (const int32_t *)(uintptr_t)trans;
+
+	/* Calculate the address (array index) for all 16 transitions. */
+	addr = calc_addr16(zmm_index_mask.z, next_input, zmm_shuffle_input.z,
+		zmm_four_32.z, zmm_range_base.z, *tr_lo, *tr_hi);
+
+	/* load lower 32 bits of 16 transitions at once. */
+	*tr_lo = _mm512_i32gather_epi32(addr, tr, sizeof(trans[0]));
+
+	next_input = _mm512_srli_epi32(next_input, CHAR_BIT);
+
+	/* load high 32 bits of 16 transitions at once. */
+	*tr_hi = _mm512_i32gather_epi32(addr, (tr + 1), sizeof(trans[0]));
+
+	return next_input;
+}
+
+/*
+ * Execute first transition for up to 16 flows in parallel.
+ * next_input should contain one input byte for up to 16 flows.
+ * msk - mask of active flows.
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static __rte_always_inline void
+first_trans16(const struct acl_flow_avx512 *flow, __m512i next_input,
+	__mmask16 msk, __m512i *tr_lo, __m512i *tr_hi)
+{
+	const int32_t *tr;
+	__m512i addr, root;
+
+	tr = (const int32_t *)(uintptr_t)flow->trans;
+
+	addr = _mm512_set1_epi32(UINT8_MAX);
+	root = _mm512_set1_epi32(flow->root_index);
+
+	addr = _mm512_and_si512(next_input, addr);
+	addr = _mm512_add_epi32(root, addr);
+
+	/* load lower 32 bits of 16 transitions at once. */
+	*tr_lo = _mm512_mask_i32gather_epi32(*tr_lo, msk, addr, tr,
+		sizeof(flow->trans[0]));
+
+	/* load high 32 bits of 16 transitions at once. */
+	*tr_hi = _mm512_mask_i32gather_epi32(*tr_hi, msk, addr, (tr + 1),
+		sizeof(flow->trans[0]));
+}
+
+/*
+ * Load and return the next 1 or 4 input bytes for up to 16 flows in parallel.
+ * pdata - 2x8 pointers to flow input data
+ * msk - mask of active flows.
+ * di - data indexes for these 16 flows.
+ */
+static inline __m512i
+get_next_bytes_avx512x16(const struct acl_flow_avx512 *flow, __m512i pdata[2],
+	uint32_t msk, __m512i *di, uint32_t bnum)
+{
+	const int32_t *div;
+	__m512i one, zero, t, p[2];
+	ymm_t inp[2];
+
+	static const __rte_x86_zmm_t zmm_pminp = {
+		.u32 = {
+			0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+			0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17,
+		},
+	};
+
+	const __mmask16 pmidx_msk = 0x5555;
+
+	static const __rte_x86_zmm_t zmm_pmidx[2] = {
+		[0] = {
+			.u32 = {
+				0, 0, 1, 0, 2, 0, 3, 0,
+				4, 0, 5, 0, 6, 0, 7, 0,
+			},
+		},
+		[1] = {
+			.u32 = {
+				8, 0, 9, 0, 10, 0, 11, 0,
+				12, 0, 13, 0, 14, 0, 15, 0,
+			},
+		},
+	};
+
+	div = (const int32_t *)flow->data_index;
+
+	one = _mm512_set1_epi32(1);
+	zero = _mm512_xor_si512(one, one);
+
+	/* load data offsets for given indexes */
+	t = _mm512_mask_i32gather_epi32(zero, msk, *di, div, sizeof(div[0]));
+
+	/* increment data indexes */
+	*di = _mm512_mask_add_epi32(*di, msk, *di, one);
+
+	/*
+	 * unsigned expand 32-bit indexes to 64-bit
+	 * (for later pointer arithmetic), i.e:
+	 * for (i = 0; i != 16; i++)
+	 *   p[i/8].u64[i%8] = (uint64_t)t.u32[i];
+	 */
+	p[0] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[0].z, t);
+	p[1] = _mm512_maskz_permutexvar_epi32(pmidx_msk, zmm_pmidx[1].z, t);
+
+	p[0] = _mm512_add_epi64(p[0], pdata[0]);
+	p[1] = _mm512_add_epi64(p[1], pdata[1]);
+
+	/* load input byte(s), either one or four */
+	if (bnum == sizeof(uint8_t)) {
+		inp[0] = _m512_mask_gather_epi8x8(p[0], (msk & UINT8_MAX));
+		inp[1] = _m512_mask_gather_epi8x8(p[1], (msk >> CHAR_BIT));
+	} else {
+		inp[0] = _mm512_mask_i64gather_epi32(
+				_mm512_castsi512_si256(zero), (msk & UINT8_MAX),
+				p[0], NULL, sizeof(uint8_t));
+		inp[1] = _mm512_mask_i64gather_epi32(
+				_mm512_castsi512_si256(zero), (msk >> CHAR_BIT),
+				p[1], NULL, sizeof(uint8_t));
+	}
+
+	/* squeeze input into one 512-bit register */
+	return _mm512_permutex2var_epi32(_mm512_castsi256_si512(inp[0]),
+			zmm_pminp.z, _mm512_castsi256_si512(inp[1]));
+}
+
+/*
+ * Start up to 16 new flows.
+ * num - number of flows to start
+ * msk - mask of new flows.
+ * pdata - pointers to flow input data
+ * idx - match indexes for given flows
+ * di - data indexes for these flows.
+ */
+static inline void
+start_flow16(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
+	__m512i pdata[2], __m512i *idx, __m512i *di)
+{
+	uint32_t n, nm[2];
+	__m512i ni, nd[2];
+
+	/* load input data pointers for new flows */
+	n = __builtin_popcount(msk & UINT8_MAX);
+	nm[0] = (1 << n) - 1;
+	nm[1] = (1 << (num - n)) - 1;
+
+	nd[0] = _mm512_maskz_loadu_epi64(nm[0],
+		flow->idata + flow->num_packets);
+	nd[1] = _mm512_maskz_loadu_epi64(nm[1],
+		flow->idata + flow->num_packets + n);
+
+	/* calculate match indexes of new flows */
+	ni = _mm512_set1_epi32(flow->num_packets);
+	ni = _mm512_add_epi32(ni, zmm_idx_add.z);
+
+	/* merge new and existing flows data */
+	pdata[0] = _mm512_mask_expand_epi64(pdata[0], (msk & UINT8_MAX), nd[0]);
+	pdata[1] = _mm512_mask_expand_epi64(pdata[1], (msk >> CHAR_BIT), nd[1]);
+
+	/* update match and data indexes */
+	*idx = _mm512_mask_expand_epi32(*idx, msk, ni);
+	*di = _mm512_maskz_mov_epi32(msk ^ UINT16_MAX, *di);
+
+	flow->num_packets += num;
+}
+
+/*
+ * Process found matches for up to 16 flows.
+ * fmsk - mask of active flows
+ * rmsk - mask of found matches
+ * pdata - pointers to flow input data
+ * di - data indexes for these flows
+ * idx - match indexes for given flows
+ * tr_lo contains low 32 bits for up to 16 transitions.
+ * tr_hi contains high 32 bits for up to 16 transitions.
+ */
+static inline uint32_t
+match_process_avx512x16(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk, __m512i pdata[2], __m512i *di, __m512i *idx,
+	__m512i *tr_lo, __m512i *tr_hi)
+{
+	uint32_t n;
+	__m512i res;
+
+	if (rmsk[0] == 0)
+		return 0;
+
+	/* extract match indexes */
+	res = _mm512_and_si512(tr_lo[0], zmm_index_mask.z);
+
+	/* mask matched transitions to nop */
+	tr_lo[0] = _mm512_mask_mov_epi32(tr_lo[0], rmsk[0], zmm_trlo_idle.z);
+	tr_hi[0] = _mm512_mask_mov_epi32(tr_hi[0], rmsk[0], zmm_trhi_idle.z);
+
+	/* save found match indexes */
+	_mm512_mask_i32scatter_epi32(flow->matches, rmsk[0],
+		idx[0], res, sizeof(flow->matches[0]));
+
+	/* update masks and start new flows for matches */
+	n = update_flow_mask(flow, fmsk, rmsk);
+	start_flow16(flow, n, rmsk[0], pdata, idx, di);
+
+	return n;
+}
+
+/*
+ * Test for matches for up to 32 (2x16) flows at once;
+ * if matches exist, process them and start new flows.
+ */
+static inline void
+match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
+	__m512i pdata[4], __m512i di[2], __m512i idx[2], __m512i inp[2],
+	__m512i tr_lo[2], __m512i tr_hi[2])
+{
+	uint32_t n[2];
+	uint32_t rm[2];
+
+	/* check for matches */
+	rm[0] = _mm512_test_epi32_mask(tr_lo[0], zmm_match_mask.z);
+	rm[1] = _mm512_test_epi32_mask(tr_lo[1], zmm_match_mask.z);
+
+	/* till unprocessed matches exist */
+	while ((rm[0] | rm[1]) != 0) {
+
+		/* process matches and start new flows */
+		n[0] = match_process_avx512x16(flow, &fm[0], &rm[0], &pdata[0],
+			&di[0], &idx[0], &tr_lo[0], &tr_hi[0]);
+		n[1] = match_process_avx512x16(flow, &fm[1], &rm[1], &pdata[2],
+			&di[1], &idx[1], &tr_lo[1], &tr_hi[1]);
+
+		/* execute first transition for new flows, if any */
+
+		if (n[0] != 0) {
+			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
+				rm[0], &di[0], sizeof(uint8_t));
+			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
+				&tr_hi[0]);
+			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
+				zmm_match_mask.z);
+		}
+
+		if (n[1] != 0) {
+			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
+				rm[1], &di[1], sizeof(uint8_t));
+			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
+				&tr_hi[1]);
+			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
+				zmm_match_mask.z);
+		}
+	}
+}
+
+/*
+ * Perform search for up to 32 flows in parallel.
+ * Use two sets of metadata, each serving 16 flows max.
+ * So in fact we perform search for 2x16 flows.
+ */
+static inline void
+search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
+{
+	uint32_t fm[2];
+	__m512i di[2], idx[2], in[2], pdata[4], tr_lo[2], tr_hi[2];
+
+	/* first 1B load */
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[0], &idx[0], &di[0]);
+	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
+
+	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
+		sizeof(uint8_t));
+	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
+		sizeof(uint8_t));
+
+	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
+	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
+
+	fm[0] = UINT16_MAX;
+	fm[1] = UINT16_MAX;
+
+	/* match check */
+	match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+		tr_lo, tr_hi);
+
+	while ((fm[0] | fm[1]) != 0) {
+
+		/* load next 4B */
+
+		in[0] = get_next_bytes_avx512x16(flow, &pdata[0], fm[0],
+			&di[0], sizeof(uint32_t));
+		in[1] = get_next_bytes_avx512x16(flow, &pdata[2], fm[1],
+			&di[1], sizeof(uint32_t));
+
+		/* main 4B loop */
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		in[0] = transition16(in[0], flow->trans, &tr_lo[0], &tr_hi[0]);
+		in[1] = transition16(in[1], flow->trans, &tr_lo[1], &tr_hi[1]);
+
+		/* check for matches */
+		match_check_process_avx512x16x2(flow, fm, pdata, di, idx, in,
+			tr_lo, tr_hi);
+	}
+}
+
+static inline int
+search_avx512x16x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
+	uint32_t *results, uint32_t total_packets, uint32_t categories)
+{
+	uint32_t i, *pm;
+	const struct rte_acl_match_results *pr;
+	struct acl_flow_avx512 flow;
+	uint32_t match[ctx->num_tries * total_packets];
+
+	for (i = 0, pm = match; i != ctx->num_tries; i++, pm += total_packets) {
+
+		/* setup for next trie */
+		acl_set_flow_avx512(&flow, ctx, i, data, pm, total_packets);
+
+		/* process the trie */
+		search_trie_avx512x16x2(&flow);
+	}
+
+	/* resolve matches */
+	pr = (const struct rte_acl_match_results *)
+		(ctx->trans_table + ctx->match_index);
+
+	if (categories == 1)
+		resolve_sc_avx512x16x2(results, pr, match, total_packets,
+			ctx->num_tries);
+	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
+		resolve_mcle8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+	else
+		resolve_mcgt8_avx512x1(results, pr, match, total_packets,
+			categories, ctx->num_tries);
+
+	return 0;
+}
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
index 66fc26b26..82171e8e0 100644
--- a/lib/librte_acl/acl_run_avx512x8.h
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -260,36 +260,6 @@ start_flow8(struct acl_flow_avx512 *flow, uint32_t num, uint32_t msk,
 	flow->num_packets += num;
 }
 
-/*
- * Update flow and result masks based on the number of unprocessed flows.
- */
-static inline uint32_t
-update_flow_mask8(const struct acl_flow_avx512 *flow, __mmask8 *fmsk,
-	__mmask8 *rmsk)
-{
-	uint32_t i, j, k, m, n;
-
-	fmsk[0] ^= rmsk[0];
-	m = rmsk[0];
-
-	k = __builtin_popcount(m);
-	n = flow->total_packets - flow->num_packets;
-
-	if (n < k) {
-		/* reduce mask */
-		for (i = k - n; i != 0; i--) {
-			j = sizeof(m) * CHAR_BIT - 1 - __builtin_clz(m);
-			m ^= 1 << j;
-		}
-	} else
-		n = k;
-
-	rmsk[0] = m;
-	fmsk[0] |= rmsk[0];
-
-	return n;
-}
-
 /*
  * Process found matches for up to 8 flows.
  * fmsk - mask of active flows
@@ -301,8 +271,8 @@ update_flow_mask8(const struct acl_flow_avx512 *flow, __mmask8 *fmsk,
  * tr_hi contains high 32 bits for up to 8 transitions.
  */
 static inline uint32_t
-match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
-	__mmask8 *rmsk,	__m512i *pdata, ymm_t *di, ymm_t *idx,
+match_process_avx512x8(struct acl_flow_avx512 *flow, uint32_t *fmsk,
+	uint32_t *rmsk,	__m512i *pdata, ymm_t *di, ymm_t *idx,
 	ymm_t *tr_lo, ymm_t *tr_hi)
 {
 	uint32_t n;
@@ -323,7 +293,7 @@ match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
 		idx[0], res, sizeof(flow->matches[0]));
 
 	/* update masks and start new flows for matches */
-	n = update_flow_mask8(flow, fmsk, rmsk);
+	n = update_flow_mask(flow, fmsk, rmsk);
 	start_flow8(flow, n, rmsk[0], pdata, idx, di);
 
 	return n;
@@ -331,12 +301,12 @@ match_process_avx512x8(struct acl_flow_avx512 *flow, __mmask8 *fmsk,
 
 
 static inline void
-match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, __mmask8 fm[2],
+match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 	__m512i pdata[2], ymm_t di[2], ymm_t idx[2], ymm_t inp[2],
 	ymm_t tr_lo[2], ymm_t tr_hi[2])
 {
 	uint32_t n[2];
-	__mmask8 rm[2];
+	uint32_t rm[2];
 
 	/* check for matches */
 	rm[0] = _mm256_test_epi32_mask(tr_lo[0], ymm_match_mask.y);
@@ -381,7 +351,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, __mmask8 fm[2],
 static inline void
 search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
 {
-	__mmask8 fm[2];
+	uint32_t fm[2];
 	__m512i pdata[2];
 	ymm_t di[2], idx[2], inp[2], tr_lo[2], tr_hi[2];
 
@@ -433,157 +403,6 @@ search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
 	}
 }
 
-/*
- * resolve match index to actual result/priority offset.
- */
-static inline ymm_t
-resolve_match_idx_avx512x8(ymm_t mi)
-{
-	RTE_BUILD_BUG_ON(sizeof(struct rte_acl_match_results) !=
-		1 << (match_log + 2));
-	return _mm256_slli_epi32(mi, match_log);
-}
-
-
-/*
- * Resolve multiple matches for the same flow based on priority.
- */
-static inline ymm_t
-resolve_pri_avx512x8(const int32_t res[], const int32_t pri[],
-	const uint32_t match[], __mmask8 msk, uint32_t nb_trie,
-	uint32_t nb_skip)
-{
-	uint32_t i;
-	const uint32_t *pm;
-	__mmask8 m;
-	ymm_t cp, cr, np, nr, mch;
-
-	const ymm_t zero = _mm256_set1_epi32(0);
-
-	mch = _mm256_maskz_loadu_epi32(msk, match);
-	mch = resolve_match_idx_avx512x8(mch);
-
-	cr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res, sizeof(res[0]));
-	cp = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri, sizeof(pri[0]));
-
-	for (i = 1, pm = match + nb_skip; i != nb_trie;
-			i++, pm += nb_skip) {
-
-		mch = _mm256_maskz_loadu_epi32(msk, pm);
-		mch = resolve_match_idx_avx512x8(mch);
-
-		nr = _mm256_mmask_i32gather_epi32(zero, msk, mch, res,
-			sizeof(res[0]));
-		np = _mm256_mmask_i32gather_epi32(zero, msk, mch, pri,
-			sizeof(pri[0]));
-
-		m = _mm256_cmpgt_epi32_mask(cp, np);
-		cr = _mm256_mask_mov_epi32(nr, m, cr);
-		cp = _mm256_mask_mov_epi32(np, m, cp);
-	}
-
-	return cr;
-}
-
-/*
- * Resolve num (<= 8) matches for single category
- */
-static inline void
-resolve_sc_avx512x8(uint32_t result[], const int32_t res[], const int32_t pri[],
-	const uint32_t match[], uint32_t nb_pkt, uint32_t nb_trie,
-	uint32_t nb_skip)
-{
-	__mmask8 msk;
-	ymm_t cr;
-
-	msk = (1 << nb_pkt) - 1;
-	cr = resolve_pri_avx512x8(res, pri, match, msk, nb_trie, nb_skip);
-	_mm256_mask_storeu_epi32(result, msk, cr);
-}
-
-/*
- * Resolve matches for single category
- */
-static inline void
-resolve_sc_avx512x8x2(uint32_t result[],
-	const struct rte_acl_match_results pr[], const uint32_t match[],
-	uint32_t nb_pkt, uint32_t nb_trie)
-{
-	uint32_t i, j, k, n;
-	const uint32_t *pm;
-	const int32_t *res, *pri;
-	__mmask8 m[2];
-	ymm_t cp[2], cr[2], np[2], nr[2], mch[2];
-
-	res = (const int32_t *)pr->results;
-	pri = pr->priority;
-
-	for (k = 0; k != (nb_pkt & ~MSK_AVX512X8X2); k += NUM_AVX512X8X2) {
-
-		j = k + CHAR_BIT;
-
-		/* load match indexes for first trie */
-		mch[0] = _mm256_loadu_si256((const ymm_t *)(match + k));
-		mch[1] = _mm256_loadu_si256((const ymm_t *)(match + j));
-
-		mch[0] = resolve_match_idx_avx512x8(mch[0]);
-		mch[1] = resolve_match_idx_avx512x8(mch[1]);
-
-		/* load matches and their priorities for first trie */
-
-		cr[0] = _mm256_i32gather_epi32(res, mch[0], sizeof(res[0]));
-		cr[1] = _mm256_i32gather_epi32(res, mch[1], sizeof(res[0]));
-
-		cp[0] = _mm256_i32gather_epi32(pri, mch[0], sizeof(pri[0]));
-		cp[1] = _mm256_i32gather_epi32(pri, mch[1], sizeof(pri[0]));
-
-		/* select match with highest priority */
-		for (i = 1, pm = match + nb_pkt; i != nb_trie;
-				i++, pm += nb_pkt) {
-
-			mch[0] = _mm256_loadu_si256((const ymm_t *)(pm + k));
-			mch[1] = _mm256_loadu_si256((const ymm_t *)(pm + j));
-
-			mch[0] = resolve_match_idx_avx512x8(mch[0]);
-			mch[1] = resolve_match_idx_avx512x8(mch[1]);
-
-			nr[0] = _mm256_i32gather_epi32(res, mch[0],
-				sizeof(res[0]));
-			nr[1] = _mm256_i32gather_epi32(res, mch[1],
-				sizeof(res[0]));
-
-			np[0] = _mm256_i32gather_epi32(pri, mch[0],
-				sizeof(pri[0]));
-			np[1] = _mm256_i32gather_epi32(pri, mch[1],
-				sizeof(pri[0]));
-
-			m[0] = _mm256_cmpgt_epi32_mask(cp[0], np[0]);
-			m[1] = _mm256_cmpgt_epi32_mask(cp[1], np[1]);
-
-			cr[0] = _mm256_mask_mov_epi32(nr[0], m[0], cr[0]);
-			cr[1] = _mm256_mask_mov_epi32(nr[1], m[1], cr[1]);
-
-			cp[0] = _mm256_mask_mov_epi32(np[0], m[0], cp[0]);
-			cp[1] = _mm256_mask_mov_epi32(np[1], m[1], cp[1]);
-		}
-
-		_mm256_storeu_si256((ymm_t *)(result + k), cr[0]);
-		_mm256_storeu_si256((ymm_t *)(result + j), cr[1]);
-	}
-
-	n = nb_pkt - k;
-	if (n != 0) {
-		if (n > CHAR_BIT) {
-			resolve_sc_avx512x8(result + k, res, pri, match + k,
-				CHAR_BIT, nb_trie, nb_pkt);
-			k += CHAR_BIT;
-			n -= CHAR_BIT;
-		}
-		resolve_sc_avx512x8(result + k, res, pri, match + k, n,
-				nb_trie, nb_pkt);
-	}
-}
-
 static inline int
 search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 	uint32_t *results, uint32_t total_packets, uint32_t categories)
@@ -607,7 +426,7 @@ search_avx512x8x2(const struct rte_acl_ctx *ctx, const uint8_t **data,
 		(ctx->trans_table + ctx->match_index);
 
 	if (categories == 1)
-		resolve_sc_avx512x8x2(results, pr, match, total_packets,
+		resolve_sc_avx512x16x2(results, pr, match, total_packets,
 			ctx->num_tries);
 	else if (categories <= RTE_ACL_MAX_CATEGORIES / 2)
 		resolve_mcle8_avx512x1(results, pr, match, total_packets,
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread
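
To clarify the QUAD/SINGLE offset computation performed by calc_addr16()
above, here is a scalar sketch for a single transition. quad_ofs_scalar()
is a hypothetical illustration; the signed byte compare mirrors
_mm512_cmpgt_epi8_mask().

#include <stdint.h>

static uint32_t
quad_ofs_scalar(uint32_t tr_hi, uint8_t input)
{
	uint32_t i, n;

	/* tr_hi holds four signed 8-bit range boundaries,
	 * ordered from the low byte to the high byte */
	n = 0;
	for (i = 0; i != 4; i++) {
		int8_t b = (int8_t)(tr_hi >> (i * 8));
		n += ((int8_t)input > b); /* count boundaries below input */
	}
	return n; /* 0..4, added to the node address */
}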

* [dpdk-dev] [PATCH v2 10/12] acl: for AVX512 classify use 4B load whenever possible
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (8 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 09/12] acl: enhance " Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 11/12] test/acl: add AVX512 classify support Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 12/12] app/acl: " Konstantin Ananyev
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

With the current ACL implementation, the first field in the rule
definition always has to be one byte long, though for optimising the
classify implementation it might be useful to be able to use 4B reads
(as we do for the rest of the fields).
So at build phase, check the user-provided field definitions to
determine whether it is safe to do 4B loads for the first ACL field.
Then at run time this information can be used to choose classify
behavior.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_acl/acl.h               |  1 +
 lib/librte_acl/acl_bld.c           | 34 ++++++++++++++++++++++++++++++
 lib/librte_acl/acl_run_avx512.c    |  7 ++++++
 lib/librte_acl/acl_run_avx512x16.h |  8 +++----
 lib/librte_acl/acl_run_avx512x8.h  |  8 +++----
 lib/librte_acl/rte_acl.c           |  1 +
 6 files changed, 51 insertions(+), 8 deletions(-)

diff --git a/lib/librte_acl/acl.h b/lib/librte_acl/acl.h
index 3f0719f33..493dec2a2 100644
--- a/lib/librte_acl/acl.h
+++ b/lib/librte_acl/acl.h
@@ -169,6 +169,7 @@ struct rte_acl_ctx {
 	int32_t             socket_id;
 	/** Socket ID to allocate memory from. */
 	enum rte_acl_classify_alg alg;
+	uint32_t           first_load_sz;
 	void               *rules;
 	uint32_t            max_rules;
 	uint32_t            rule_sz;
diff --git a/lib/librte_acl/acl_bld.c b/lib/librte_acl/acl_bld.c
index d1f920b09..da10864cd 100644
--- a/lib/librte_acl/acl_bld.c
+++ b/lib/librte_acl/acl_bld.c
@@ -1581,6 +1581,37 @@ acl_check_bld_param(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 	return 0;
 }
 
+/*
+ * With the current ACL implementation the first field in the rule
+ * definition always has to be one byte long, though for optimising the
+ * *classify* implementation it might be useful to be able to use 4B
+ * reads (as we do for the rest of the fields).
+ * This function checks the input config to determine whether it is safe
+ * to do 4B loads for the first ACL field. For that we need to make sure
+ * that the first field in our rule definition doesn't have the biggest
+ * offset, i.e. we still have other fields located after the first one.
+ * Conversely, if the first field has the largest offset, it means the
+ * first field can occupy the very last byte of the input data buffer,
+ * and we have to do a single-byte load for it.
+ */
+static uint32_t
+get_first_load_size(const struct rte_acl_config *cfg)
+{
+	uint32_t i, max_ofs, ofs;
+
+	ofs = 0;
+	max_ofs = 0;
+
+	for (i = 0; i != cfg->num_fields; i++) {
+		if (cfg->defs[i].field_index == 0)
+			ofs = cfg->defs[i].offset;
+		else if (max_ofs < cfg->defs[i].offset)
+			max_ofs = cfg->defs[i].offset;
+	}
+
+	return (ofs < max_ofs) ? sizeof(uint32_t) : sizeof(uint8_t);
+}
+
 int
 rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 {
@@ -1618,6 +1649,9 @@ rte_acl_build(struct rte_acl_ctx *ctx, const struct rte_acl_config *cfg)
 				/* set data indexes. */
 				acl_set_data_indexes(ctx);
 
+				/* determine whether we can always do 4B loads */
+				ctx->first_load_sz = get_first_load_size(cfg);
+
 				/* copy in build config. */
 				ctx->config = *cfg;
 			}
diff --git a/lib/librte_acl/acl_run_avx512.c b/lib/librte_acl/acl_run_avx512.c
index 60762b7d6..51bfa6a3b 100644
--- a/lib/librte_acl/acl_run_avx512.c
+++ b/lib/librte_acl/acl_run_avx512.c
@@ -16,6 +16,7 @@ struct acl_flow_avx512 {
 	uint32_t num_packets;       /* number of packets processed */
 	uint32_t total_packets;     /* max number of packets to process */
 	uint32_t root_index;        /* current root index */
+	uint32_t first_load_sz;     /* first load size for new packet */
 	const uint64_t *trans;      /* transition table */
 	const uint32_t *data_index; /* input data indexes */
 	const uint8_t **idata;      /* input data */
@@ -29,6 +30,7 @@ acl_set_flow_avx512(struct acl_flow_avx512 *flow, const struct rte_acl_ctx *ctx,
 {
 	flow->num_packets = 0;
 	flow->total_packets = total_packets;
+	flow->first_load_sz = ctx->first_load_sz;
 	flow->root_index = ctx->trie[trie].root_index;
 	flow->trans = ctx->trans_table;
 	flow->data_index = ctx->trie[trie].data_index;
@@ -155,6 +157,11 @@ resolve_mcgt8_avx512x1(uint32_t result[],
 	}
 }
 
+/*
+ * Unfortunately the current AVX512 ISA doesn't provide the ability to
+ * gather-load byte quantities, so we have to mimic it in SW
+ * by doing 8x1B scalar loads.
+ */
 static inline ymm_t
 _m512_mask_gather_epi8x8(__m512i pdata, __mmask8 mask)
 {
diff --git a/lib/librte_acl/acl_run_avx512x16.h b/lib/librte_acl/acl_run_avx512x16.h
index 45b0b4db6..df5f6135f 100644
--- a/lib/librte_acl/acl_run_avx512x16.h
+++ b/lib/librte_acl/acl_run_avx512x16.h
@@ -413,7 +413,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[0] != 0) {
 			inp[0] = get_next_bytes_avx512x16(flow, &pdata[0],
-				rm[0], &di[0], sizeof(uint8_t));
+				rm[0], &di[0], flow->first_load_sz);
 			first_trans16(flow, inp[0], rm[0], &tr_lo[0],
 				&tr_hi[0]);
 			rm[0] = _mm512_test_epi32_mask(tr_lo[0],
@@ -422,7 +422,7 @@ match_check_process_avx512x16x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[1] != 0) {
 			inp[1] = get_next_bytes_avx512x16(flow, &pdata[2],
-				rm[1], &di[1], sizeof(uint8_t));
+				rm[1], &di[1], flow->first_load_sz);
 			first_trans16(flow, inp[1], rm[1], &tr_lo[1],
 				&tr_hi[1]);
 			rm[1] = _mm512_test_epi32_mask(tr_lo[1],
@@ -447,9 +447,9 @@ search_trie_avx512x16x2(struct acl_flow_avx512 *flow)
 	start_flow16(flow, MASK16_BIT, UINT16_MAX, &pdata[2], &idx[1], &di[1]);
 
 	in[0] = get_next_bytes_avx512x16(flow, &pdata[0], UINT16_MAX, &di[0],
-		sizeof(uint8_t));
+			flow->first_load_sz);
 	in[1] = get_next_bytes_avx512x16(flow, &pdata[2], UINT16_MAX, &di[1],
-		sizeof(uint8_t));
+			flow->first_load_sz);
 
 	first_trans16(flow, in[0], UINT16_MAX, &tr_lo[0], &tr_hi[0]);
 	first_trans16(flow, in[1], UINT16_MAX, &tr_lo[1], &tr_hi[1]);
diff --git a/lib/librte_acl/acl_run_avx512x8.h b/lib/librte_acl/acl_run_avx512x8.h
index 82171e8e0..777451973 100644
--- a/lib/librte_acl/acl_run_avx512x8.h
+++ b/lib/librte_acl/acl_run_avx512x8.h
@@ -325,7 +325,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[0] != 0) {
 			inp[0] = get_next_bytes_avx512x8(flow, pdata[0], rm[0],
-				&di[0], sizeof(uint8_t));
+				&di[0], flow->first_load_sz);
 			first_trans8(flow, inp[0], rm[0], &tr_lo[0], &tr_hi[0]);
 
 			rm[0] = _mm256_test_epi32_mask(tr_lo[0],
@@ -334,7 +334,7 @@ match_check_process_avx512x8x2(struct acl_flow_avx512 *flow, uint32_t fm[2],
 
 		if (n[1] != 0) {
 			inp[1] = get_next_bytes_avx512x8(flow, pdata[1], rm[1],
-				&di[1], sizeof(uint8_t));
+				&di[1], flow->first_load_sz);
 			first_trans8(flow, inp[1], rm[1], &tr_lo[1], &tr_hi[1]);
 
 			rm[1] = _mm256_test_epi32_mask(tr_lo[1],
@@ -360,9 +360,9 @@ search_trie_avx512x8x2(struct acl_flow_avx512 *flow)
 	start_flow8(flow, CHAR_BIT, UINT8_MAX, &pdata[1], &idx[1], &di[1]);
 
 	inp[0] = get_next_bytes_avx512x8(flow, pdata[0], UINT8_MAX, &di[0],
-		sizeof(uint8_t));
+			flow->first_load_sz);
 	inp[1] = get_next_bytes_avx512x8(flow, pdata[1], UINT8_MAX, &di[1],
-		sizeof(uint8_t));
+			flow->first_load_sz);
 
 	first_trans8(flow, inp[0], UINT8_MAX, &tr_lo[0], &tr_hi[0]);
 	first_trans8(flow, inp[1], UINT8_MAX, &tr_lo[1], &tr_hi[1]);
diff --git a/lib/librte_acl/rte_acl.c b/lib/librte_acl/rte_acl.c
index fdcb7a798..9f16d28ea 100644
--- a/lib/librte_acl/rte_acl.c
+++ b/lib/librte_acl/rte_acl.c
@@ -486,6 +486,7 @@ rte_acl_dump(const struct rte_acl_ctx *ctx)
 	printf("acl context <%s>@%p\n", ctx->name, ctx);
 	printf("  socket_id=%"PRId32"\n", ctx->socket_id);
 	printf("  alg=%"PRId32"\n", ctx->alg);
+	printf("  first_load_sz=%"PRIu32"\n", ctx->first_load_sz);
 	printf("  max_rules=%"PRIu32"\n", ctx->max_rules);
 	printf("  rule_size=%"PRIu32"\n", ctx->rule_sz);
 	printf("  num_rules=%"PRIu32"\n", ctx->num_rules);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread
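
To illustrate the build-time check added above, here is a hypothetical
two-field layout for which get_first_load_size() would report a 4B first
load: field 0 does not have the largest offset, so another field's data
follows it in the input buffer and a 4B read at its offset is safe.

#include <rte_acl.h>

static const struct rte_acl_field_def defs_4b_ok[2] = {
	{
		.type = RTE_ACL_FIELD_TYPE_BITMASK,
		.size = sizeof(uint8_t),
		.field_index = 0,
		.input_index = 0,
		.offset = 0,	/* first field at offset 0 ... */
	},
	{
		.type = RTE_ACL_FIELD_TYPE_MASK,
		.size = sizeof(uint32_t),
		.field_index = 1,
		.input_index = 1,
		.offset = 4,	/* ... with more data after it */
	},
};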

* [dpdk-dev] [PATCH v2 11/12] test/acl: add AVX512 classify support
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (9 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 10/12] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
@ 2020-09-15 16:50   ` Konstantin Ananyev
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 12/12] app/acl: " Konstantin Ananyev
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add AVX512 classify to the test coverage.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test/test_acl.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/app/test/test_acl.c b/app/test/test_acl.c
index 333b34757..11d69d2d5 100644
--- a/app/test/test_acl.c
+++ b/app/test/test_acl.c
@@ -278,8 +278,8 @@ test_classify_alg(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 
 	/* set given classify alg, skip test if alg is not supported */
 	ret = rte_acl_set_ctx_classify(acx, alg);
-	if (ret == -ENOTSUP)
-		return 0;
+	if (ret != 0)
+		return (ret == -ENOTSUP) ? 0 : ret;
 
 	/**
 	 * these will run quite a few times, it's necessary to test code paths
@@ -341,6 +341,7 @@ test_classify_run(struct rte_acl_ctx *acx, struct ipv4_7tuple test_data[],
 		RTE_ACL_CLASSIFY_AVX2,
 		RTE_ACL_CLASSIFY_NEON,
 		RTE_ACL_CLASSIFY_ALTIVEC,
+		RTE_ACL_CLASSIFY_AVX512,
 	};
 
 	/* swap all bytes in the data to network order */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [dpdk-dev] [PATCH v2 12/12] app/acl: add AVX512 classify support
  2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
                     ` (10 preceding siblings ...)
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 11/12] test/acl: add AVX512 classify support Konstantin Ananyev
@ 2020-09-15 16:50   ` " Konstantin Ananyev
  11 siblings, 0 replies; 26+ messages in thread
From: Konstantin Ananyev @ 2020-09-15 16:50 UTC (permalink / raw)
  To: dev; +Cc: jerinj, ruifeng.wang, vladimir.medvedkin, Konstantin Ananyev

Add ability to use AVX512 classify method.

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 app/test-acl/main.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/app/test-acl/main.c b/app/test-acl/main.c
index d9b65517c..19b714335 100644
--- a/app/test-acl/main.c
+++ b/app/test-acl/main.c
@@ -81,6 +81,10 @@ static const struct acl_alg acl_alg[] = {
 		.name = "altivec",
 		.alg = RTE_ACL_CLASSIFY_ALTIVEC,
 	},
+	{
+		.name = "avx512",
+		.alg = RTE_ACL_CLASSIFY_AVX512,
+	},
 };
 
 static struct {
-- 
2.17.1


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
@ 2020-09-16  9:11     ` Bruce Richardson
  2020-09-16  9:36       ` Medvedkin, Vladimir
  0 siblings, 1 reply; 26+ messages in thread
From: Bruce Richardson @ 2020-09-16  9:11 UTC (permalink / raw)
  To: Konstantin Ananyev; +Cc: dev, jerinj, ruifeng.wang, vladimir.medvedkin

On Tue, Sep 15, 2020 at 05:50:20PM +0100, Konstantin Ananyev wrote:
> Add necessary changes to support new AVX512 specific ACL classify
> algorithm:
>  - changes in meson.build to check that build tools
>    (compiler, assembler, etc.) do properly support AVX512.
>  - run-time checks to make sure target platform does support AVX512.
>  - dummy rte_acl_classify_avx512() for targets where AVX512
>    implementation couldn't be properly supported.
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---

This all looks correct, though I wonder do you really need to check all
those AVX512 flags in each case? Since "F" is always present in any AVX512
implementation perhaps it can be checked, though if the other three always
need to be checked I can understand if you want to keep it there for
completeness. [Are all the other 3 used in your code?]

Acked-by: Bruce Richardson <bruce.richardson@intel.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-16  9:11     ` Bruce Richardson
@ 2020-09-16  9:36       ` Medvedkin, Vladimir
  2020-09-16  9:49         ` Bruce Richardson
  0 siblings, 1 reply; 26+ messages in thread
From: Medvedkin, Vladimir @ 2020-09-16  9:36 UTC (permalink / raw)
  To: Bruce Richardson, Konstantin Ananyev; +Cc: dev, jerinj, ruifeng.wang

Hi Bruce,

On 16/09/2020 10:11, Bruce Richardson wrote:
> On Tue, Sep 15, 2020 at 05:50:20PM +0100, Konstantin Ananyev wrote:
>> Add necessary changes to support new AVX512 specific ACL classify
>> algorithm:
>>   - changes in meson.build to check that build tools
>>     (compiler, assembler, etc.) do properly support AVX512.
>>   - run-time checks to make sure target platform does support AVX512.
>>   - dummy rte_acl_classify_avx512() for targets where AVX512
>>     implementation couldn't be properly supported.
>>
>> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
>> ---
> 
> This all looks correct, though I wonder do you really need to check all
> those AVX512 flags in each case? Since "F" is always present in any AVX512
> implementation perhaps it can be checked, though if the other three always
> need to be checked I can understand if you want to keep it there for
> completeness. [Are all the other 3 used in your code?]
> 

As for me it is good to check all the flags supported by compiler. Some 
old (but still supported by dpdk) gcc can't compile the code in some 
circumstances. For example:

gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)   <-- pretty 
old but still supported, right?

gcc -march=native -dM -E - < /dev/null | grep "AVX512"
#define __AVX512F__ 1
#define __AVX512BW__ 1
#define __AVX512CD__ 1
#define __AVX512DQ__ 1

Does not support __AVX512VL__

In acl_run_avx512x8.h, first_trans8() uses
_mm256_mmask_i32gather_epi32(), which requires this flag, so compilation
will fail.

> Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> 

-- 
Regards,
Vladimir

^ permalink raw reply	[flat|nested] 26+ messages in thread
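
For illustration only: the patch gates the AVX512 build via meson checks,
but the equivalent compile-time guard being discussed would look roughly
like the sketch below (the exact macro set is assumed from the flags
mentioned in this thread, not taken from the patch).

#if defined(__AVX512F__) && defined(__AVX512VL__) && \
	defined(__AVX512CD__) && defined(__AVX512BW__)
/* all required ISA extensions available: build the real code path */
#define ACL_HAVE_AVX512 1
#else
/* otherwise only the dummy rte_acl_classify_avx512() gets built */
#define ACL_HAVE_AVX512 0
#endif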

* Re: [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-16  9:36       ` Medvedkin, Vladimir
@ 2020-09-16  9:49         ` Bruce Richardson
  2020-09-16 10:06           ` Ananyev, Konstantin
  0 siblings, 1 reply; 26+ messages in thread
From: Bruce Richardson @ 2020-09-16  9:49 UTC (permalink / raw)
  To: Medvedkin, Vladimir; +Cc: Konstantin Ananyev, dev, jerinj, ruifeng.wang

On Wed, Sep 16, 2020 at 10:36:32AM +0100, Medvedkin, Vladimir wrote:
> Hi Bruce,
> 
> On 16/09/2020 10:11, Bruce Richardson wrote:
> > On Tue, Sep 15, 2020 at 05:50:20PM +0100, Konstantin Ananyev wrote:
> > > Add necessary changes to support new AVX512 specific ACL classify
> > > algorithm:
> > >   - changes in meson.build to check that build tools
> > >     (compiler, assembler, etc.) do properly support AVX512.
> > >   - run-time checks to make sure target platform does support AVX512.
> > >   - dummy rte_acl_classify_avx512() for targets where AVX512
> > >     implementation couldn't be properly supported.
> > > 
> > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > ---
> > 
> > This all looks correct, though I wonder do you really need to check all
> > those AVX512 flags in each case? Since "F" is always present in any AVX512
> > implementation perhaps it can be checked, though if the other three always
> > need to be checked I can understand if you want to keep it there for
> > completeness. [Are all the other 3 used in your code?]
> > 
> 
> As for me it is good to check all the flags supported by compiler. Some old
> (but still supported by dpdk) gcc can't compile the code in some
> circumstances. For example:
> 
> gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)   <-- pretty old
> but still supported, right?
> 
> gcc -march=native -dM -E - < /dev/null | grep "AVX512"
> #define __AVX512F__ 1
> #define __AVX512BW__ 1
> #define __AVX512CD__ 1
> #define __AVX512DQ__ 1
> 
> Does not support __AVX512VL__
> 
Interesting, seems like checking them all to be sure is the right approach
then. My ack stands, so please ignore the comment.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify
  2020-09-16  9:49         ` Bruce Richardson
@ 2020-09-16 10:06           ` Ananyev, Konstantin
  0 siblings, 0 replies; 26+ messages in thread
From: Ananyev, Konstantin @ 2020-09-16 10:06 UTC (permalink / raw)
  To: Richardson, Bruce, Medvedkin, Vladimir; +Cc: dev, jerinj, ruifeng.wang

 
> On Wed, Sep 16, 2020 at 10:36:32AM +0100, Medvedkin, Vladimir wrote:
> > Hi Bruce,
> >
> > On 16/09/2020 10:11, Bruce Richardson wrote:
> > > On Tue, Sep 15, 2020 at 05:50:20PM +0100, Konstantin Ananyev wrote:
> > > > Add necessary changes to support new AVX512 specific ACL classify
> > > > algorithm:
> > > >   - changes in meson.build to check that build tools
> > > >     (compiler, assembler, etc.) do properly support AVX512.
> > > >   - run-time checks to make sure target platform does support AVX512.
> > > >   - dummy rte_acl_classify_avx512() for targets where AVX512
> > > >     implementation couldn't be properly supported.
> > > >
> > > > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > > > ---
> > >
> > > This all looks correct, though I wonder do you really need to check all
> > > those AVX512 flags in each case? Since "F" is always present in any AVX512
> > > implementation perhaps it can be checked, though if the other three always
> > > need to be checked I can understand if you want to keep it there for
> > > completeness. [Are all the other 3 used in your code?]

Yep, ACL uses all of them.
Thanks
Konstantin

> > >
> >
> > As for me it is good to check all the flags supported by compiler. Some old
> > (but still supported by dpdk) gcc can't compile the code in some
> > circumstances. For example:
> >
> > gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.12)   <-- pretty old
> > but still supported, right?
> >
> > gcc -march=native -dM -E - < /dev/null | grep "AVX512"
> > #define __AVX512F__ 1
> > #define __AVX512BW__ 1
> > #define __AVX512CD__ 1
> > #define __AVX512DQ__ 1
> >
> > Does not support __AVX512VL__
> >
> Interesting, seems like checking them all to be sure is the right approach
> then. My ack stands, so please ignore the comment.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value
  2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value Konstantin Ananyev
@ 2020-09-27  3:27     ` Ruifeng Wang
  0 siblings, 0 replies; 26+ messages in thread
From: Ruifeng Wang @ 2020-09-27  3:27 UTC (permalink / raw)
  To: Konstantin Ananyev, dev; +Cc: jerinj, vladimir.medvedkin, nd

> -----Original Message-----
> From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Sent: Wednesday, September 16, 2020 12:50 AM
> To: dev@dpdk.org
> Cc: jerinj@marvell.com; Ruifeng Wang <Ruifeng.Wang@arm.com>;
> vladimir.medvedkin@intel.com; Konstantin Ananyev
> <konstantin.ananyev@intel.com>
> Subject: [PATCH v2 03/12] acl: remove of unused enum value
> 
> Removal of unused enum value (RTE_ACL_CLASSIFY_NUM).
> This enum value is not used inside DPDK, while it prevents adding new
> classify algorithms without causing ABI breakage.
> 
> Note that this change introduces a formal ABI incompatibility with previous
> versions of the ACL library.
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  doc/guides/rel_notes/deprecation.rst   | 4 ----
>  doc/guides/rel_notes/release_20_11.rst | 4 ++++
>  lib/librte_acl/rte_acl.h               | 1 -
>  3 files changed, 4 insertions(+), 5 deletions(-)
> 
> diff --git a/doc/guides/rel_notes/deprecation.rst b/doc/guides/rel_notes/deprecation.rst
> index 52168f775..3279a01ef 100644
> --- a/doc/guides/rel_notes/deprecation.rst
> +++ b/doc/guides/rel_notes/deprecation.rst
> @@ -288,10 +288,6 @@ Deprecation Notices
>    - https://patches.dpdk.org/patch/71457/
>    - https://patches.dpdk.org/patch/71456/
> 
> -* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value will be removed.
> -  This enum value is not used inside DPDK, while it prevents to add new
> -  classify algorithms without causing an ABI breakage.
> -
>  * sched: To allow more traffic classes, flexible mapping of pipe queues to
>    traffic classes, and subport level configuration of pipes and queues
>    changes will be made to macros, data structures and API functions defined
> diff --git a/doc/guides/rel_notes/release_20_11.rst b/doc/guides/rel_notes/release_20_11.rst
> index b729bdf20..a9a1b0305 100644
> --- a/doc/guides/rel_notes/release_20_11.rst
> +++ b/doc/guides/rel_notes/release_20_11.rst
> @@ -97,6 +97,10 @@ API Changes
>    and the function ``rte_rawdev_queue_conf_get()``
>    from ``void`` to ``int`` allowing the return of error codes from drivers.
> 
> +* acl: ``RTE_ACL_CLASSIFY_NUM`` enum value has been removed.
> +  This enum value was not used inside DPDK, while it prevented adding new
> +  classify algorithms without causing an ABI breakage.
> +
> 
>  ABI Changes
>  -----------
> diff --git a/lib/librte_acl/rte_acl.h b/lib/librte_acl/rte_acl.h
> index aa22e70c6..b814423a6 100644
> --- a/lib/librte_acl/rte_acl.h
> +++ b/lib/librte_acl/rte_acl.h
> @@ -241,7 +241,6 @@ enum rte_acl_classify_alg {
>  	RTE_ACL_CLASSIFY_AVX2 = 3,    /**< requires AVX2 support. */
>  	RTE_ACL_CLASSIFY_NEON = 4,    /**< requires NEON support. */
>  	RTE_ACL_CLASSIFY_ALTIVEC = 5,    /**< requires ALTIVEC support. */
> -	RTE_ACL_CLASSIFY_NUM          /* should always be the last one. */
>  };
> 
>  /**
> --
> 2.17.1

Looks good from ABI perspective.
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
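
To make the ABI concern concrete, a hypothetical before/after (the enum
names are made up to keep this self-contained; this is not the real
rte_acl.h):

#include <assert.h>

/* v1: header with the trailing sentinel, as shipped before the patch. */
enum alg_v1 { ALG_V1_ALTIVEC = 5, ALG_V1_NUM };            /* sentinel == 6 */

/* v2: one algorithm appended in front of the sentinel. */
enum alg_v2 { ALG_V2_ALTIVEC = 5, ALG_V2_AVX512, ALG_V2_NUM }; /* sentinel == 7 */

int
main(void)
{
	/* An application compiled against v1 that range-checks input with
	 * "alg < ALG_V1_NUM" would reject the new, valid id 6: the old
	 * sentinel value is baked into existing binaries. */
	assert(ALG_V2_AVX512 >= ALG_V1_NUM);
	return 0;
}

With the sentinel gone, new RTE_ACL_CLASSIFY_* entries can be appended
without changing the value of anything that existed before.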


^ permalink raw reply	[flat|nested] 26+ messages in thread

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-08-07 16:28 [dpdk-dev] [PATCH 20.11 0/7] acl: introduce AVX512 classify method Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 1/7] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 2/7] app/acl: few small improvements Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 3/7] acl: remove of unused enum value Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 4/7] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 5/7] app/acl: add AVX512 classify support Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 6/7] acl: introduce AVX512 classify implementation Konstantin Ananyev
2020-08-07 16:28 ` [dpdk-dev] [PATCH 20.11 7/7] acl: enhance " Konstantin Ananyev
2020-09-15 16:50 ` [dpdk-dev] [PATCH v2 00/12] acl: introduce AVX512 classify method Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 01/12] acl: fix x86 build when compiler doesn't support AVX2 Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 02/12] doc: fix mixing classify methods in ACL guide Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 03/12] acl: remove of unused enum value Konstantin Ananyev
2020-09-27  3:27     ` Ruifeng Wang
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 04/12] acl: remove library constructor Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 05/12] app/acl: few small improvements Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 06/12] test/acl: expand classify test coverage Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 07/12] acl: add infrastructure to support AVX512 classify Konstantin Ananyev
2020-09-16  9:11     ` Bruce Richardson
2020-09-16  9:36       ` Medvedkin, Vladimir
2020-09-16  9:49         ` Bruce Richardson
2020-09-16 10:06           ` Ananyev, Konstantin
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 08/12] acl: introduce AVX512 classify implementation Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 09/12] acl: enhance " Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 10/12] acl: for AVX512 classify use 4B load whenever possible Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 11/12] test/acl: add AVX512 classify support Konstantin Ananyev
2020-09-15 16:50   ` [dpdk-dev] [PATCH v2 12/12] app/acl: " Konstantin Ananyev
