DPDK patches and discussions
* [dpdk-dev] [PATCH v3 0/3] dynamic linking support
@ 2017-09-26  7:41 Xiaoyun Li
  2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
                   ` (3 more replies)
  0 siblings, 4 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-09-26  7:41 UTC (permalink / raw)
  To: bruce.richardson, konstantin.ananyev
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patchset selects functions dynamically at run-time based on the CPU flags
that the current machine supports. It modifies memcpy, the memcpy perf test
and x86 EFD to use function pointers that are bound at constructor time.
In a cloud environment, users can then compile once for a minimum target
such as 'haswell' (not 'native'), run on different platforms (haswell or
newer) and still get ISA optimizations based on the running CPU.
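
For readers new to the approach, the constructor-time binding described above
can be sketched roughly as follows. This is only an illustration, not the code
in this patchset: copy_fn_t, copy_baseline, copy_avx2, copy_init and
copy_dispatch are hypothetical names, both variants simply call memcpy as
placeholders, and the GCC/clang builtin __builtin_cpu_supports() stands in for
the rte_cpuflags checks the patches rely on.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical function-pointer type, analogous in spirit to rte_memcpy_t. */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/* Placeholder variants: a real implementation would have distinct bodies. */
static void *copy_baseline(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

#if defined(__x86_64__) || defined(__i386__)
static void *copy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}
#endif

/* Default to the baseline so the pointer is always valid. */
static copy_fn_t copy_ptr = copy_baseline;

/* Runs before main(); picks the best variant for the running CPU. */
__attribute__((constructor))
static void copy_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		copy_ptr = copy_avx2;
#endif
}

/* Callers always go through the pointer bound above. */
static inline void *copy_dispatch(void *dst, const void *src, size_t n)
{
	return copy_ptr(dst, src, n);
}
```

The binary can then be built for a conservative target while the indirect
call still reaches the widest variant the host supports.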

Xiaoyun Li (3):
  eal/x86: run-time dispatch over memcpy
  app/test: run-time dispatch over memcpy perf test
  efd: run-time dispatch over x86 EFD functions

---
v2
* Use gcc function multi-versioning to avoid compilation issues.
* Add macros for AVX512 and AVX2. The AVX512 code is compiled only if
users enable AVX512 and the compiler supports it. The AVX2 code is
compiled only if the compiler supports AVX2.

v3
* Reduce function calls by keeping only rte_memcpy_xxx.
* Add a condition: when the copy size is small, use the inline code path.
Otherwise, use the dynamic code path.
* To support attribute target, the clang version must be greater than 3.7.
Otherwise, the SSE/AVX code path is chosen, the same as before.
* Move two macro functions to the top of the code since they are
used in both the inline SSE/AVX and the dynamic SSE/AVX code.
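
The small-size/large-size split mentioned in the v3 notes can be pictured
with the sketch below. It is a simplified illustration under assumptions:
the 128-byte threshold (INLINE_COPY_MAX) is made up for the example, and
copy_small_inline, copy_large, copy_large_ptr and copy_hybrid are
hypothetical names rather than functions from this patchset. The counter
exists only to make the chosen path observable.

```c
#include <stddef.h>
#include <string.h>

/* Assumed threshold for illustration only; not taken from the patch. */
#define INLINE_COPY_MAX 128

typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/* Counts how often the indirect (dynamic) path is taken. */
static unsigned long dynamic_calls;

/* Stand-in for an ISA-specific routine selected at constructor time. */
static void *copy_large(void *dst, const void *src, size_t n)
{
	dynamic_calls++;
	return memcpy(dst, src, n);
}

static copy_fn_t copy_large_ptr = copy_large;

/* Small copies stay inline, so no indirect call overhead is paid. */
static inline void *copy_small_inline(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Small sizes take the inline path, everything else the pointer. */
static inline void *copy_hybrid(void *dst, const void *src, size_t n)
{
	if (n <= INLINE_COPY_MAX)
		return copy_small_inline(dst, src, n);
	return copy_large_ptr(dst, src, n);
}
```

The point of the split is that an indirect call is relatively expensive for
tiny copies, while for large copies it is amortized over the data moved.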

 .../common/include/arch/x86/rte_memcpy.h           | 1232 ++++++++++++++++++--
 lib/librte_efd/rte_efd_x86.h                       |   41 +-
 test/test/test_memcpy_perf.c                       |   40 +-
 3 files changed, 1200 insertions(+), 113 deletions(-)

-- 
2.7.4


* [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy
  2017-09-26  7:41 [dpdk-dev] [PATCH v3 0/3] dynamic linking support Xiaoyun Li
@ 2017-09-26  7:41 ` Xiaoyun Li
  2017-10-01 23:41   ` Ananyev, Konstantin
  2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 88+ messages in thread
From: Xiaoyun Li @ 2017-09-26  7:41 UTC (permalink / raw)
  To: bruce.richardson, konstantin.ananyev
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch dynamically selects the memcpy implementation at run-time based
on the CPU flags that the current machine supports. It uses function
pointers which are bound to the relevant functions at constructor time.
In addition, the AVX512 code is compiled only if users enable it in the
config and the compiler supports it.
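
As a rough illustration of how a per-function target attribute combines with
a compiler-support guard, consider the sketch below. It assumes GCC or a
recent clang on x86. HAVE_TARGET_AVX2 is a hypothetical stand-in for a macro
such as CC_SUPPORT_AVX2, and copy32_avx2/copy32 are made-up names, so this
shows the technique rather than the code in this patch.

```c
#include <stddef.h>
#include <string.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <immintrin.h>
/* Hypothetical guard: compiler can build AVX2 code for this target. */
#define HAVE_TARGET_AVX2 1
#endif

#ifdef HAVE_TARGET_AVX2
/*
 * Only this function is compiled with AVX2 enabled; the rest of the
 * translation unit keeps the baseline ISA chosen on the command line.
 */
__attribute__((target("avx2")))
static void *copy32_avx2(void *dst, const void *src)
{
	__m256i v = _mm256_loadu_si256((const __m256i *)src);

	_mm256_storeu_si256((__m256i *)dst, v);
	return dst;
}
#endif

/* Copy exactly 32 bytes, using AVX2 only when the running CPU has it. */
static void *copy32(void *dst, const void *src)
{
#ifdef HAVE_TARGET_AVX2
	if (__builtin_cpu_supports("avx2"))
		return copy32_avx2(dst, src);
#endif
	return memcpy(dst, src, 32);
}
```

When the guard is not defined, the AVX2 body is never seen by the compiler
and the baseline path is used unconditionally.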

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
v2
* Use gcc function multi-versioning to avoid compilation issues.
* Add macros for AVX512 and AVX2. The AVX512 code is compiled only if
users enable AVX512 and the compiler supports it. The AVX2 code is
compiled only if the compiler supports AVX2.

v3
* Reduce function calls by keeping only rte_memcpy_xxx.
* Add a condition: when the copy size is small, use the inline code path.
Otherwise, use the dynamic code path.
* To support attribute target, the clang version must be greater than 3.7.
Otherwise, the SSE/AVX code path is chosen, the same as before.
* Move two macro functions to the top of the code since they are
used in both the inline SSE/AVX and the dynamic SSE/AVX code.

 .../common/include/arch/x86/rte_memcpy.h           | 1232 ++++++++++++++++++--
 1 file changed, 1135 insertions(+), 97 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 74c280c..ed6c412 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -45,6 +45,8 @@
 #include <string.h>
 #include <rte_vect.h>
 #include <rte_common.h>
+#include <rte_cpuflags.h>
+#include <rte_log.h>
 
 #ifdef __cplusplus
 extern "C" {
@@ -68,6 +70,100 @@ extern "C" {
 static __rte_always_inline void *
 rte_memcpy(void *dst, const void *src, size_t n);
 
+/**
+ * Macro for copying unaligned block from one location to another with constant load offset,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be immediate value within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
+__extension__ ({                                                                                            \
+    int tmp;                                                                                                \
+    while (len >= 128 + 16 - offset) {                                                                      \
+        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
+        len -= 128;                                                                                         \
+        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
+        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
+        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
+        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
+        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
+        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
+        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
+        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
+        src = (const uint8_t *)src + 128;                                                                   \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
+        dst = (uint8_t *)dst + 128;                                                                         \
+    }                                                                                                       \
+    tmp = len;                                                                                              \
+    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
+    tmp -= len;                                                                                             \
+    src = (const uint8_t *)src + tmp;                                                                       \
+    dst = (uint8_t *)dst + tmp;                                                                             \
+    if (len >= 32 + 16 - offset) {                                                                          \
+        while (len >= 32 + 16 - offset) {                                                                   \
+            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
+            len -= 32;                                                                                      \
+            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
+            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
+            src = (const uint8_t *)src + 32;                                                                \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
+            dst = (uint8_t *)dst + 32;                                                                      \
+        }                                                                                                   \
+        tmp = len;                                                                                          \
+        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
+        tmp -= len;                                                                                         \
+        src = (const uint8_t *)src + tmp;                                                                   \
+        dst = (uint8_t *)dst + tmp;                                                                         \
+    }                                                                                                       \
+})
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
+__extension__ ({                                                      \
+    switch (offset) {                                                 \
+    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
+    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
+    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
+    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
+    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
+    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
+    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
+    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
+    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
+    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
+    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
+    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
+    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
+    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
+    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
+    default:;                                                         \
+    }                                                                 \
+})
+
 #ifdef RTE_MACHINE_CPUFLAG_AVX512F
 
 #define ALIGNMENT_MASK 0x3F
@@ -589,100 +685,6 @@ rte_mov256(uint8_t *dst, const uint8_t *src)
 	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
 }
 
-/**
- * Macro for copying unaligned block from one location to another with constant load offset,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be immediate value within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
-__extension__ ({                                                                                            \
-    int tmp;                                                                                                \
-    while (len >= 128 + 16 - offset) {                                                                      \
-        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
-        len -= 128;                                                                                         \
-        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
-        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
-        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
-        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
-        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
-        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
-        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
-        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
-        src = (const uint8_t *)src + 128;                                                                   \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
-        dst = (uint8_t *)dst + 128;                                                                         \
-    }                                                                                                       \
-    tmp = len;                                                                                              \
-    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
-    tmp -= len;                                                                                             \
-    src = (const uint8_t *)src + tmp;                                                                       \
-    dst = (uint8_t *)dst + tmp;                                                                             \
-    if (len >= 32 + 16 - offset) {                                                                          \
-        while (len >= 32 + 16 - offset) {                                                                   \
-            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
-            len -= 32;                                                                                      \
-            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
-            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
-            src = (const uint8_t *)src + 32;                                                                \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
-            dst = (uint8_t *)dst + 32;                                                                      \
-        }                                                                                                   \
-        tmp = len;                                                                                          \
-        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
-        tmp -= len;                                                                                         \
-        src = (const uint8_t *)src + tmp;                                                                   \
-        dst = (uint8_t *)dst + tmp;                                                                         \
-    }                                                                                                       \
-})
-
-/**
- * Macro for copying unaligned block from one location to another,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Use switch here because the aligning instruction requires immediate value for shift count.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
-__extension__ ({                                                      \
-    switch (offset) {                                                 \
-    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
-    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
-    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
-    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
-    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
-    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
-    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
-    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
-    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
-    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
-    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
-    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
-    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
-    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
-    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
-    default:;                                                         \
-    }                                                                 \
-})
-
 static inline void *
 rte_memcpy_generic(void *dst, const void *src, size_t n)
 {
@@ -888,13 +890,1049 @@ rte_memcpy_aligned(void *dst, const void *src, size_t n)
 	return ret;
 }
 
+/*
+ * Run-time dispatch implementation of memcpy.
+ */
+
+typedef void * (*rte_memcpy_t)(void *dst, const void *src, size_t n);
+static rte_memcpy_t rte_memcpy_ptr;
+
+/**
+ * AVX512 implementation below
+ */
+#ifdef CC_SUPPORT_AVX512
+__attribute__((target("avx512f")))
+static inline void *
+rte_memcpy_AVX512F(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x3F)) {
+		void *ret = dst;
+
+		/* Copy size < 16 bytes */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dst = *(const uint8_t *)src;
+				src = (const uint8_t *)src + 1;
+				dst = (uint8_t *)dst + 1;
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dst = *(const uint16_t *)src;
+				src = (const uint16_t *)src + 1;
+				dst = (uint16_t *)dst + 1;
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dst = *(const uint32_t *)src;
+				src = (const uint32_t *)src + 1;
+				dst = (uint32_t *)dst + 1;
+			}
+			if (n & 0x08)
+				*(uint64_t *)dst = *(const uint64_t *)src;
+
+			return ret;
+		}
+
+		/* Copy 16 <= size <= 32 bytes */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+
+			return ret;
+		}
+
+		/* Copy 32 < size <= 64 bytes */
+		if (n <= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src - 32 + n));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst - 32 + n), ymm1);
+
+			return ret;
+		}
+
+		/* Copy 64 bytes blocks */
+		for (; n >= 64; n -= 64) {
+			__m512i zmm0;
+			zmm0 = _mm512_loadu_si512((const void *)src);
+			_mm512_storeu_si512((void *)dst, zmm0);
+			dst = (uint8_t *)dst + 64;
+			src = (const uint8_t *)src + 64;
+		}
+
+		/* Copy whatever left */
+		__m512i zmm0;
+		zmm0 = _mm512_loadu_si512((const void *)
+			((const uint8_t *)src - 64 + n));
+		_mm512_storeu_si512((void *)((uint8_t *)dst - 64 + n), zmm0);
+
+		return ret;
+	} else {
+		uintptr_t dstu = (uintptr_t)dst;
+		uintptr_t srcu = (uintptr_t)src;
+		void *ret = dst;
+		size_t dstofss;
+		size_t bits;
+
+		/**
+		 * Copy less than 16 bytes
+		 */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dstu = *(const uint8_t *)srcu;
+				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+				dstu = (uintptr_t)((uint8_t *)dstu + 1);
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dstu = *(const uint16_t *)srcu;
+				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+				dstu = (uintptr_t)((uint16_t *)dstu + 1);
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dstu = *(const uint32_t *)srcu;
+				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+				dstu = (uintptr_t)((uint32_t *)dstu + 1);
+			}
+			if (n & 0x08)
+				*(uint64_t *)dstu = *(const uint64_t *)srcu;
+			return ret;
+		}
+
+		/**
+		 * Fast way when copy size doesn't exceed 512 bytes
+		 */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+			return ret;
+		}
+		if (n <= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src - 32 + n));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst - 32 + n), ymm1);
+			return ret;
+		}
+		if (n <= 512) {
+			if (n >= 256) {
+				n -= 256;
+				__m512i zmm0, zmm1, zmm2, zmm3;
+				zmm0 = _mm512_loadu_si512((const void *)src);
+				zmm1 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 64));
+				zmm2 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 2*64));
+				zmm3 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 3*64));
+				_mm512_storeu_si512((void *)dst, zmm0);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 64), zmm1);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 2*64), zmm2);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 3*64), zmm3);
+				src = (const uint8_t *)src + 256;
+				dst = (uint8_t *)dst + 256;
+			}
+			if (n >= 128) {
+				n -= 128;
+				__m512i zmm0, zmm1;
+				zmm0 = _mm512_loadu_si512((const void *)src);
+				zmm1 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 64));
+				_mm512_storeu_si512((void *)dst, zmm0);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 64), zmm1);
+				src = (const uint8_t *)src + 128;
+				dst = (uint8_t *)dst + 128;
+			}
+COPY_BLOCK_128_BACK63:
+			if (n > 64) {
+				__m512i zmm0, zmm1;
+				zmm0 = _mm512_loadu_si512((const void *)src);
+				zmm1 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src - 64 + n));
+				_mm512_storeu_si512((void *)dst, zmm0);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst - 64 + n), zmm1);
+				return ret;
+			}
+			if (n > 0) {
+				__m512i zmm0;
+				zmm0 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src - 64 + n));
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst - 64 + n), zmm0);
+			}
+			return ret;
+		}
+
+		/**
+		 * Make store aligned when copy size exceeds 512 bytes
+		 */
+		dstofss = ((uintptr_t)dst & 0x3F);
+		if (dstofss > 0) {
+			dstofss = 64 - dstofss;
+			n -= dstofss;
+			__m512i zmm0;
+			zmm0 = _mm512_loadu_si512((const void *)src);
+			_mm512_storeu_si512((void *)dst, zmm0);
+			src = (const uint8_t *)src + dstofss;
+			dst = (uint8_t *)dst + dstofss;
+		}
+
+		/**
+		 * Copy 512-byte blocks.
+		 * Use copy block function for better instruction order control,
+		 * which is important when load is unaligned.
+		 */
+		__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+		while (n >= 512) {
+			zmm0 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 0 * 64));
+			n -= 512;
+			zmm1 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 1 * 64));
+			zmm2 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 2 * 64));
+			zmm3 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 3 * 64));
+			zmm4 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 4 * 64));
+			zmm5 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 5 * 64));
+			zmm6 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 6 * 64));
+			zmm7 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 7 * 64));
+			src = (const uint8_t *)src + 512;
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 0 * 64), zmm0);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 1 * 64), zmm1);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 2 * 64), zmm2);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 3 * 64), zmm3);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 4 * 64), zmm4);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 5 * 64), zmm5);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 6 * 64), zmm6);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 7 * 64), zmm7);
+			dst = (uint8_t *)dst + 512;
+		}
+		bits = n;
+		n = n & 511;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+
+		/**
+		 * Copy 128-byte blocks.
+		 * Use copy block function for better instruction order control,
+		 * which is important when load is unaligned.
+		 */
+		if (n >= 128) {
+			__m512i zmm0, zmm1;
+
+			while (n >= 128) {
+				zmm0 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 0 * 64));
+				n -= 128;
+				zmm1 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 1 * 64));
+				src = (const uint8_t *)src + 128;
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 0 * 64), zmm0);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 1 * 64), zmm1);
+				dst = (uint8_t *)dst + 128;
+			}
+			bits = n;
+			n = n & 127;
+			bits -= n;
+			src = (const uint8_t *)src + bits;
+			dst = (uint8_t *)dst + bits;
+		}
+
+		/**
+		 * Copy whatever left
+		 */
+		goto COPY_BLOCK_128_BACK63;
+	}
+}
+#endif
+
+/**
+ * AVX2 implementation below
+ */
+#ifdef CC_SUPPORT_AVX2
+__attribute__((target("avx2")))
+static inline void *
+rte_memcpy_AVX2(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x1F)) {
+		void *ret = dst;
+
+		/* Copy size < 16 bytes */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dst = *(const uint8_t *)src;
+				src = (const uint8_t *)src + 1;
+				dst = (uint8_t *)dst + 1;
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dst = *(const uint16_t *)src;
+				src = (const uint16_t *)src + 1;
+				dst = (uint16_t *)dst + 1;
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dst = *(const uint32_t *)src;
+				src = (const uint32_t *)src + 1;
+				dst = (uint32_t *)dst + 1;
+			}
+			if (n & 0x08)
+				*(uint64_t *)dst = *(const uint64_t *)src;
+
+			return ret;
+		}
+
+		/* Copy 16 <= size <= 32 bytes */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+
+			return ret;
+		}
+
+		/* Copy 32 < size <= 64 bytes */
+		if (n <= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src - 32 + n));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst - 32 + n), ymm1);
+
+			return ret;
+		}
+
+		/* Copy 64 bytes blocks */
+		for (; n >= 64; n -= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 32));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 32), ymm1);
+			dst = (uint8_t *)dst + 64;
+			src = (const uint8_t *)src + 64;
+		}
+
+		/* Copy whatever left */
+		__m256i ymm0, ymm1;
+		ymm0 = _mm256_loadu_si256((const __m256i *)
+			((const uint8_t *)src - 64 + n));
+		ymm1 = _mm256_loadu_si256((const __m256i *)
+			((const uint8_t *)src - 32 + n));
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst - 64 + n), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst - 32 + n), ymm1);
+
+		return ret;
+	} else {
+		uintptr_t dstu = (uintptr_t)dst;
+		uintptr_t srcu = (uintptr_t)src;
+		void *ret = dst;
+		size_t dstofss;
+		size_t bits;
+
+		/**
+		 * Copy less than 16 bytes
+		 */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dstu = *(const uint8_t *)srcu;
+				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+				dstu = (uintptr_t)((uint8_t *)dstu + 1);
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dstu = *(const uint16_t *)srcu;
+				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+				dstu = (uintptr_t)((uint16_t *)dstu + 1);
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dstu = *(const uint32_t *)srcu;
+				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+				dstu = (uintptr_t)((uint32_t *)dstu + 1);
+			}
+			if (n & 0x08)
+				*(uint64_t *)dstu = *(const uint64_t *)srcu;
+			return ret;
+		}
+
+		/**
+		 * Fast way when copy size doesn't exceed 256 bytes
+		 */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+			return ret;
+		}
+		if (n <= 48) {
+			__m128i xmm0, xmm1, xmm2;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm2);
+			return ret;
+		}
+		if (n <= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src - 32 + n));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst - 32 + n), ymm1);
+			return ret;
+		}
+		if (n <= 256) {
+			if (n >= 128) {
+				n -= 128;
+				__m256i ymm0, ymm1, ymm2, ymm3;
+				ymm0 = _mm256_loadu_si256((const __m256i *)src);
+				ymm1 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src + 32));
+				ymm2 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src + 2*32));
+				ymm3 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src + 3*32));
+				_mm256_storeu_si256((__m256i *)dst, ymm0);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst + 32), ymm1);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst + 2*32), ymm2);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst + 3*32), ymm3);
+				src = (const uint8_t *)src + 128;
+				dst = (uint8_t *)dst + 128;
+			}
+COPY_BLOCK_128_BACK31:
+			if (n >= 64) {
+				n -= 64;
+				__m256i ymm0, ymm1;
+				ymm0 = _mm256_loadu_si256((const __m256i *)src);
+				ymm1 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src + 32));
+				_mm256_storeu_si256((__m256i *)dst, ymm0);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst + 32), ymm1);
+				src = (const uint8_t *)src + 64;
+				dst = (uint8_t *)dst + 64;
+			}
+			if (n > 32) {
+				__m256i ymm0, ymm1;
+				ymm0 = _mm256_loadu_si256((const __m256i *)src);
+				ymm1 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src - 32 + n));
+				_mm256_storeu_si256((__m256i *)dst, ymm0);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst - 32 + n), ymm1);
+				return ret;
+			}
+			if (n > 0) {
+				__m256i ymm0;
+				ymm0 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src - 32 + n));
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst - 32 + n), ymm0);
+			}
+			return ret;
+		}
+
+		/**
+		 * Make store aligned when copy size exceeds 256 bytes
+		 */
+		dstofss = (uintptr_t)dst & 0x1F;
+		if (dstofss > 0) {
+			dstofss = 32 - dstofss;
+			n -= dstofss;
+			__m256i ymm0;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			src = (const uint8_t *)src + dstofss;
+			dst = (uint8_t *)dst + dstofss;
+		}
+
+		/**
+		 * Copy 128-byte blocks
+		 */
+		__m256i ymm0, ymm1, ymm2, ymm3;
+
+		while (n >= 128) {
+			ymm0 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 0 * 32));
+			n -= 128;
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 1 * 32));
+			ymm2 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 2 * 32));
+			ymm3 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 3 * 32));
+			src = (const uint8_t *)src + 128;
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 0 * 32), ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 1 * 32), ymm1);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 2 * 32), ymm2);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 3 * 32), ymm3);
+			dst = (uint8_t *)dst + 128;
+		}
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+
+		/**
+		 * Copy whatever left
+		 */
+		goto COPY_BLOCK_128_BACK31;
+	}
+}
+#endif
+
+/**
+ * SSE & AVX implementation below
+ */
+static inline void *
+rte_memcpy_DEFAULT(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x0F)) {
+		void *ret = dst;
+
+		/* Copy size < 16 bytes */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dst = *(const uint8_t *)src;
+				src = (const uint8_t *)src + 1;
+				dst = (uint8_t *)dst + 1;
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dst = *(const uint16_t *)src;
+				src = (const uint16_t *)src + 1;
+				dst = (uint16_t *)dst + 1;
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dst = *(const uint32_t *)src;
+				src = (const uint32_t *)src + 1;
+				dst = (uint32_t *)dst + 1;
+			}
+			if (n & 0x08)
+				*(uint64_t *)dst = *(const uint64_t *)src;
+
+			return ret;
+		}
+
+		/* Copy 16 <= size <= 32 bytes */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+
+			return ret;
+		}
+
+		/* Copy 32 < size <= 64 bytes */
+		if (n <= 64) {
+			__m128i xmm0, xmm1, xmm2, xmm3;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 32 + n));
+			xmm3 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 32 + n), xmm2);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm3);
+
+			return ret;
+		}
+
+		/* Copy 64 bytes blocks */
+		for (; n >= 64; n -= 64) {
+			__m128i xmm0, xmm1, xmm2, xmm3;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 2*16));
+			xmm3 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 3*16));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 2*16), xmm2);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 3*16), xmm3);
+			dst = (uint8_t *)dst + 64;
+			src = (const uint8_t *)src + 64;
+		}
+
+		/* Copy whatever left */
+		__m128i xmm0, xmm1, xmm2, xmm3;
+		xmm0 = _mm_loadu_si128((const __m128i *)
+			((const uint8_t *)src - 64 + n));
+		xmm1 = _mm_loadu_si128((const __m128i *)
+			((const uint8_t *)src - 48 + n));
+		xmm2 = _mm_loadu_si128((const __m128i *)
+			((const uint8_t *)src - 32 + n));
+		xmm3 = _mm_loadu_si128((const __m128i *)
+			((const uint8_t *)src - 16 + n));
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 64 + n), xmm0);
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 48 + n), xmm1);
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 32 + n), xmm2);
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 16 + n), xmm3);
+
+		return ret;
+	} else {
+		__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+		uintptr_t dstu = (uintptr_t)dst;
+		uintptr_t srcu = (uintptr_t)src;
+		void *ret = dst;
+		size_t dstofss;
+		size_t srcofs;
+
+		/**
+		 * Copy less than 16 bytes
+		 */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dstu = *(const uint8_t *)srcu;
+				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+				dstu = (uintptr_t)((uint8_t *)dstu + 1);
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dstu = *(const uint16_t *)srcu;
+				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+				dstu = (uintptr_t)((uint16_t *)dstu + 1);
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dstu = *(const uint32_t *)srcu;
+				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+				dstu = (uintptr_t)((uint32_t *)dstu + 1);
+			}
+			if (n & 0x08)
+				*(uint64_t *)dstu = *(const uint64_t *)srcu;
+			return ret;
+		}
+
+		/**
+		 * Fast way when copy size doesn't exceed 512 bytes
+		 */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+			return ret;
+		}
+		if (n <= 48) {
+			__m128i xmm0, xmm1, xmm2;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm2);
+			return ret;
+		}
+		if (n <= 64) {
+			__m128i xmm0, xmm1, xmm2, xmm3;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 32));
+			xmm3 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 32), xmm2);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm3);
+			return ret;
+		}
+		if (n <= 128)
+			goto COPY_BLOCK_128_BACK15;
+		if (n <= 512) {
+			if (n >= 256) {
+				n -= 256;
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 3*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 4*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 5*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 4*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 5*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 6*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 7*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 6*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 7*16), xmm1);
+
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 3*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 4*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 5*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 4*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 5*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 6*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 7*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 6*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 7*16), xmm1);
+				src = (const uint8_t *)src + 256;
+				dst = (uint8_t *)dst + 256;
+			}
+COPY_BLOCK_255_BACK15:
+			if (n >= 128) {
+				n -= 128;
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 3*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 4*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 5*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 4*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 5*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 6*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 7*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 6*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 7*16), xmm1);
+				src = (const uint8_t *)src + 128;
+				dst = (uint8_t *)dst + 128;
+			}
+COPY_BLOCK_128_BACK15:
+			if (n >= 64) {
+				n -= 64;
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 3*16), xmm1);
+				src = (const uint8_t *)src + 64;
+				dst = (uint8_t *)dst + 64;
+			}
+COPY_BLOCK_64_BACK15:
+			if (n >= 32) {
+				n -= 32;
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+				src = (const uint8_t *)src + 32;
+				dst = (uint8_t *)dst + 32;
+			}
+			if (n > 16) {
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src - 16 + n));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst - 16 + n), xmm1);
+				return ret;
+			}
+			if (n > 0) {
+				__m128i xmm0;
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src - 16 + n));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst - 16 + n), xmm0);
+			}
+			return ret;
+		}
+
+		/**
+		 * Make store aligned when copy size exceeds 512 bytes,
+		 * and make sure the first 15 bytes are copied, because
+		 * unaligned copy functions require up to 15 bytes
+		 * backwards access.
+		 */
+		dstofss = (uintptr_t)dst & 0x0F;
+		if (dstofss > 0) {
+			dstofss = 16 - dstofss + 16;
+			n -= dstofss;
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			src = (const uint8_t *)src + dstofss;
+			dst = (uint8_t *)dst + dstofss;
+		}
+		srcofs = ((uintptr_t)src & 0x0F);
+
+		/**
+		 * For aligned copy
+		 */
+		if (srcofs == 0) {
+			/**
+			 * Copy 256-byte blocks
+			 */
+			for (; n >= 256; n -= 256) {
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 3*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 4*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 5*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 4*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 5*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 6*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 7*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 6*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 7*16), xmm1);
+
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 8*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 9*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 8*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 9*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 10*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 11*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 10*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 11*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 12*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 13*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 12*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 13*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 14*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 15*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 14*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 15*16), xmm1);
+				dst = (uint8_t *)dst + 256;
+				src = (const uint8_t *)src + 256;
+			}
+
+			/**
+			 * Copy whatever left
+			 */
+			goto COPY_BLOCK_255_BACK15;
+		}
+
+		/**
+		 * For copy with unaligned load
+		 */
+		MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+		/**
+		 * Copy whatever left
+		 */
+		goto COPY_BLOCK_64_BACK15;
+	}
+}
+
+static void __attribute__((constructor))
+rte_memcpy_init(void)
+{
+#ifdef CC_SUPPORT_AVX512
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
+		rte_memcpy_ptr = rte_memcpy_AVX512F;
+		RTE_LOG(DEBUG, EAL, "Using AVX512 memcpy\n");
+	} else
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
+		rte_memcpy_ptr = rte_memcpy_AVX2;
+		RTE_LOG(DEBUG, EAL, "Using AVX2 memcpy\n");
+	} else
+#endif
+	{
+		rte_memcpy_ptr = rte_memcpy_DEFAULT;
+		RTE_LOG(DEBUG, EAL, "Using default SSE/AVX memcpy\n");
+	}
+}
+
+#define MEMCPY_THRESH 128
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
-	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-		return rte_memcpy_aligned(dst, src, n);
+	if (n <= MEMCPY_THRESH) {
+		if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+			return rte_memcpy_aligned(dst, src, n);
+		else
+			return rte_memcpy_generic(dst, src, n);
+	}
 	else
-		return rte_memcpy_generic(dst, src, n);
+		return (*rte_memcpy_ptr)(dst, src, n);
 }
 
 #ifdef __cplusplus
-- 
2.7.4
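
The small-size branches in the patch above (for example the 16 to 32 byte
case) avoid a byte loop by issuing one fixed-width copy from the start of the
buffer and one that ends exactly at the last byte, letting the two overlap in
the middle. A minimal portable sketch of that idea follows. It assumes plain
memcpy() of a constant size in place of the SSE loads and stores, and
copy16_to_32() is a hypothetical name, not a function from the patch:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of DPDK: copy n bytes where 16 <= n <= 32.
 * The head chunk covers bytes [0, 16) and the tail chunk covers
 * bytes [n - 16, n); for n < 32 the two chunks overlap, which is harmless
 * because both carry the same source data.
 */
static void *
copy16_to_32(void *dst, const void *src, size_t n)
{
	uint8_t head[16];
	uint8_t tail[16];

	/* Load both chunks first, as the patch does with xmm0 and xmm1. */
	memcpy(head, src, 16);
	memcpy(tail, (const uint8_t *)src + n - 16, 16);

	/* Then store them; the overlapping bytes are written twice. */
	memcpy(dst, head, 16);
	memcpy((uint8_t *)dst + n - 16, tail, 16);

	return dst;
}
```

The caller has to guarantee the size range, as the surrounding if-chain does
in the patch; for n < 16 the tail address would fall before the start of the
buffer.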

* [dpdk-dev] [PATCH v3 2/3] app/test: run-time dispatch over memcpy perf test
  2017-09-26  7:41 [dpdk-dev] [PATCH v3 0/3] dynamic linking support Xiaoyun Li
  2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-09-26  7:41 ` Xiaoyun Li
  2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
  2017-10-02 16:13 ` [dpdk-dev] [PATCH v4 0/3] run-time Linking support Xiaoyun Li
  3 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-09-26  7:41 UTC (permalink / raw)
  To: bruce.richardson, konstantin.ananyev
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch changes the memcpy perf test so that the alignment unit is
selected at run-time, based on the CPU flags that the machine supports,
instead of being fixed at build-time.
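
As a rough illustration of the selection described here, the sketch below
maps CPU capabilities to an alignment and uses it to mask an offset. It is
only a sketch under stated assumptions: it uses the GCC/clang builtins
__builtin_cpu_init() and __builtin_cpu_supports() where the patch uses
rte_cpu_get_flag_enabled(), and pick_alignment_unit(),
detect_alignment_unit() and align_down() are hypothetical names:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: map CPU capabilities to an alignment in bytes. */
static size_t
pick_alignment_unit(int has_avx512f, int has_avx2)
{
	if (has_avx512f)
		return 64;	/* 512-bit vectors */
	if (has_avx2)
		return 32;	/* 256-bit vectors */
	return 16;		/* 128-bit SSE vectors */
}

/*
 * Query the running CPU. The builtins are compiler specific and only
 * available on x86, so other targets fall back to the 16-byte default.
 */
static size_t
detect_alignment_unit(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	return pick_alignment_unit(__builtin_cpu_supports("avx512f"),
				   __builtin_cpu_supports("avx2"));
#else
	return pick_alignment_unit(0, 0);
#endif
}

/* Round an offset down to a multiple of unit (unit must be a power of 2). */
static size_t
align_down(size_t offset, size_t unit)
{
	return offset & ~(unit - 1);
}
```

Keeping the unit in a size_t means the mask in align_down() is computed at
full width, which is the same masking that get_rand_offset() applies in the
test.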

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 test/test/test_memcpy_perf.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/test/test/test_memcpy_perf.c b/test/test/test_memcpy_perf.c
index ff3aaaa..33def3b 100644
--- a/test/test/test_memcpy_perf.c
+++ b/test/test/test_memcpy_perf.c
@@ -79,13 +79,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
-#define ALIGNMENT_UNIT          64
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-#define ALIGNMENT_UNIT          32
-#else /* RTE_MACHINE_CPUFLAG */
-#define ALIGNMENT_UNIT          16
-#endif /* RTE_MACHINE_CPUFLAG */
+static uint8_t alignment_unit = 16;
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
@@ -100,20 +94,39 @@ static int
 init_buffers(void)
 {
 	unsigned i;
+#ifdef CC_SUPPORT_AVX512
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F))
+		alignment_unit = 64;
+	else
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		alignment_unit = 32;
+	else
+#endif
+		alignment_unit = 16;
 
-	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_read = rte_malloc("memcpy",
+				    LARGE_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (large_buf_read == NULL)
 		goto error_large_buf_read;
 
-	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_write = rte_malloc("memcpy",
+				     LARGE_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (large_buf_write == NULL)
 		goto error_large_buf_write;
 
-	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_read = rte_malloc("memcpy",
+				    SMALL_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (small_buf_read == NULL)
 		goto error_small_buf_read;
 
-	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_write = rte_malloc("memcpy",
+				     SMALL_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (small_buf_write == NULL)
 		goto error_small_buf_write;
 
@@ -153,7 +166,7 @@ static inline size_t
 get_rand_offset(size_t uoffset)
 {
 	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
-			~(ALIGNMENT_UNIT - 1)) + uoffset;
+			~(alignment_unit - 1)) + uoffset;
 }
 
 /* Fill in source and destination addresses. */
@@ -321,7 +334,8 @@ perf_test(void)
 		   "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
 		   "------- -------------- -------------- -------------- --------------");
 
-	printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT);
+	printf("\n========================= %2dB aligned ============================",
+		alignment_unit);
 	/* Do aligned tests where size is a variable */
 	perf_test_variable_aligned();
 	printf("\n------- -------------- -------------- -------------- --------------");
-- 
2.7.4

* [dpdk-dev] [PATCH v3 3/3] efd: run-time dispatch over x86 EFD functions
  2017-09-26  7:41 [dpdk-dev] [PATCH v3 0/3] dynamic linking support Xiaoyun Li
  2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
  2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
@ 2017-09-26  7:41 ` Xiaoyun Li
  2017-10-02  0:08   ` Ananyev, Konstantin
  2017-10-02 16:13 ` [dpdk-dev] [PATCH v4 0/3] run-time Linking support Xiaoyun Li
  3 siblings, 1 reply; 88+ messages in thread
From: Xiaoyun Li @ 2017-09-26  7:41 UTC (permalink / raw)
  To: bruce.richardson, konstantin.ananyev
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch dynamically selects x86 EFD functions at run-time.
It uses a function pointer and binds it to the relevant
function based on CPU flags at constructor time.
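
The binding pattern can be sketched in isolation as below. This is a
simplified stand-in, not the EFD code: sum_default(), sum_avx2(), sum_ptr and
sum_bytes() are hypothetical names, the AVX2 variant relies on the compiler's
target attribute instead of hand-written intrinsics, and the CPU check uses
the GCC/clang builtin __builtin_cpu_supports() where the patch uses
rte_cpu_get_flag_enabled():

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t (*sum_fn_t)(const uint8_t *buf, size_t len);

/* Baseline variant, built for the minimum target. */
static uint32_t
sum_default(const uint8_t *buf, size_t len)
{
	uint32_t s = 0;

	for (size_t i = 0; i < len; i++)
		s += buf[i];
	return s;
}

#if defined(__x86_64__) || defined(__i386__)
/* Same logic; the compiler may emit AVX2 code for this function only. */
static uint32_t __attribute__((target("avx2")))
sum_avx2(const uint8_t *buf, size_t len)
{
	uint32_t s = 0;

	for (size_t i = 0; i < len; i++)
		s += buf[i];
	return s;
}
#endif

/* Start from the baseline so the pointer is never NULL. */
static sum_fn_t sum_ptr = sum_default;

/* Runs before main() and rebinds the pointer if the CPU allows it. */
static void __attribute__((constructor))
sum_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		sum_ptr = sum_avx2;
#endif
}

/* Public entry point: one indirect call through the bound pointer. */
static inline uint32_t
sum_bytes(const uint8_t *buf, size_t len)
{
	return (*sum_ptr)(buf, len);
}
```

Initializing the pointer to the baseline variant is a choice made for this
sketch; it keeps calls safe even if another constructor runs first, since the
order of constructors across translation units is not guaranteed.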

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_efd/rte_efd_x86.h | 41 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 3 deletions(-)

diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
index 34f37d7..93b6743 100644
--- a/lib/librte_efd/rte_efd_x86.h
+++ b/lib/librte_efd/rte_efd_x86.h
@@ -43,12 +43,29 @@
 #define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
 #endif
 
+typedef efd_value_t
+(*efd_lookup_internal_avx2_t)(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b);
+
+static efd_lookup_internal_avx2_t efd_lookup_internal_avx2_ptr;
+
 static inline efd_value_t
 efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
 		const efd_lookuptbl_t *group_lookup_table,
 		const uint32_t hash_val_a, const uint32_t hash_val_b)
 {
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
+	return (*efd_lookup_internal_avx2_ptr)(group_hash_idx,
+					       group_lookup_table,
+					       hash_val_a, hash_val_b);
+}
+
+#ifdef CC_SUPPORT_AVX2
+static inline efd_value_t
+efd_lookup_internal_avx2_AVX2(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b)
+{
 	efd_value_t value = 0;
 	uint32_t i = 0;
 	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
@@ -74,13 +91,31 @@ efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
 	}
 
 	return value;
-#else
+}
+#endif
+
+static inline efd_value_t
+efd_lookup_internal_avx2_DEFAULT(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b)
+{
 	RTE_SET_USED(group_hash_idx);
 	RTE_SET_USED(group_lookup_table);
 	RTE_SET_USED(hash_val_a);
 	RTE_SET_USED(hash_val_b);
 	/* Return dummy value, only to avoid compilation breakage */
 	return 0;
-#endif
+}
 
+static void __attribute__((constructor))
+rte_efd_x86_init(void)
+{
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_AVX2;
+	else
+		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
+#else
+	efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
+#endif
 }
-- 
2.7.4

* Re: [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy
  2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-01 23:41   ` Ananyev, Konstantin
  2017-10-02  0:12     ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-01 23:41 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

Hi Xiaoyun,

> This patch dynamically selects memcpy functions at run-time based
> on the CPU flags that the current machine supports. It uses function
> pointers which are bound to the relevant functions at constructor time.
> In addition, the AVX512 instruction set is compiled only if users
> enable it in the config and the compiler supports it.
> 
> Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> ---
> v2
> * Use gcc function multi-versioning to avoid compilation issues.
> * Add macros for AVX512 and AVX2. Only if users enable AVX512 and the
> compiler supports it, the AVX512 codes would be compiled. Only if the
> compiler supports AVX2, the AVX2 codes would be compiled.
> 
> v3
> * Reduce function calls by keeping only rte_memcpy_xxx.
> * Use the inline code path when the copy size is small; otherwise,
> use the dynamic code path.
> * To support the target attribute, the clang version must be greater
> than 3.7. Otherwise, the SSE/AVX code path is chosen, the same as before.
> * Move two macro functions to the top of the code since they are
> used in both the inline SSE/AVX and dynamic SSE/AVX code.
> 
>  .../common/include/arch/x86/rte_memcpy.h           | 1232 ++++++++++++++++++--
>  1 file changed, 1135 insertions(+), 97 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> index 74c280c..ed6c412 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> @@ -45,6 +45,8 @@
>  #include <string.h>
>  #include <rte_vect.h>
>  #include <rte_common.h>
> +#include <rte_cpuflags.h>
> +#include <rte_log.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
> @@ -68,6 +70,100 @@ extern "C" {
>  static __rte_always_inline void *
>  rte_memcpy(void *dst, const void *src, size_t n);
> 
> +/**
> + * Macro for copying unaligned block from one location to another with constant load offset,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
> +__extension__ ({                                                                                            \
> +    int tmp;                                                                                                \
> +    while (len >= 128 + 16 - offset) {                                                                      \
> +        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
> +        len -= 128;                                                                                         \
> +        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
> +        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
> +        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
> +        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
> +        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
> +        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
> +        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
> +        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
> +        src = (const uint8_t *)src + 128;                                                                   \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
> +        dst = (uint8_t *)dst + 128;                                                                         \
> +    }                                                                                                       \
> +    tmp = len;                                                                                              \
> +    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
> +    tmp -= len;                                                                                             \
> +    src = (const uint8_t *)src + tmp;                                                                       \
> +    dst = (uint8_t *)dst + tmp;                                                                             \
> +    if (len >= 32 + 16 - offset) {                                                                          \
> +        while (len >= 32 + 16 - offset) {                                                                   \
> +            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
> +            len -= 32;                                                                                      \
> +            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
> +            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
> +            src = (const uint8_t *)src + 32;                                                                \
> +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
> +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
> +            dst = (uint8_t *)dst + 32;                                                                      \
> +        }                                                                                                   \
> +        tmp = len;                                                                                          \
> +        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
> +        tmp -= len;                                                                                         \
> +        src = (const uint8_t *)src + tmp;                                                                   \
> +        dst = (uint8_t *)dst + tmp;                                                                         \
> +    }                                                                                                       \
> +})
> +
> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Use switch here because the aligning instruction requires immediate value for shift count.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be within [1, 15]
> + * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
> +__extension__ ({                                                      \
> +    switch (offset) {                                                 \
> +    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
> +    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
> +    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
> +    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
> +    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
> +    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
> +    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
> +    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
> +    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
> +    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
> +    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
> +    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
> +    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
> +    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
> +    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
> +    default:;                                                         \
> +    }                                                                 \
> +})
> +
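
To make the tail arithmetic in MOVEUNALIGNED_LEFT47_IMM easier to check,
here is a minimal stand-alone sketch in plain C (no intrinsics). The helper
names are hypothetical and not part of the patch. It assumes
len >= 16 - offset, which is what the unsigned subtraction in the macro
needs, and compares the mask expression with what the 128-byte loop leaves
behind.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: bytes left after the 128-byte loop, the slow way. */
static size_t left_by_loop(size_t len, size_t offset)
{
	while (len >= 128 + 16 - offset)
		len -= 128;
	return len;
}

/* Hypothetical helper: same quantity via the mask expression in the macro. */
static size_t left_by_mask(size_t len, size_t offset)
{
	return ((len - 16 + offset) & 127) + 16 - offset;
}

/* Returns 1 if both forms agree for every offset in [1, 15]. */
static int tail_formulas_agree(void)
{
	for (size_t offset = 1; offset <= 15; offset++)
		for (size_t len = 16 - offset; len < 4096; len++)
			if (left_by_loop(len, offset) != left_by_mask(len, offset))
				return 0;
	return 1;
}
```

When the loop has already run to completion the mask leaves len unchanged,
so the pointer adjustment by tmp is zero. The same reasoning with 31 in
place of 127 gives the "47 bytes leftover maximum" in the comment, since
the 32-byte loop exits with len < 48 - offset.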
>  #ifdef RTE_MACHINE_CPUFLAG_AVX512F
> 
>  #define ALIGNMENT_MASK 0x3F
> @@ -589,100 +685,6 @@ rte_mov256(uint8_t *dst, const uint8_t *src)
>  	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
>  }
> 
> -/**
> - * Macro for copying unaligned block from one location to another with constant load offset,
> - * 47 bytes leftover maximum,
> - * locations should not overlap.
> - * Requirements:
> - * - Store is aligned
> - * - Load offset is <offset>, which must be immediate value within [1, 15]
> - * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> - * - <dst>, <src>, <len> must be variables
> - * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> - */
> -#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
> -__extension__ ({                                                                                            \
> -    int tmp;                                                                                                \
> -    while (len >= 128 + 16 - offset) {                                                                      \
> -        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
> -        len -= 128;                                                                                         \
> -        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
> -        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
> -        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
> -        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
> -        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
> -        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
> -        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
> -        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
> -        src = (const uint8_t *)src + 128;                                                                   \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
> -        dst = (uint8_t *)dst + 128;                                                                         \
> -    }                                                                                                       \
> -    tmp = len;                                                                                              \
> -    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
> -    tmp -= len;                                                                                             \
> -    src = (const uint8_t *)src + tmp;                                                                       \
> -    dst = (uint8_t *)dst + tmp;                                                                             \
> -    if (len >= 32 + 16 - offset) {                                                                          \
> -        while (len >= 32 + 16 - offset) {                                                                   \
> -            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
> -            len -= 32;                                                                                      \
> -            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
> -            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
> -            src = (const uint8_t *)src + 32;                                                                \
> -            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
> -            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
> -            dst = (uint8_t *)dst + 32;                                                                      \
> -        }                                                                                                   \
> -        tmp = len;                                                                                          \
> -        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
> -        tmp -= len;                                                                                         \
> -        src = (const uint8_t *)src + tmp;                                                                   \
> -        dst = (uint8_t *)dst + tmp;                                                                         \
> -    }                                                                                                       \
> -})
> -
> -/**
> - * Macro for copying unaligned block from one location to another,
> - * 47 bytes leftover maximum,
> - * locations should not overlap.
> - * Use switch here because the aligning instruction requires immediate value for shift count.
> - * Requirements:
> - * - Store is aligned
> - * - Load offset is <offset>, which must be within [1, 15]
> - * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> - * - <dst>, <src>, <len> must be variables
> - * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
> - */
> -#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
> -__extension__ ({                                                      \
> -    switch (offset) {                                                 \
> -    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
> -    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
> -    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
> -    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
> -    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
> -    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
> -    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
> -    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
> -    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
> -    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
> -    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
> -    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
> -    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
> -    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
> -    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
> -    default:;                                                         \
> -    }                                                                 \
> -})
> -
>  static inline void *
>  rte_memcpy_generic(void *dst, const void *src, size_t n)
>  {
> @@ -888,13 +890,1049 @@ rte_memcpy_aligned(void *dst, const void *src, size_t n)
>  	return ret;
>  }
> 
> +/*
> + * Run-time dispatch implementation of memcpy.
> + */
> +
> +typedef void * (*rte_memcpy_t)(void *dst, const void *src, size_t n);
> +static rte_memcpy_t rte_memcpy_ptr;
> +
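
For readers following the thread, a minimal sketch of the
function-pointer-plus-constructor pattern the cover letter describes. All
names below are hypothetical stand-ins, and the selection logic is an
assumption, since the patch's own constructor is not part of this hunk.
__builtin_cpu_init() and __builtin_cpu_supports() are the GCC/clang x86
built-ins; the guard keeps the sketch buildable on other architectures.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/* Hypothetical stand-ins for the ISA-specific variants in the patch. */
static void *copy_baseline(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_wide(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Default to the baseline so a call made before the constructor still works. */
static copy_fn_t copy_ptr = copy_baseline;

/* Runs before main(); picks a variant for the CPU the binary is running on. */
__attribute__((constructor))
static void copy_select(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		copy_ptr = copy_wide;
#endif
}

static inline void *copy_dispatch(void *dst, const void *src, size_t n)
{
	return copy_ptr(dst, src, n);
}
```

One design point worth noting: the sketch initialises the pointer
statically, whereas a pointer left NULL until constructor time would fault
if another constructor happened to call it first.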
> +/**
> + * AVX512 implementation below
> + */
> +#ifdef CC_SUPPORT_AVX512
> +__attribute__((target("avx512f")))
> +static inline void *
> +rte_memcpy_AVX512F(void *dst, const void *src, size_t n)
> +{
> +	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x3F)) {
> +		void *ret = dst;
> +
> +		/* Copy size < 16 bytes */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dst = *(const uint8_t *)src;
> +				src = (const uint8_t *)src + 1;
> +				dst = (uint8_t *)dst + 1;
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dst = *(const uint16_t *)src;
> +				src = (const uint16_t *)src + 1;
> +				dst = (uint16_t *)dst + 1;
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dst = *(const uint32_t *)src;
> +				src = (const uint32_t *)src + 1;
> +				dst = (uint32_t *)dst + 1;
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +			return ret;
> +		}
> +
> +		/* Copy 16 <= size <= 32 bytes */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 32 < size <= 64 bytes */
> +		if (n <= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src - 32 + n));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst - 32 + n), ymm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 64-byte blocks */
> +		for (; n >= 64; n -= 64) {
> +			__m512i zmm0;
> +			zmm0 = _mm512_loadu_si512((const void *)src);
> +			_mm512_storeu_si512((void *)dst, zmm0);
> +			dst = (uint8_t *)dst + 64;
> +			src = (const uint8_t *)src + 64;
> +		}
> +
> +		/* Copy whatever is left */
> +		__m512i zmm0;
> +		zmm0 = _mm512_loadu_si512((const void *)
> +			((const uint8_t *)src - 64 + n));
> +		_mm512_storeu_si512((void *)((uint8_t *)dst - 64 + n), zmm0);
> +
> +		return ret;
> +	} else {
> +		uintptr_t dstu = (uintptr_t)dst;
> +		uintptr_t srcu = (uintptr_t)src;
> +		void *ret = dst;
> +		size_t dstofss;
> +		size_t bits;
> +
> +		/**
> +		 * Copy less than 16 bytes
> +		 */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dstu = *(const uint8_t *)srcu;
> +				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint8_t *)dstu + 1);
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dstu = *(const uint16_t *)srcu;
> +				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint16_t *)dstu + 1);
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dstu = *(const uint32_t *)srcu;
> +				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint32_t *)dstu + 1);
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dstu = *(const uint64_t *)srcu;
> +			return ret;
> +		}
> +
> +		/**
> +		 * Fast way when copy size doesn't exceed 512 bytes
> +		 */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +			return ret;
> +		}
> +		if (n <= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src - 32 + n));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst - 32 + n), ymm1);
> +			return ret;
> +		}
> +		if (n <= 512) {
> +			if (n >= 256) {
> +				n -= 256;
> +				__m512i zmm0, zmm1, zmm2, zmm3;
> +				zmm0 = _mm512_loadu_si512((const void *)src);
> +				zmm1 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 64));
> +				zmm2 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 2*64));
> +				zmm3 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 3*64));
> +				_mm512_storeu_si512((void *)dst, zmm0);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 64), zmm1);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 2*64), zmm2);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 3*64), zmm3);
> +				src = (const uint8_t *)src + 256;
> +				dst = (uint8_t *)dst + 256;
> +			}
> +			if (n >= 128) {
> +				n -= 128;
> +				__m512i zmm0, zmm1;
> +				zmm0 = _mm512_loadu_si512((const void *)src);
> +				zmm1 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 64));
> +				_mm512_storeu_si512((void *)dst, zmm0);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 64), zmm1);
> +				src = (const uint8_t *)src + 128;
> +				dst = (uint8_t *)dst + 128;
> +			}
> +COPY_BLOCK_128_BACK63:
> +			if (n > 64) {
> +				__m512i zmm0, zmm1;
> +				zmm0 = _mm512_loadu_si512((const void *)src);
> +				zmm1 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src - 64 + n));
> +				_mm512_storeu_si512((void *)dst, zmm0);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst - 64 + n), zmm1);
> +				return ret;
> +			}
> +			if (n > 0) {
> +				__m512i zmm0;
> +				zmm0 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src - 64 + n));
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst - 64 + n), zmm0);
> +			}
> +			return ret;
> +		}
> +
> +		/**
> +		 * Make store aligned when copy size exceeds 512 bytes
> +		 */
> +		dstofss = ((uintptr_t)dst & 0x3F);
> +		if (dstofss > 0) {
> +			dstofss = 64 - dstofss;
> +			n -= dstofss;
> +			__m512i zmm0;
> +			zmm0 = _mm512_loadu_si512((const void *)src);
> +			_mm512_storeu_si512((void *)dst, zmm0);
> +			src = (const uint8_t *)src + dstofss;
> +			dst = (uint8_t *)dst + dstofss;
> +		}
> +
> +		/**
> +		 * Copy 512-byte blocks.
> +		 * Use copy block function for better instruction order control,
> +		 * which is important when load is unaligned.
> +		 */
> +		__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
> +
> +		while (n >= 512) {
> +			zmm0 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 0 * 64));
> +			n -= 512;
> +			zmm1 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 1 * 64));
> +			zmm2 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 2 * 64));
> +			zmm3 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 3 * 64));
> +			zmm4 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 4 * 64));
> +			zmm5 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 5 * 64));
> +			zmm6 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 6 * 64));
> +			zmm7 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 7 * 64));
> +			src = (const uint8_t *)src + 512;
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 0 * 64), zmm0);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 1 * 64), zmm1);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 2 * 64), zmm2);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 3 * 64), zmm3);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 4 * 64), zmm4);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 5 * 64), zmm5);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 6 * 64), zmm6);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 7 * 64), zmm7);
> +			dst = (uint8_t *)dst + 512;
> +		}
> +		bits = n;
> +		n = n & 511;
> +		bits -= n;
> +		src = (const uint8_t *)src + bits;
> +		dst = (uint8_t *)dst + bits;
> +
> +		/**
> +		 * Copy 128-byte blocks.
> +		 * Use copy block function for better instruction order control,
> +		 * which is important when load is unaligned.
> +		 */
> +		if (n >= 128) {
> +			__m512i zmm0, zmm1;
> +
> +			while (n >= 128) {
> +				zmm0 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 0 * 64));
> +				n -= 128;
> +				zmm1 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 1 * 64));
> +				src = (const uint8_t *)src + 128;
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 0 * 64), zmm0);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 1 * 64), zmm1);
> +				dst = (uint8_t *)dst + 128;
> +			}
> +			bits = n;
> +			n = n & 127;
> +			bits -= n;
> +			src = (const uint8_t *)src + bits;
> +			dst = (uint8_t *)dst + bits;
> +		}
> +
> +		/**
> +		 * Copy whatever is left
> +		 */
> +		goto COPY_BLOCK_128_BACK63;
> +	}
> +}
> +#endif
> +
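
The 16..32 and 32..64 byte branches above rely on two fixed-size moves, one
anchored at the start and one at the end of the buffer, which may overlap
in the middle. A minimal portable sketch of that idea follows; the function
name is hypothetical, and fixed-size memcpy() calls are swapped in for the
unaligned SSE loads and stores used in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies n bytes for 16 <= n <= 32 using two 16-byte moves that may overlap. */
static void *copy16_to_32(void *dst, const void *src, size_t n)
{
	uint8_t head[16], tail[16];

	assert(n >= 16 && n <= 32);
	/* Load both blocks first, as the patch does, then store them. */
	memcpy(head, src, 16);
	memcpy(tail, (const uint8_t *)src + n - 16, 16);
	memcpy(dst, head, 16);
	memcpy((uint8_t *)dst + n - 16, tail, 16);
	return dst;
}
```

Because both stores write correct data, the overlapping bytes are simply
written twice, and no per-length branching is needed inside the range.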
> +/**
> + * AVX2 implementation below
> + */
> +#ifdef CC_SUPPORT_AVX2
> +__attribute__((target("avx2")))
> +static inline void *
> +rte_memcpy_AVX2(void *dst, const void *src, size_t n)
> +{
> +	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x1F)) {
> +		void *ret = dst;
> +
> +		/* Copy size < 16 bytes */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dst = *(const uint8_t *)src;
> +				src = (const uint8_t *)src + 1;
> +				dst = (uint8_t *)dst + 1;
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dst = *(const uint16_t *)src;
> +				src = (const uint16_t *)src + 1;
> +				dst = (uint16_t *)dst + 1;
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dst = *(const uint32_t *)src;
> +				src = (const uint32_t *)src + 1;
> +				dst = (uint32_t *)dst + 1;
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +			return ret;
> +		}
> +
> +		/* Copy 16 <= size <= 32 bytes */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 32 < size <= 64 bytes */
> +		if (n <= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src - 32 + n));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst - 32 + n), ymm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 64-byte blocks */
> +		for (; n >= 64; n -= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 32));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 32), ymm1);
> +			dst = (uint8_t *)dst + 64;
> +			src = (const uint8_t *)src + 64;
> +		}
> +
> +		/* Copy whatever is left */
> +		__m256i ymm0, ymm1;
> +		ymm0 = _mm256_loadu_si256((const __m256i *)
> +			((const uint8_t *)src - 64 + n));
> +		ymm1 = _mm256_loadu_si256((const __m256i *)
> +			((const uint8_t *)src - 32 + n));
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst - 64 + n), ymm0);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst - 32 + n), ymm1);
> +
> +		return ret;
> +	} else {
> +		uintptr_t dstu = (uintptr_t)dst;
> +		uintptr_t srcu = (uintptr_t)src;
> +		void *ret = dst;
> +		size_t dstofss;
> +		size_t bits;
> +
> +		/**
> +		 * Copy less than 16 bytes
> +		 */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dstu = *(const uint8_t *)srcu;
> +				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint8_t *)dstu + 1);
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dstu = *(const uint16_t *)srcu;
> +				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint16_t *)dstu + 1);
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dstu = *(const uint32_t *)srcu;
> +				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint32_t *)dstu + 1);
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dstu = *(const uint64_t *)srcu;
> +			return ret;
> +		}
> +
> +		/**
> +		 * Fast way when copy size doesn't exceed 256 bytes
> +		 */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +			return ret;
> +		}
> +		if (n <= 48) {
> +			__m128i xmm0, xmm1, xmm2;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm2);
> +			return ret;
> +		}
> +		if (n <= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src - 32 + n));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst - 32 + n), ymm1);
> +			return ret;
> +		}
> +		if (n <= 256) {
> +			if (n >= 128) {
> +				n -= 128;
> +				__m256i ymm0, ymm1, ymm2, ymm3;
> +				ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +				ymm1 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src + 32));
> +				ymm2 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src + 2*32));
> +				ymm3 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src + 3*32));
> +				_mm256_storeu_si256((__m256i *)dst, ymm0);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst + 32), ymm1);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst + 2*32), ymm2);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst + 3*32), ymm3);
> +				src = (const uint8_t *)src + 128;
> +				dst = (uint8_t *)dst + 128;
> +			}
> +COPY_BLOCK_128_BACK31:
> +			if (n >= 64) {
> +				n -= 64;
> +				__m256i ymm0, ymm1;
> +				ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +				ymm1 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src + 32));
> +				_mm256_storeu_si256((__m256i *)dst, ymm0);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst + 32), ymm1);
> +				src = (const uint8_t *)src + 64;
> +				dst = (uint8_t *)dst + 64;
> +			}
> +			if (n > 32) {
> +				__m256i ymm0, ymm1;
> +				ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +				ymm1 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src - 32 + n));
> +				_mm256_storeu_si256((__m256i *)dst, ymm0);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst - 32 + n), ymm1);
> +				return ret;
> +			}
> +			if (n > 0) {
> +				__m256i ymm0;
> +				ymm0 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src - 32 + n));
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst - 32 + n), ymm0);
> +			}
> +			return ret;
> +		}
> +
> +		/**
> +		 * Make store aligned when copy size exceeds 256 bytes
> +		 */
> +		dstofss = (uintptr_t)dst & 0x1F;
> +		if (dstofss > 0) {
> +			dstofss = 32 - dstofss;
> +			n -= dstofss;
> +			__m256i ymm0;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			src = (const uint8_t *)src + dstofss;
> +			dst = (uint8_t *)dst + dstofss;
> +		}
> +
> +		/**
> +		 * Copy 128-byte blocks
> +		 */
> +		__m256i ymm0, ymm1, ymm2, ymm3;
> +
> +		while (n >= 128) {
> +			ymm0 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 0 * 32));
> +			n -= 128;
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 1 * 32));
> +			ymm2 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 2 * 32));
> +			ymm3 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 3 * 32));
> +			src = (const uint8_t *)src + 128;
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 0 * 32), ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 1 * 32), ymm1);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 2 * 32), ymm2);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 3 * 32), ymm3);
> +			dst = (uint8_t *)dst + 128;
> +		}
> +		bits = n;
> +		n = n & 127;
> +		bits -= n;
> +		src = (const uint8_t *)src + bits;
> +		dst = (uint8_t *)dst + bits;
> +
> +		/**
> +		 * Copy whatever is left
> +		 */
> +		goto COPY_BLOCK_128_BACK31;
> +	}
> +}
> +#endif
> +
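
The n < 16 branch that each variant repeats decomposes n into its binary
digits and copies 1, 2, 4 and 8 bytes as the corresponding bits are set. A
minimal sketch, with a hypothetical function name, and with fixed-size
memcpy() calls swapped in for the casted pointer stores in the patch so
that it does not depend on unaligned typed accesses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copies n bytes for n < 16 by testing each bit of n in turn. */
static void *copy_lt16(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	assert(n < 16);
	if (n & 0x01) {
		memcpy(d, s, 1);
		d += 1;
		s += 1;
	}
	if (n & 0x02) {
		memcpy(d, s, 2);
		d += 2;
		s += 2;
	}
	if (n & 0x04) {
		memcpy(d, s, 4);
		d += 4;
		s += 4;
	}
	if (n & 0x08)
		memcpy(d, s, 8);
	return dst;
}
```

At most four moves are issued for any length below 16, and the branch
pattern depends only on n.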
> +/**
> + * SSE & AVX implementation below
> + */
> +static inline void *
> +rte_memcpy_DEFAULT(void *dst, const void *src, size_t n)
> +{
> +	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x0F)) {
> +		void *ret = dst;
> +
> +		/* Copy size < 16 bytes */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dst = *(const uint8_t *)src;
> +				src = (const uint8_t *)src + 1;
> +				dst = (uint8_t *)dst + 1;
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dst = *(const uint16_t *)src;
> +				src = (const uint16_t *)src + 1;
> +				dst = (uint16_t *)dst + 1;
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dst = *(const uint32_t *)src;
> +				src = (const uint32_t *)src + 1;
> +				dst = (uint32_t *)dst + 1;
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +			return ret;
> +		}
> +
> +		/* Copy 16 <= size <= 32 bytes */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 32 < size <= 64 bytes */
> +		if (n <= 64) {
> +			__m128i xmm0, xmm1, xmm2, xmm3;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 32 + n));
> +			xmm3 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 32 + n), xmm2);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm3);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 64 bytes blocks */
> +		for (; n >= 64; n -= 64) {
> +			__m128i xmm0, xmm1, xmm2, xmm3;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 2*16));
> +			xmm3 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 3*16));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 2*16), xmm2);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 3*16), xmm3);
> +			dst = (uint8_t *)dst + 64;
> +			src = (const uint8_t *)src + 64;
> +		}
> +
> +		/* Copy whatever left */
> +		__m128i xmm0, xmm1, xmm2, xmm3;
> +		xmm0 = _mm_loadu_si128((const __m128i *)
> +			((const uint8_t *)src - 64 + n));
> +		xmm1 = _mm_loadu_si128((const __m128i *)
> +			((const uint8_t *)src - 48 + n));
> +		xmm2 = _mm_loadu_si128((const __m128i *)
> +			((const uint8_t *)src - 32 + n));
> +		xmm3 = _mm_loadu_si128((const __m128i *)
> +			((const uint8_t *)src - 16 + n));
> +		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 64 + n), xmm0);
> +		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 48 + n), xmm1);
> +		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 32 + n), xmm2);
> +		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 16 + n), xmm3);
> +
> +		return ret;
> +	} else {
> +		__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> +		uintptr_t dstu = (uintptr_t)dst;
> +		uintptr_t srcu = (uintptr_t)src;
> +		void *ret = dst;
> +		size_t dstofss;
> +		size_t srcofs;
> +
> +		/**
> +		 * Copy less than 16 bytes
> +		 */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dstu = *(const uint8_t *)srcu;
> +				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint8_t *)dstu + 1);
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dstu = *(const uint16_t *)srcu;
> +				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint16_t *)dstu + 1);
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dstu = *(const uint32_t *)srcu;
> +				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint32_t *)dstu + 1);
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dstu = *(const uint64_t *)srcu;
> +			return ret;
> +		}
> +
> +		/**
> +		 * Fast way when copy size doesn't exceed 512 bytes
> +		 */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +			return ret;
> +		}
> +		if (n <= 48) {
> +			__m128i xmm0, xmm1, xmm2;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm2);
> +			return ret;
> +		}
> +		if (n <= 64) {
> +			__m128i xmm0, xmm1, xmm2, xmm3;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 32));
> +			xmm3 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 32), xmm2);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm3);
> +			return ret;
> +		}
> +		if (n <= 128)
> +			goto COPY_BLOCK_128_BACK15;
> +		if (n <= 512) {
> +			if (n >= 256) {
> +				n -= 256;
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 3*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 4*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 5*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 4*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 5*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 6*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 7*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 6*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 7*16), xmm1);
> +
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 3*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 4*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 5*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 4*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 5*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 6*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 7*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 6*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 7*16), xmm1);
> +				src = (const uint8_t *)src + 256;
> +				dst = (uint8_t *)dst + 256;
> +			}
> +COPY_BLOCK_255_BACK15:
> +			if (n >= 128) {
> +				n -= 128;
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 3*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 4*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 5*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 4*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 5*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 6*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 7*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 6*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 7*16), xmm1);
> +				src = (const uint8_t *)src + 128;
> +				dst = (uint8_t *)dst + 128;
> +			}
> +COPY_BLOCK_128_BACK15:
> +			if (n >= 64) {
> +				n -= 64;
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 3*16), xmm1);
> +				src = (const uint8_t *)src + 64;
> +				dst = (uint8_t *)dst + 64;
> +			}
> +COPY_BLOCK_64_BACK15:
> +			if (n >= 32) {
> +				n -= 32;
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +				src = (const uint8_t *)src + 32;
> +				dst = (uint8_t *)dst + 32;
> +			}
> +			if (n > 16) {
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src - 16 + n));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst - 16 + n), xmm1);
> +				return ret;
> +			}
> +			if (n > 0) {
> +				__m128i xmm0;
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src - 16 + n));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst - 16 + n), xmm0);
> +			}
> +			return ret;
> +		}
> +
> +		/**
> +		 * Make store aligned when copy size exceeds 512 bytes,
> +		 * and make sure the first 15 bytes are copied, because
> +		 * unaligned copy functions require up to 15 bytes
> +		 * backwards access.
> +		 */
> +		dstofss = (uintptr_t)dst & 0x0F;
> +		if (dstofss > 0) {
> +			dstofss = 16 - dstofss + 16;
> +			n -= dstofss;
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			src = (const uint8_t *)src + dstofss;
> +			dst = (uint8_t *)dst + dstofss;
> +		}
> +		srcofs = ((uintptr_t)src & 0x0F);
> +
> +		/**
> +		 * For aligned copy
> +		 */
> +		if (srcofs == 0) {
> +			/**
> +			 * Copy 256-byte blocks
> +			 */
> +			for (; n >= 256; n -= 256) {
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 3*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 4*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 5*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 4*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 5*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 6*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 7*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 6*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 7*16), xmm1);
> +
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 8*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 9*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 8*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 9*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 10*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 11*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 10*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 11*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 12*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 13*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 12*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 13*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 14*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 15*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 14*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 15*16), xmm1);
> +				dst = (uint8_t *)dst + 256;
> +				src = (const uint8_t *)src + 256;
> +			}
> +
> +			/**
> +			 * Copy whatever left
> +			 */
> +			goto COPY_BLOCK_255_BACK15;
> +		}
> +
> +		/**
> +		 * For copy with unaligned load
> +		 */
> +		MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> +
> +		/**
> +		 * Copy whatever left
> +		 */
> +		goto COPY_BLOCK_64_BACK15;
> +	}
> +}
> +
> +static void __attribute__((constructor))

That means that each file with '#include <rte_memcpy.h>' will have its own copy
of that function:
$ objdump -d  x86_64-native-linuxapp-gcc/app/testpmd  | grep '<rte_memcpy_init>:' | sort -u | wc -l
233
Same story for rte_memcpy_ptr and rte_memcpy_DEFAULT, etc...
Obviously we need (and want) only one copy of that stuff per binary.
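
A minimal sketch of the single-copy layout, assuming hypothetical names
(copy_fn_t, copy_ptr, copy_default); none of them come from the patch.
The header part only declares the pointer and exactly one .c file defines it,
so the linker sees one instance per binary instead of one per including file.
It is shown as a single translation unit so that it compiles on its own:

```c
#include <stddef.h>
#include <string.h>

/* ---- would live in the header (hypothetical names) ---- */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/* Declaration only: no storage is created in files that include this. */
extern copy_fn_t copy_ptr;

/* ---- would live in exactly one .c file ---- */
static void *
copy_default(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* The single definition that every including file links against. */
copy_fn_t copy_ptr = copy_default;
```

With 'static' in the header instead of 'extern', every including file would
get its own private pointer (and its own constructor), which is what the
objdump count above reflects.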


> +rte_memcpy_init(void)
> +{

> +#ifdef CC_SUPPORT_AVX512
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
> +		rte_memcpy_ptr = rte_memcpy_AVX512F;
> +		RTE_LOG(DEBUG, EAL, "AVX512 is using!\n");
> +	} else
> +#endif
> +#ifdef CC_SUPPORT_AVX2

Why do you assume this macro will be defined?
By whom?
There is no such macro with gcc:
$ gcc -march=native -dM -E - </dev/null 2>&1 | grep AVX2
#define __AVX2__ 1
, and you don't define it yourself.
When building with '-march=native' on BDW only rte_memcpy_DEFAULT gets compiled.
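
To illustrate the difference, a small sketch: __AVX2__ is predefined by gcc
when AVX2 code generation is enabled for the compilation, whereas
CC_SUPPORT_AVX2 exists only if the build system passes -DCC_SUPPORT_AVX2.
The helper names below are hypothetical and only report which macros were
visible at compile time:

```c
/* Returns 1 if the compiler itself enabled AVX2 (e.g. -mavx2, or
 * -march=native on an AVX2 machine), 0 otherwise. */
static int
built_with_avx2(void)
{
#ifdef __AVX2__
	return 1;
#else
	return 0;
#endif
}

/* Returns 1 only if the build system passed -DCC_SUPPORT_AVX2;
 * the compiler never defines this macro on its own. */
static int
cc_support_avx2_is_defined(void)
{
#ifdef CC_SUPPORT_AVX2
	return 1;
#else
	return 0;
#endif
}
```

Without a -DCC_SUPPORT_AVX2 on the command line the second helper returns 0
regardless of the -march setting, so any code guarded by that macro is dropped.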

To summarize: as I understand it, the goal of this patch was
(assuming that our current rte_memcpy() implementation is good in terms of both performance and functionality):
1. Based on the current rte_memcpy() implementation, define 3 x86 arch-specific rte_memcpy flavors:
  a) rte_memcpy_SSE
  b) rte_memcpy_AVX2
  c) rte_memcpy_AVX512
2. Select the appropriate flavor based on the current HW at runtime,
    i.e. all 3 flavors should be present in the binary and the selection should be made
    at program startup.

As far as I can see, neither of these goals was achieved with the current patch;
instead a lot of redundant code was introduced.
So I think it is a NACK for the current version.
What I think needs to be done instead:

1.  mv lib/librte_eal/common/include/arch/x86/rte_memcpy.h  lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
2. Inside rte_memcpy_internal.h, rename rte_memcpy() to rte_memcpy_internal().
3. Create 3 files:
    rte_memcpy_sse.c
    rte_memcpy_avx2.c
    rte_memcpy_avx512.c

Inside each of these files we define the corresponding rte_memcpy_xxx() function.
E.g.:
rte_memcpy_avx2.c:
....
#ifndef RTE_MACHINE_CPUFLAG_AVX2
#error "no avx2 support"
#endif

#include "rte_memcpy_internal.h"
...

void *
rte_memcpy_avx2(void *dst, const void *src, size_t n)
{
   return rte_memcpy_internal(dst, src, n); 
}

4. Make changes inside lib/librte_eal/Makefile to ensure that each rte_memcpy_xxx()
gets built with the appropriate -march flags (e.g. avx2 with -mavx2, etc.)
You can use librte_acl/Makefile as a reference.

5. Create rte_memcpy.c and put rte_memcpy_ptr/rte_memcpy_init() definitions in that file.
6. Create new rte_memcpy.h  and define rte_memcpy() in it:

...
#include <rte_memcpy_internal.h>
...

#define RTE_X86_MEMCPY_THRESH 128
static inline void *
rte_memcpy(void *dst, const void *src, size_t n)
{
	if (n <= RTE_X86_MEMCPY_THRESH)
		return rte_memcpy_internal(dst, src, n);
	else
		return (*rte_memcpy_ptr)(dst, src, n);
}

7. Test it properly - i.e. build DPDK with the default target and make sure each of the 3 flavors
can be selected properly at runtime based on the underlying arch.

8. As a possible future improvement - with such changes we don't need a generic inline
implementation. We can think about creating a faster version that only needs to handle
copies of <= 128B.
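
Steps 5 and 6 could be sketched roughly as below. This is only an
illustration under stated assumptions: all names ending in _sketch and the
flavor_* functions are hypothetical stand-ins (each one simply calls memcpy()),
and the CPU check uses the gcc/clang builtin __builtin_cpu_supports() rather
than rte_cpu_get_flag_enabled(), so that the example builds without DPDK.
In the proposed layout the pointer and the constructor would sit in
rte_memcpy.c, not in a header.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*memcpy_fn_t)(void *dst, const void *src, size_t n);

/* Stand-ins for the per-ISA flavors; each just calls memcpy() here. */
static void *
flavor_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *
flavor_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *
flavor_avx512(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Safe default so that calls made before the constructor still work. */
static memcpy_fn_t memcpy_ptr_sketch = flavor_sse;

/* Runs once at program startup and binds the best available flavor. */
static void __attribute__((constructor))
memcpy_init_sketch(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f"))
		memcpy_ptr_sketch = flavor_avx512;
	else if (__builtin_cpu_supports("avx2"))
		memcpy_ptr_sketch = flavor_avx2;
	else
#endif
		memcpy_ptr_sketch = flavor_sse;
}

#define MEMCPY_THRESH_SKETCH 128

/* Small copies stay inline; larger ones go through the pointer. */
static inline void *
memcpy_dispatch_sketch(void *dst, const void *src, size_t n)
{
	if (n <= MEMCPY_THRESH_SKETCH)
		return memcpy(dst, src, n);
	return (*memcpy_ptr_sketch)(dst, src, n);
}
```

The threshold keeps the indirect call off the small-copy path, where the call
overhead would be most visible.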

Konstantin

> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
> +		rte_memcpy_ptr = rte_memcpy_AVX2;
> +		RTE_LOG(DEBUG, EAL, "AVX2 is using!\n");
> +	} else
> +#endif
> +	{
> +		rte_memcpy_ptr = rte_memcpy_DEFAULT;
> +		RTE_LOG(DEBUG, EAL, "Default SSE/AVX is using!\n");
> +	}
> +}
> +
> +#define MEMCPY_THRESH 128
>  static inline void *
>  rte_memcpy(void *dst, const void *src, size_t n)
>  {
> -	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> -		return rte_memcpy_aligned(dst, src, n);
> +	if (n <= MEMCPY_THRESH) {
> +		if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> +			return rte_memcpy_aligned(dst, src, n);
> +		else
> +			return rte_memcpy_generic(dst, src, n);
> +	}
>  	else
> -		return rte_memcpy_generic(dst, src, n);
> +		return (*rte_memcpy_ptr)(dst, src, n);
>  }
> 
>  #ifdef __cplusplus
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v3 3/3] efd: run-time dispatch over x86 EFD functions
  2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
@ 2017-10-02  0:08   ` Ananyev, Konstantin
  2017-10-02  0:09     ` Li, Xiaoyun
  2017-10-02  9:35     ` Ananyev, Konstantin
  0 siblings, 2 replies; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-02  0:08 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev



> 
> This patch dynamically selects x86 EFD functions at run-time.

I don't think it really does.
In fact, I am not sure that we need to touch EFD at all here -
from what I can see, it already does dynamic selection properly.
Konstantin 

> This patch uses function pointer and binds it to the relative
> function based on CPU flags at constructor time.
> 
> Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> ---
>  lib/librte_efd/rte_efd_x86.h | 41 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
> index 34f37d7..93b6743 100644
> --- a/lib/librte_efd/rte_efd_x86.h
> +++ b/lib/librte_efd/rte_efd_x86.h
> @@ -43,12 +43,29 @@
>  #define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
>  #endif
> 
> +typedef efd_value_t
> +(*efd_lookup_internal_avx2_t)(const efd_hashfunc_t *group_hash_idx,
> +		const efd_lookuptbl_t *group_lookup_table,
> +		const uint32_t hash_val_a, const uint32_t hash_val_b);
> +
> +static efd_lookup_internal_avx2_t efd_lookup_internal_avx2_ptr;
> +
>  static inline efd_value_t
>  efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
>  		const efd_lookuptbl_t *group_lookup_table,
>  		const uint32_t hash_val_a, const uint32_t hash_val_b)
>  {
> -#ifdef RTE_MACHINE_CPUFLAG_AVX2
> +	return (*efd_lookup_internal_avx2_ptr)(group_hash_idx,
> +					       group_lookup_table,
> +					       hash_val_a, hash_val_b);
> +}
> +
> +#ifdef CC_SUPPORT_AVX2
> +static inline efd_value_t
> +efd_lookup_internal_avx2_AVX2(const efd_hashfunc_t *group_hash_idx,
> +		const efd_lookuptbl_t *group_lookup_table,
> +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> +{
>  	efd_value_t value = 0;
>  	uint32_t i = 0;
>  	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
> @@ -74,13 +91,31 @@ efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
>  	}
> 
>  	return value;
> -#else
> +}
> +#endif
> +
> +static inline efd_value_t
> +efd_lookup_internal_avx2_DEFAULT(const efd_hashfunc_t *group_hash_idx,
> +		const efd_lookuptbl_t *group_lookup_table,
> +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> +{
>  	RTE_SET_USED(group_hash_idx);
>  	RTE_SET_USED(group_lookup_table);
>  	RTE_SET_USED(hash_val_a);
>  	RTE_SET_USED(hash_val_b);
>  	/* Return dummy value, only to avoid compilation breakage */
>  	return 0;
> -#endif
> +}
> 
> +static void __attribute__((constructor))
> +rte_efd_x86_init(void)
> +{
> +#ifdef CC_SUPPORT_AVX2
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> +		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_AVX2;
> +	else
> +		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> +#else
> +	efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> +#endif
>  }
> --
> 2.7.4


* Re: [dpdk-dev] [PATCH v3 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-02  0:08   ` Ananyev, Konstantin
@ 2017-10-02  0:09     ` Li, Xiaoyun
  2017-10-02  9:35     ` Ananyev, Konstantin
  1 sibling, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-02  0:09 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

OK.
Won't touch it in next version.

Best Regards,
Xiaoyun Li



> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Monday, October 2, 2017 08:08
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org
> Subject: RE: [PATCH v3 3/3] efd: run-time dispatch over x86 EFD functions
> 
> 
> 
> >
> > This patch dynamically selects x86 EFD functions at run-time.
> 
> I don't think it really does.
> In fact, I am not sure that we need to touch EFD at all here - from what I can
> see, it already does dynamic selection properly.
> Konstantin
> 
> > This patch uses function pointer and binds it to the relative function
> > based on CPU flags at constructor time.
> >
> > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > ---
> >  lib/librte_efd/rte_efd_x86.h | 41 ++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 38 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
> > index 34f37d7..93b6743 100644
> > --- a/lib/librte_efd/rte_efd_x86.h
> > +++ b/lib/librte_efd/rte_efd_x86.h
> > @@ -43,12 +43,29 @@
> >  #define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
> >  #endif
> >
> > +typedef efd_value_t
> > +(*efd_lookup_internal_avx2_t)(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b);
> > +
> > +static efd_lookup_internal_avx2_t efd_lookup_internal_avx2_ptr;
> > +
> >  static inline efd_value_t
> >  efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> >  		const efd_lookuptbl_t *group_lookup_table,
> >  		const uint32_t hash_val_a, const uint32_t hash_val_b)
> >  {
> > -#ifdef RTE_MACHINE_CPUFLAG_AVX2
> > +	return (*efd_lookup_internal_avx2_ptr)(group_hash_idx,
> > +					       group_lookup_table,
> > +					       hash_val_a, hash_val_b);
> > +}
> > +
> > +#ifdef CC_SUPPORT_AVX2
> > +static inline efd_value_t
> > +efd_lookup_internal_avx2_AVX2(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> > +{
> >  	efd_value_t value = 0;
> >  	uint32_t i = 0;
> >  	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
> > @@ -74,13 +91,31 @@ efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> >  	}
> >
> >  	return value;
> > -#else
> > +}
> > +#endif
> > +
> > +static inline efd_value_t
> > +efd_lookup_internal_avx2_DEFAULT(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> > +{
> >  	RTE_SET_USED(group_hash_idx);
> >  	RTE_SET_USED(group_lookup_table);
> >  	RTE_SET_USED(hash_val_a);
> >  	RTE_SET_USED(hash_val_b);
> >  	/* Return dummy value, only to avoid compilation breakage */
> >  	return 0;
> > -#endif
> > +}
> >
> > +static void __attribute__((constructor))
> > +rte_efd_x86_init(void)
> > +{
> > +#ifdef CC_SUPPORT_AVX2
> > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > +		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_AVX2;
> > +	else
> > +		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> > +#else
> > +	efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> > +#endif
> >  }
> > --
> > 2.7.4


* Re: [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-01 23:41   ` Ananyev, Konstantin
@ 2017-10-02  0:12     ` Li, Xiaoyun
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-02  0:12 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

Hi

> That means that each file with '#include <rte_memcpy.h>' will have its own
> copy
> of that function:
> $ objdump -d  x86_64-native-linuxapp-gcc/app/testpmd  | grep
> '<rte_memcpy_init>:' | sort -u | wc -l
> 233
> Same story for rte_memcpy_ptr and rte_memcpy_DEFAULT, etc...
> Obviously we need (and want) only one copy of that stuff per binary.
> 
> > +#ifdef CC_SUPPORT_AVX2
> 
> Why do you assume this macro will be defined?
> By whom?
> There is no such macro with gcc:
> $ gcc -march=native -dM -E - </dev/null 2>&1 | grep AVX2
> #define __AVX2__ 1
> , and you don't define it yourself.
> When building with '-march=native' on BDW only rte_memcpy_DEFAULT get
> compiled.
> 

I defined it myself, but when I sorted the patches I forgot to include that file in this version. Sorry about that.
It should look like this, to check whether the compiler supports AVX2 or AVX512:

diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index a813c91..92399ec 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -141,3 +141,17 @@ space:= $(empty) $(empty)
 CPUFLAGSTMP1 := $(addprefix RTE_CPUFLAG_,$(CPUFLAGS))
 CPUFLAGSTMP2 := $(subst $(space),$(comma),$(CPUFLAGSTMP1))
 CPUFLAGS_LIST := -DRTE_COMPILE_TIME_CPUFLAGS=$(CPUFLAGSTMP2)
+
+# Check if the compiler supports AVX512.
+CC_SUPPORT_AVX512 := $(shell $(CC) -march=skylake-avx512 -dM -E - < /dev/null 2>&1 | grep -q AVX512 && echo 1)
+ifeq ($(CC_SUPPORT_AVX512),1)
+ifeq ($(CONFIG_RTE_ENABLE_AVX512),y)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX512
+endif
+endif
+
+# Check if the compiler supports AVX2.
+CC_SUPPORT_AVX2 := $(shell $(CC) -march=core-avx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
+ifeq ($(CC_SUPPORT_AVX2),1)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX2
+endif


> To summarize: as I understand the goal of that patch was
> (assuming that our current rte_memcpy() implementation is good in terms of
> both performance and functionality):
> 1. Based on current rte_memcpy() implementation define 3 x86 arch specific
> rte_memcpy flavors:
>   a) rte_memcpy_SSE
>   b) rte_memcpy_AVX2
>   c) rte_memcpy_AVX512
> 2. Select appropriate flavor based on current HW at runtime,
>     i.e. both 3 flavors should be present in the binary and selection should be
> made
>     at program startup.
> 
> As I can see none of the goals was achieved with the current patch,
> instead a lot of redundant code was introduced.
> So I think it is NACK for the current version.
> What I think need to be done instead:
> 
> 1.  mv lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
> 2. inside rte_memcpy_internal.h rename rte_memcpy() into
> rte_memcpy_internal().
> 3. create 3 files:
>     rte_memcpy_sse.c
>     rte_memcpy_avx2.c
>     rte_memcpy_avx512.c
> 
> Inside each of these files we define corresponding rte_memcpy_xxx()
> function.
> I.E:
> rte_memcpy_avx2.c:
> ....
> #ifndef RTE_MACHINE_CPUFLAG_AVX2
> #error "no avx2 support"
> #endif
> 
> #include "rte_memcpy_internal.h"
> ...
> 
> void *
> rte_memcpy_avx2(void *dst, const void *src, size_t n)
> {
>    return rte_memcpy_internal(dst, src, n);
> }
> 
> 4. Make changes inside lib/librte_eal/Makefile to ensure that each of
> rte_memcpy_xxx()
> get build with appropriate -march flags (I.E: avx2 with -mavx2, etc.)
> You can use librte_acl/Makefile as a reference.
> 
> 5. Create rte_memcpy.c and put rte_memcpy_ptr/rte_memcpy_init()
> definitions in that file.
> 6. Create new rte_memcpy.h  and define rte_memcpy() in it:
> 
> ...
> #include <rte_memcpy_internal.h>
> ...
> 
> +#define RTE_X86_MEMCPY_THRESH 128
> static inline void *
> rte_memcpy(void *dst, const void *src, size_t n)
> {
> 	if (n <= RTE_X86_MEMCPY_THRESH)
>                   return rte_memcpy_internal(dst, src, n);
>                else
> 	   return (*rte_memcpy_ptr)(dst, src, n);
> }
> 
> 7. Test it properly - i.e. build dpdk with default target and make sure each of
> 3 flavors
> could be selected properly at runtime based on underlying arch.
> 
> 8. As a possible future improvement - with such changes we don't need a
> generic inline
> implementation. We can think about creating a faster version that need to
> copy
> <= 128B.
> 
> Konstantin
> 

I will modify it in the next version.
Thanks.


* Re: [dpdk-dev] [PATCH v3 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-02  0:08   ` Ananyev, Konstantin
  2017-10-02  0:09     ` Li, Xiaoyun
@ 2017-10-02  9:35     ` Ananyev, Konstantin
  1 sibling, 0 replies; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-02  9:35 UTC (permalink / raw)
  To: Ananyev, Konstantin, Li, Xiaoyun, Richardson, Bruce
  Cc: Lu, Wenzhuo, Zhang, Helin, dev


> 
> >
> > This patch dynamically selects x86 EFD functions at run-time.
> 
> I don't think it really does.
> In fact, I am not sure that we need to touch EFD at all here -
> from what I can see, it already does dynamic selection properly.

Actually I was wrong here - in some cases it doesn't work properly.
As far as I can see, for the default target the proper AVX2 code wouldn't be compiled.
So some work is still needed here - same as for memcpy().
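
One way to get an AVX2 body compiled even when the default target lacks AVX2
is a per-function target attribute (the cover letter mentions attribute
target; building the file with -mavx2 would be the per-file alternative).
The sketch below is not the EFD code: sum8_avx2, sum8_scalar and sum8 are
hypothetical names, and the function just adds eight 32-bit values so that
the example stays short.

```c
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Compiled with AVX2 enabled for this function only, whatever the
 * -march setting of the file is. */
__attribute__((target("avx2")))
static uint32_t
sum8_avx2(const uint32_t *v)
{
	__m256i x = _mm256_loadu_si256((const __m256i *)v);
	__m128i lo = _mm256_castsi256_si128(x);
	__m128i hi = _mm256_extracti128_si256(x, 1);
	__m128i s = _mm_add_epi32(lo, hi);	/* 4 partial sums */

	/* Fold 4 lanes into lane 0. */
	s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0x4E));
	s = _mm_add_epi32(s, _mm_shuffle_epi32(s, 0xB1));
	return (uint32_t)_mm_cvtsi128_si32(s);
}
#endif

/* Portable fallback, always compiled. */
static uint32_t
sum8_scalar(const uint32_t *v)
{
	uint32_t s = 0;
	int i;

	for (i = 0; i < 8; i++)
		s += v[i];
	return s;
}

/* Run-time selection: the AVX2 body is only called if the CPU has AVX2. */
static uint32_t
sum8(const uint32_t *v)
{
#if defined(__x86_64__) || defined(__i386__)
	if (__builtin_cpu_supports("avx2"))
		return sum8_avx2(v);
#endif
	return sum8_scalar(v);
}
```

Both paths exist in the binary regardless of the build target, which is the
property that is missing when the AVX2 body is guarded by
RTE_MACHINE_CPUFLAG_AVX2 alone.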
Konstantin


> Konstantin
> 
> > This patch uses function pointer and binds it to the relative
> > function based on CPU flags at constructor time.
> >
> > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > ---
> >  lib/librte_efd/rte_efd_x86.h | 41 ++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 38 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
> > index 34f37d7..93b6743 100644
> > --- a/lib/librte_efd/rte_efd_x86.h
> > +++ b/lib/librte_efd/rte_efd_x86.h
> > @@ -43,12 +43,29 @@
> >  #define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
> >  #endif
> >
> > +typedef efd_value_t
> > +(*efd_lookup_internal_avx2_t)(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b);
> > +
> > +static efd_lookup_internal_avx2_t efd_lookup_internal_avx2_ptr;
> > +
> >  static inline efd_value_t
> >  efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> >  		const efd_lookuptbl_t *group_lookup_table,
> >  		const uint32_t hash_val_a, const uint32_t hash_val_b)
> >  {
> > -#ifdef RTE_MACHINE_CPUFLAG_AVX2
> > +	return (*efd_lookup_internal_avx2_ptr)(group_hash_idx,
> > +					       group_lookup_table,
> > +					       hash_val_a, hash_val_b);
> > +}
> > +
> > +#ifdef CC_SUPPORT_AVX2
> > +static inline efd_value_t
> > +efd_lookup_internal_avx2_AVX2(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> > +{
> >  	efd_value_t value = 0;
> >  	uint32_t i = 0;
> >  	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
> > @@ -74,13 +91,31 @@ efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> >  	}
> >
> >  	return value;
> > -#else
> > +}
> > +#endif
> > +
> > +static inline efd_value_t
> > +efd_lookup_internal_avx2_DEFAULT(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> > +{
> >  	RTE_SET_USED(group_hash_idx);
> >  	RTE_SET_USED(group_lookup_table);
> >  	RTE_SET_USED(hash_val_a);
> >  	RTE_SET_USED(hash_val_b);
> >  	/* Return dummy value, only to avoid compilation breakage */
> >  	return 0;
> > -#endif
> > +}
> >
> > +static void __attribute__((constructor))
> > +rte_efd_x86_init(void)
> > +{
> > +#ifdef CC_SUPPORT_AVX2
> > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > +		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_AVX2;
> > +	else
> > +		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> > +#else
> > +	efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> > +#endif
> >  }
> > --
> > 2.7.4


* [dpdk-dev] [PATCH v4 0/3] run-time Linking support
  2017-09-26  7:41 [dpdk-dev] [PATCH v3 0/3] dynamic linking support Xiaoyun Li
                   ` (2 preceding siblings ...)
  2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
@ 2017-10-02 16:13 ` Xiaoyun Li
  2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
                     ` (3 more replies)
  3 siblings, 4 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-02 16:13 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patchset dynamically selects functions at run-time based on the CPU flags
that the current machine supports. It modifies memcpy, the memcpy perf test and
x86 EFD to use function pointers that are bound at constructor time.
In a cloud environment, users can then compile once for the minimum target,
such as 'haswell' (not 'native'), run on different platforms (haswell or
above), and still get ISA optimization based on the running CPU.
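
The binding step described above can be sketched roughly as follows. This is a
minimal illustration, not the code in this series: copy_fn_t, copy_impl,
copy_baseline, copy_wide, copy_init and copy_dispatch are hypothetical names,
both implementations are plain memcpy() stand-ins, and the CPU probe uses the
GCC/clang builtin __builtin_cpu_supports() where the patches use
rte_cpu_get_flag_enabled().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of these are DPDK symbols. */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/* Bound once at start-up, then used for every call. */
static copy_fn_t copy_impl;

/* Baseline implementation that any supported CPU can run. */
static void *copy_baseline(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Stand-in for a routine built with wider vector instructions. */
static void *copy_wide(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Runs before main(): probe the CPU once and bind the pointer. */
static void __attribute__((constructor))
copy_init(void)
{
	copy_impl = copy_baseline;
#if defined(__x86_64__) || defined(__i386__)
	/* Needed because the check may run before other constructors. */
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		copy_impl = copy_wide;
#endif
}

/* Callers go through the pointer and never test CPU flags themselves. */
static inline void *copy_dispatch(void *dst, const void *src, size_t n)
{
	assert(copy_impl != NULL);
	return copy_impl(dst, src, n);
}
```

With this shape a single binary built for the minimum target carries every
variant, and the choice is made once per process rather than once per call.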

Xiaoyun Li (3):
  eal/x86: run-time dispatch over memcpy
  app/test: run-time dispatch over memcpy perf test
  efd: run-time dispatch over x86 EFD functions

---
v2
* Use gcc function multi-versioning to avoid compilation issues.
* Add macros for AVX512 and AVX2. The AVX512 code is compiled only if
users enable AVX512 and the compiler supports it. The AVX2 code is
compiled only if the compiler supports AVX2.

v3
* Reduce function calls by keeping only rte_memcpy_xxx.
* Add a condition: when the copy size is small, use the inline code path.
Otherwise, use the dynamic code path.
* To support attribute target, the clang version must be greater than 3.7.
Otherwise, the SSE/AVX code path is chosen, the same as before.
* Move two macro functions to the top of the code since they are
used in both the inline SSE/AVX and dynamic SSE/AVX code.

v4
* Split rte_memcpy.h into several .c files and modify the makefiles to
compile the AVX2 and AVX512 files.

 lib/librte_eal/bsdapp/eal/Makefile                 |  17 +
 .../common/include/arch/x86/rte_memcpy.c           |  59 ++
 .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
 .../common/include/arch/x86/rte_memcpy_avx2.c      | 291 +++++++
 .../common/include/arch/x86/rte_memcpy_avx512f.c   | 316 +++++++
 .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
 .../common/include/arch/x86/rte_memcpy_sse.c       | 585 +++++++++++++
 lib/librte_eal/linuxapp/eal/Makefile               |  17 +
 lib/librte_efd/rte_efd_x86.h                       |  41 +-
 mk/rte.cpuflags.mk                                 |  14 +
 test/test/test_memcpy_perf.c                       |  40 +-
 11 files changed, 2288 insertions(+), 862 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c

-- 
2.7.4


* [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-02 16:13 ` [dpdk-dev] [PATCH v4 0/3] run-time Linking support Xiaoyun Li
@ 2017-10-02 16:13   ` Xiaoyun Li
  2017-10-02 16:39     ` Ananyev, Konstantin
  2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-02 16:13 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch dynamically selects the memcpy functions at run-time based
on the CPU flags that the current machine supports. It uses function
pointers which are bound to the relevant functions at constructor time.
In addition, the AVX512 instruction set is compiled only if users
enable it in the config and the compiler supports it.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
v2
* Use gcc function multi-versioning to avoid compilation issues.
* Add macros for AVX512 and AVX2. The AVX512 code is compiled only if
users enable AVX512 and the compiler supports it. The AVX2 code is
compiled only if the compiler supports AVX2.

v3
* Reduce function calls by keeping only rte_memcpy_xxx.
* Add a condition: when the copy size is small, use the inline code path.
Otherwise, use the dynamic code path.
* To support attribute target, the clang version must be greater than 3.7.
Otherwise, the SSE/AVX code path is chosen, the same as before.
* Move two macro functions to the top of the code since they are
used in both the inline SSE/AVX and dynamic SSE/AVX code.

v4
* Split rte_memcpy.h into several .c files and modify the makefiles to
compile the AVX2 and AVX512 files.

 lib/librte_eal/bsdapp/eal/Makefile                 |  17 +
 .../common/include/arch/x86/rte_memcpy.c           |  59 ++
 .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
 .../common/include/arch/x86/rte_memcpy_avx2.c      | 291 +++++++
 .../common/include/arch/x86/rte_memcpy_avx512f.c   | 316 +++++++
 .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
 .../common/include/arch/x86/rte_memcpy_sse.c       | 585 +++++++++++++
 lib/librte_eal/linuxapp/eal/Makefile               |  17 +
 mk/rte.cpuflags.mk                                 |  14 +
 9 files changed, 2223 insertions(+), 846 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c

diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile
index 005019e..27023c6 100644
--- a/lib/librte_eal/bsdapp/eal/Makefile
+++ b/lib/librte_eal/bsdapp/eal/Makefile
@@ -36,6 +36,7 @@ LIB = librte_eal.a
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
 
 CFLAGS += -I$(SRCDIR)/include
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
@@ -93,6 +94,22 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
new file mode 100644
index 0000000..74ae702
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
@@ -0,0 +1,59 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+#include <rte_cpuflags.h>
+#include <rte_log.h>
+
+void *(*rte_memcpy_ptr)(void *dst, const void *src, size_t n) = NULL;
+
+static void __attribute__((constructor))
+rte_memcpy_init(void)
+{
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
+		rte_memcpy_ptr = rte_memcpy_avx512f;
+		RTE_LOG(DEBUG, EAL, "AVX512 memcpy is in use\n");
+		return;
+	}
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
+		rte_memcpy_ptr = rte_memcpy_avx2;
+		RTE_LOG(DEBUG, EAL, "AVX2 memcpy is in use\n");
+		return;
+	}
+#endif
+	rte_memcpy_ptr = rte_memcpy_sse;
+	RTE_LOG(DEBUG, EAL, "Default SSE/AVX memcpy is in use\n");
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 74c280c..460dcdb 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
  *   All rights reserved.
  *
  *   Redistribution and use in source and binary forms, with or without
@@ -34,867 +34,36 @@
 #ifndef _RTE_MEMCPY_X86_64_H_
 #define _RTE_MEMCPY_X86_64_H_
 
-/**
- * @file
- *
- * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
- */
-
-#include <stdio.h>
-#include <stdint.h>
-#include <string.h>
-#include <rte_vect.h>
-#include <rte_common.h>
+#include <rte_memcpy_internal.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-/**
- * Copy bytes from one location to another. The locations must not overlap.
- *
- * @note This is implemented as a macro, so it's address should not be taken
- * and care is needed as parameter expressions may be evaluated multiple times.
- *
- * @param dst
- *   Pointer to the destination of the data.
- * @param src
- *   Pointer to the source data.
- * @param n
- *   Number of bytes to copy.
- * @return
- *   Pointer to the destination data.
- */
-static __rte_always_inline void *
-rte_memcpy(void *dst, const void *src, size_t n);
-
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define RTE_X86_MEMCPY_THRESH 128
 
-#define ALIGNMENT_MASK 0x3F
+extern void *
+(*rte_memcpy_ptr)(void *dst, const void *src, size_t n);
 
 /**
- * AVX512 implementation below
+ * Different implementations of memcpy.
  */
+extern void*
+rte_memcpy_avx512f(void *dst, const void *src, size_t n);
 
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
+extern void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n);
 
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	__m512i zmm0;
-
-	zmm0 = _mm512_loadu_si512((const void *)src);
-	_mm512_storeu_si512((void *)dst, zmm0);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-	rte_mov64(dst + 2 * 64, src + 2 * 64);
-	rte_mov64(dst + 3 * 64, src + 3 * 64);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1;
-
-	while (n >= 128) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 128;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		src = src + 128;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		dst = dst + 128;
-	}
-}
-
-/**
- * Copy 512-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
-
-	while (n >= 512) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 512;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
-		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
-		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
-		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
-		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
-		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
-		src = src + 512;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
-		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
-		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
-		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
-		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
-		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
-		dst = dst + 512;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				  (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				  (const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK63:
-		if (n > 64) {
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-			return ret;
-		}
-		if (n > 0)
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes
-	 */
-	dstofss = ((uintptr_t)dst & 0x3F);
-	if (dstofss > 0) {
-		dstofss = 64 - dstofss;
-		n -= dstofss;
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 512-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 511;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy 128-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	if (n >= 128) {
-		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-		bits = n;
-		n = n & 127;
-		bits -= n;
-		src = (const uint8_t *)src + bits;
-		dst = (uint8_t *)dst + bits;
-	}
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK63;
-}
-
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-
-#define ALIGNMENT_MASK 0x1F
-
-/**
- * AVX2 implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
-
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
-	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m256i ymm0, ymm1, ymm2, ymm3;
-
-	while (n >= 128) {
-		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 128;
-		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
-		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
-		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
-		src = (const uint8_t *)src + 128;
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
-		dst = (uint8_t *)dst + 128;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 256 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 256) {
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK31:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-		if (n > 32) {
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 256 bytes
-	 */
-	dstofss = (uintptr_t)dst & 0x1F;
-	if (dstofss > 0) {
-		dstofss = 32 - dstofss;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 128-byte blocks
-	 */
-	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 127;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK31;
-}
-
-#else /* RTE_MACHINE_CPUFLAG */
-
-#define ALIGNMENT_MASK 0x0F
-
-/**
- * SSE & AVX implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
-	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
-	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
-	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
-	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
-	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
-	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
-	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
-}
-
-/**
- * Macro for copying unaligned block from one location to another with constant load offset,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be immediate value within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
-__extension__ ({                                                                                            \
-    int tmp;                                                                                                \
-    while (len >= 128 + 16 - offset) {                                                                      \
-        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
-        len -= 128;                                                                                         \
-        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
-        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
-        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
-        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
-        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
-        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
-        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
-        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
-        src = (const uint8_t *)src + 128;                                                                   \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
-        dst = (uint8_t *)dst + 128;                                                                         \
-    }                                                                                                       \
-    tmp = len;                                                                                              \
-    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
-    tmp -= len;                                                                                             \
-    src = (const uint8_t *)src + tmp;                                                                       \
-    dst = (uint8_t *)dst + tmp;                                                                             \
-    if (len >= 32 + 16 - offset) {                                                                          \
-        while (len >= 32 + 16 - offset) {                                                                   \
-            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
-            len -= 32;                                                                                      \
-            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
-            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
-            src = (const uint8_t *)src + 32;                                                                \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
-            dst = (uint8_t *)dst + 32;                                                                      \
-        }                                                                                                   \
-        tmp = len;                                                                                          \
-        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
-        tmp -= len;                                                                                         \
-        src = (const uint8_t *)src + tmp;                                                                   \
-        dst = (uint8_t *)dst + tmp;                                                                         \
-    }                                                                                                       \
-})
-
-/**
- * Macro for copying unaligned block from one location to another,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Use switch here because the aligning instruction requires immediate value for shift count.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
-__extension__ ({                                                      \
-    switch (offset) {                                                 \
-    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
-    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
-    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
-    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
-    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
-    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
-    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
-    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
-    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
-    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
-    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
-    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
-    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
-    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
-    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
-    default:;                                                         \
-    }                                                                 \
-})
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t srcofs;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 128) {
-		goto COPY_BLOCK_128_BACK15;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-COPY_BLOCK_255_BACK15:
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK15:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-COPY_BLOCK_64_BACK15:
-		if (n >= 32) {
-			n -= 32;
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 32;
-			dst = (uint8_t *)dst + 32;
-		}
-		if (n > 16) {
-			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes,
-	 * and make sure the first 15 bytes are copied, because
-	 * unaligned copy functions require up to 15 bytes
-	 * backwards access.
-	 */
-	dstofss = (uintptr_t)dst & 0x0F;
-	if (dstofss > 0) {
-		dstofss = 16 - dstofss + 16;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-	srcofs = ((uintptr_t)src & 0x0F);
-
-	/**
-	 * For aligned copy
-	 */
-	if (srcofs == 0) {
-		/**
-		 * Copy 256-byte blocks
-		 */
-		for (; n >= 256; n -= 256) {
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			dst = (uint8_t *)dst + 256;
-			src = (const uint8_t *)src + 256;
-		}
-
-		/**
-		 * Copy whatever left
-		 */
-		goto COPY_BLOCK_255_BACK15;
-	}
-
-	/**
-	 * For copy with unaligned load
-	 */
-	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_64_BACK15;
-}
-
-#endif /* RTE_MACHINE_CPUFLAG */
-
-static inline void *
-rte_memcpy_aligned(void *dst, const void *src, size_t n)
-{
-	void *ret = dst;
-
-	/* Copy size <= 16 bytes */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dst = *(const uint8_t *)src;
-			src = (const uint8_t *)src + 1;
-			dst = (uint8_t *)dst + 1;
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dst = *(const uint16_t *)src;
-			src = (const uint16_t *)src + 1;
-			dst = (uint16_t *)dst + 1;
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dst = *(const uint32_t *)src;
-			src = (const uint32_t *)src + 1;
-			dst = (uint32_t *)dst + 1;
-		}
-		if (n & 0x08)
-			*(uint64_t *)dst = *(const uint64_t *)src;
-
-		return ret;
-	}
-
-	/* Copy 16 <= size <= 32 bytes */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-
-		return ret;
-	}
-
-	/* Copy 32 < size <= 64 bytes */
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-
-		return ret;
-	}
-
-	/* Copy 64 bytes blocks */
-	for (; n >= 64; n -= 64) {
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;
-	}
-
-	/* Copy whatever left */
-	rte_mov64((uint8_t *)dst - 64 + n,
-			(const uint8_t *)src - 64 + n);
-
-	return ret;
-}
+extern void *
+rte_memcpy_sse(void *dst, const void *src, size_t n);
 
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
-	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-		return rte_memcpy_aligned(dst, src, n);
+	if (n <= RTE_X86_MEMCPY_THRESH)
+		return rte_memcpy_internal(dst, src, n);
 	else
-		return rte_memcpy_generic(dst, src, n);
+		return (*rte_memcpy_ptr)(dst, src, n);
 }
 
 #ifdef __cplusplus
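
Note on the rte_memcpy() hunk above: small copies stay on an inline path, and larger ones go through a function pointer that is bound once at load time. The following is a minimal sketch of that pattern, not the code from this patch. The names COPY_THRESH, cpu_has_avx2, copy_baseline, copy_wide, copy_ptr and copy_init are hypothetical, the threshold value is arbitrary, and both implementations simply call memcpy() so the example stays portable. It assumes GCC or clang for the constructor attribute and the CPU builtins.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold; stands in for RTE_X86_MEMCPY_THRESH. */
#define COPY_THRESH 128

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Placeholder implementations; a real build would use ISA-specific code. */
static void *copy_baseline(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_wide(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Run-time CPU check; reports "no AVX2" on non-x86 targets. */
static int cpu_has_avx2(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx2");
#else
	return 0;
#endif
}

/* Defaults to the baseline so calls made before the constructor are safe. */
static copy_fn copy_ptr = copy_baseline;

/* Bind the pointer once, before main() runs. */
__attribute__((constructor))
static void copy_init(void)
{
	if (cpu_has_avx2())
		copy_ptr = copy_wide;
}

/* Small sizes avoid the indirect call; large sizes use the bound pointer. */
static inline void *copy_dispatch(void *dst, const void *src, size_t n)
{
	if (n <= COPY_THRESH)
		return copy_baseline(dst, src, n);
	return (*copy_ptr)(dst, src, n);
}
```

Initialising the pointer to the baseline rather than NULL means a copy issued from another constructor that happens to run earlier still works.
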
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
new file mode 100644
index 0000000..c83351a
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
@@ -0,0 +1,291 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef CC_SUPPORT_AVX2
+#error CC_SUPPORT_AVX2 not defined
+#endif
+
+void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x1F)) {
+		void *ret = dst;
+
+		/* Copy size < 16 bytes */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dst = *(const uint8_t *)src;
+				src = (const uint8_t *)src + 1;
+				dst = (uint8_t *)dst + 1;
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dst = *(const uint16_t *)src;
+				src = (const uint16_t *)src + 1;
+				dst = (uint16_t *)dst + 1;
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dst = *(const uint32_t *)src;
+				src = (const uint32_t *)src + 1;
+				dst = (uint32_t *)dst + 1;
+			}
+			if (n & 0x08)
+				*(uint64_t *)dst = *(const uint64_t *)src;
+
+			return ret;
+		}
+
+		/* Copy 16 <= size <= 32 bytes */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+
+			return ret;
+		}
+
+		/* Copy 32 < size <= 64 bytes */
+		if (n <= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src - 32 + n));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst - 32 + n), ymm1);
+
+			return ret;
+		}
+
+		/* Copy 64-byte blocks */
+		for (; n >= 64; n -= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 32));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 32), ymm1);
+			dst = (uint8_t *)dst + 64;
+			src = (const uint8_t *)src + 64;
+		}
+
+		/* Copy whatever is left */
+		__m256i ymm0, ymm1;
+		ymm0 = _mm256_loadu_si256((const __m256i *)
+			((const uint8_t *)src - 64 + n));
+		ymm1 = _mm256_loadu_si256((const __m256i *)
+			((const uint8_t *)src - 32 + n));
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst - 64 + n), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst - 32 + n), ymm1);
+
+		return ret;
+	} else {
+		uintptr_t dstu = (uintptr_t)dst;
+		uintptr_t srcu = (uintptr_t)src;
+		void *ret = dst;
+		size_t dstofss;
+		size_t bits;
+
+		/**
+		 * Copy less than 16 bytes
+		 */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dstu = *(const uint8_t *)srcu;
+				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+				dstu = (uintptr_t)((uint8_t *)dstu + 1);
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dstu = *(const uint16_t *)srcu;
+				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+				dstu = (uintptr_t)((uint16_t *)dstu + 1);
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dstu = *(const uint32_t *)srcu;
+				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+				dstu = (uintptr_t)((uint32_t *)dstu + 1);
+			}
+			if (n & 0x08)
+				*(uint64_t *)dstu = *(const uint64_t *)srcu;
+			return ret;
+		}
+
+		/**
+		 * Fast way when copy size doesn't exceed 256 bytes
+		 */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+			return ret;
+		}
+		if (n <= 48) {
+			__m128i xmm0, xmm1, xmm2;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm2);
+			return ret;
+		}
+		if (n <= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src - 32 + n));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst - 32 + n), ymm1);
+			return ret;
+		}
+		if (n <= 256) {
+			if (n >= 128) {
+				n -= 128;
+				__m256i ymm0, ymm1, ymm2, ymm3;
+				ymm0 = _mm256_loadu_si256((const __m256i *)src);
+				ymm1 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src + 32));
+				ymm2 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src + 2*32));
+				ymm3 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src + 3*32));
+				_mm256_storeu_si256((__m256i *)dst, ymm0);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst + 32), ymm1);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst + 2*32), ymm2);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst + 3*32), ymm3);
+				src = (const uint8_t *)src + 128;
+				dst = (uint8_t *)dst + 128;
+			}
+COPY_BLOCK_128_BACK31:
+			if (n >= 64) {
+				n -= 64;
+				__m256i ymm0, ymm1;
+				ymm0 = _mm256_loadu_si256((const __m256i *)src);
+				ymm1 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src + 32));
+				_mm256_storeu_si256((__m256i *)dst, ymm0);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst + 32), ymm1);
+				src = (const uint8_t *)src + 64;
+				dst = (uint8_t *)dst + 64;
+			}
+			if (n > 32) {
+				__m256i ymm0, ymm1;
+				ymm0 = _mm256_loadu_si256((const __m256i *)src);
+				ymm1 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src - 32 + n));
+				_mm256_storeu_si256((__m256i *)dst, ymm0);
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst - 32 + n), ymm1);
+				return ret;
+			}
+			if (n > 0) {
+				__m256i ymm0;
+				ymm0 = _mm256_loadu_si256((const __m256i *)
+					((const uint8_t *)src - 32 + n));
+				_mm256_storeu_si256((__m256i *)
+					((uint8_t *)dst - 32 + n), ymm0);
+			}
+			return ret;
+		}
+
+		/**
+		 * Make store aligned when copy size exceeds 256 bytes
+		 */
+		dstofss = (uintptr_t)dst & 0x1F;
+		if (dstofss > 0) {
+			dstofss = 32 - dstofss;
+			n -= dstofss;
+			__m256i ymm0;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			src = (const uint8_t *)src + dstofss;
+			dst = (uint8_t *)dst + dstofss;
+		}
+
+		/**
+		 * Copy 128-byte blocks
+		 */
+		__m256i ymm0, ymm1, ymm2, ymm3;
+
+		while (n >= 128) {
+			ymm0 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 0 * 32));
+			n -= 128;
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 1 * 32));
+			ymm2 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 2 * 32));
+			ymm3 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 3 * 32));
+			src = (const uint8_t *)src + 128;
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 0 * 32), ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 1 * 32), ymm1);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 2 * 32), ymm2);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst + 3 * 32), ymm3);
+			dst = (uint8_t *)dst + 128;
+		}
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+
+		/**
+		 * Copy whatever is left
+		 */
+		goto COPY_BLOCK_128_BACK31;
+	}
+}
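
Note on the 16..32 byte case in the file above: instead of looping over a remainder, it issues one 16-byte move at the start of the buffer and one that ends exactly at byte n, so the two moves overlap whenever n < 32. The following is a rough sketch of that idea in portable C. The helper names mov16 and copy_16_to_32 are hypothetical, and memcpy() through a temporary stands in for the unaligned 128-bit load/store pair.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for one unaligned 128-bit load plus store. */
static void mov16(uint8_t *dst, const uint8_t *src)
{
	uint8_t tmp[16];

	memcpy(tmp, src, 16);
	memcpy(dst, tmp, 16);
}

/*
 * Copy n bytes, where 16 <= n <= 32, using two 16-byte moves.
 * The second move is anchored at the end of the range, so for
 * n < 32 it rewrites some bytes the first move already copied.
 */
static void *copy_16_to_32(void *dst, const void *src, size_t n)
{
	mov16((uint8_t *)dst, (const uint8_t *)src);
	mov16((uint8_t *)dst + (n - 16), (const uint8_t *)src + (n - 16));
	return dst;
}
```

The overlap is harmless because both moves write the same source bytes to the same destination offsets. The same trick appears with 32-byte and 64-byte moves in the larger size classes.
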
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
new file mode 100644
index 0000000..c8a9d20
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
@@ -0,0 +1,316 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef CC_SUPPORT_AVX512F
+#error CC_SUPPORT_AVX512F not defined
+#endif
+
+void *
+rte_memcpy_avx512f(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x3F)) {
+		void *ret = dst;
+
+		/* Copy size < 16 bytes */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dst = *(const uint8_t *)src;
+				src = (const uint8_t *)src + 1;
+				dst = (uint8_t *)dst + 1;
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dst = *(const uint16_t *)src;
+				src = (const uint16_t *)src + 1;
+				dst = (uint16_t *)dst + 1;
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dst = *(const uint32_t *)src;
+				src = (const uint32_t *)src + 1;
+				dst = (uint32_t *)dst + 1;
+			}
+			if (n & 0x08)
+				*(uint64_t *)dst = *(const uint64_t *)src;
+
+			return ret;
+		}
+
+		/* Copy 16 <= size <= 32 bytes */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+
+			return ret;
+		}
+
+		/* Copy 32 < size <= 64 bytes */
+		if (n <= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src - 32 + n));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst - 32 + n), ymm1);
+
+			return ret;
+		}
+
+		/* Copy 64-byte blocks */
+		for (; n >= 64; n -= 64) {
+			__m512i zmm0;
+			zmm0 = _mm512_loadu_si512((const void *)src);
+			_mm512_storeu_si512((void *)dst, zmm0);
+			dst = (uint8_t *)dst + 64;
+			src = (const uint8_t *)src + 64;
+		}
+
+		/* Copy whatever is left */
+		__m512i zmm0;
+		zmm0 = _mm512_loadu_si512((const void *)
+			((const uint8_t *)src - 64 + n));
+		_mm512_storeu_si512((void *)((uint8_t *)dst - 64 + n), zmm0);
+
+		return ret;
+	} else {
+		uintptr_t dstu = (uintptr_t)dst;
+		uintptr_t srcu = (uintptr_t)src;
+		void *ret = dst;
+		size_t dstofss;
+		size_t bits;
+
+		/**
+		 * Copy less than 16 bytes
+		 */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dstu = *(const uint8_t *)srcu;
+				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+				dstu = (uintptr_t)((uint8_t *)dstu + 1);
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dstu = *(const uint16_t *)srcu;
+				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+				dstu = (uintptr_t)((uint16_t *)dstu + 1);
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dstu = *(const uint32_t *)srcu;
+				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+				dstu = (uintptr_t)((uint32_t *)dstu + 1);
+			}
+			if (n & 0x08)
+				*(uint64_t *)dstu = *(const uint64_t *)srcu;
+			return ret;
+		}
+
+		/**
+		 * Fast way when copy size doesn't exceed 512 bytes
+		 */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+			return ret;
+		}
+		if (n <= 64) {
+			__m256i ymm0, ymm1;
+			ymm0 = _mm256_loadu_si256((const __m256i *)src);
+			ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src - 32 + n));
+			_mm256_storeu_si256((__m256i *)dst, ymm0);
+			_mm256_storeu_si256((__m256i *)
+				((uint8_t *)dst - 32 + n), ymm1);
+			return ret;
+		}
+		if (n <= 512) {
+			if (n >= 256) {
+				n -= 256;
+				__m512i zmm0, zmm1, zmm2, zmm3;
+				zmm0 = _mm512_loadu_si512((const void *)src);
+				zmm1 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 64));
+				zmm2 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 2*64));
+				zmm3 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 3*64));
+				_mm512_storeu_si512((void *)dst, zmm0);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 64), zmm1);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 2*64), zmm2);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 3*64), zmm3);
+				src = (const uint8_t *)src + 256;
+				dst = (uint8_t *)dst + 256;
+			}
+			if (n >= 128) {
+				n -= 128;
+				__m512i zmm0, zmm1;
+				zmm0 = _mm512_loadu_si512((const void *)src);
+				zmm1 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 64));
+				_mm512_storeu_si512((void *)dst, zmm0);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 64), zmm1);
+				src = (const uint8_t *)src + 128;
+				dst = (uint8_t *)dst + 128;
+			}
+COPY_BLOCK_128_BACK63:
+			if (n > 64) {
+				__m512i zmm0, zmm1;
+				zmm0 = _mm512_loadu_si512((const void *)src);
+				zmm1 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src - 64 + n));
+				_mm512_storeu_si512((void *)dst, zmm0);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst - 64 + n), zmm1);
+				return ret;
+			}
+			if (n > 0) {
+				__m512i zmm0;
+				zmm0 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src - 64 + n));
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst - 64 + n), zmm0);
+			}
+			return ret;
+		}
+
+		/**
+		 * Make store aligned when copy size exceeds 512 bytes
+		 */
+		dstofss = ((uintptr_t)dst & 0x3F);
+		if (dstofss > 0) {
+			dstofss = 64 - dstofss;
+			n -= dstofss;
+			__m512i zmm0;
+			zmm0 = _mm512_loadu_si512((const void *)src);
+			_mm512_storeu_si512((void *)dst, zmm0);
+			src = (const uint8_t *)src + dstofss;
+			dst = (uint8_t *)dst + dstofss;
+		}
+
+		/**
+		 * Copy 512-byte blocks.
+		 * The copy is inlined for better instruction order control,
+		 * which is important when load is unaligned.
+		 */
+		__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+		while (n >= 512) {
+			zmm0 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 0 * 64));
+			n -= 512;
+			zmm1 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 1 * 64));
+			zmm2 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 2 * 64));
+			zmm3 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 3 * 64));
+			zmm4 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 4 * 64));
+			zmm5 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 5 * 64));
+			zmm6 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 6 * 64));
+			zmm7 = _mm512_loadu_si512((const void *)
+				((const uint8_t *)src + 7 * 64));
+			src = (const uint8_t *)src + 512;
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 0 * 64), zmm0);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 1 * 64), zmm1);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 2 * 64), zmm2);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 3 * 64), zmm3);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 4 * 64), zmm4);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 5 * 64), zmm5);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 6 * 64), zmm6);
+			_mm512_storeu_si512((void *)
+				((uint8_t *)dst + 7 * 64), zmm7);
+			dst = (uint8_t *)dst + 512;
+		}
+		bits = n;
+		n = n & 511;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+
+		/**
+		 * Copy 128-byte blocks.
+		 * The copy is inlined for better instruction order control,
+		 * which is important when load is unaligned.
+		 */
+		if (n >= 128) {
+			__m512i zmm0, zmm1;
+
+			while (n >= 128) {
+				zmm0 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 0 * 64));
+				n -= 128;
+				zmm1 = _mm512_loadu_si512((const void *)
+					((const uint8_t *)src + 1 * 64));
+				src = (const uint8_t *)src + 128;
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 0 * 64), zmm0);
+				_mm512_storeu_si512((void *)
+					((uint8_t *)dst + 1 * 64), zmm1);
+				dst = (uint8_t *)dst + 128;
+			}
+			bits = n;
+			n = n & 127;
+			bits -= n;
+			src = (const uint8_t *)src + bits;
+			dst = (uint8_t *)dst + bits;
+		}
+
+		/**
+		 * Copy whatever is left
+		 */
+		goto COPY_BLOCK_128_BACK63;
+	}
+}
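
Note on the large-copy path in the file above: it first copies one full vector from the unaligned start, then advances both pointers only as far as the next store-aligned boundary, runs the bulk loop, and finishes with a move anchored at the end of the buffer. The following is a simplified sketch of that sequence under stated assumptions. The names head_bytes and copy_store_aligned are hypothetical, WIDTH is fixed at 64, memcpy() stands in for the 512-bit load/store, the bulk loop moves one vector per iteration instead of eight, and the caller must pass n >= 128.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define WIDTH 64 /* store width in bytes; 64 matches a 512-bit vector */

/* Bytes from dst up to the next WIDTH-aligned address (0 if aligned). */
static size_t head_bytes(const void *dst, size_t width)
{
	size_t ofs = (uintptr_t)dst & (width - 1);

	return ofs ? width - ofs : 0;
}

/* Copy n bytes with aligned bulk stores; sketch only, requires n >= 128. */
static void *copy_store_aligned(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t head = head_bytes(d, WIDTH);

	/* Copy a full vector, but advance only to the aligned boundary. */
	if (head > 0) {
		memcpy(d, s, WIDTH);
		d += head;
		s += head;
		n -= head;
	}

	/* Bulk loop: every store now starts on a WIDTH-aligned address. */
	while (n >= WIDTH) {
		memcpy(d, s, WIDTH);
		d += WIDTH;
		s += WIDTH;
		n -= WIDTH;
	}

	/* Tail: one vector ending at the last byte, overlapping the bulk. */
	if (n > 0)
		memcpy(d + n - WIDTH, s + n - WIDTH, WIDTH);

	return dst;
}
```

Both the head and the tail rewrite bytes that the bulk loop also covers. That redundancy replaces per-byte remainder handling with a couple of extra full-width moves.
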
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
new file mode 100644
index 0000000..d17fb5b
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
@@ -0,0 +1,909 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCPY_INTERNAL_X86_64_H_
+#define _RTE_MEMCPY_INTERNAL_X86_64_H_
+
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <rte_vect.h>
+#include <rte_common.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+#define ALIGNMENT_MASK 0x3F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_loadu_si512((const void *)src);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+	rte_mov64(dst + 2 * 64, src + 2 * 64);
+	rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1;
+
+	while (n >= 128) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 128;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		src = src + 128;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		dst = dst + 128;
+	}
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+	while (n >= 512) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 512;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+		src = src + 512;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+		dst = dst + 512;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				  (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				  (const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK63:
+		if (n > 64) {
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+			return ret;
+		}
+		if (n > 0)
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
+	 */
+	dstofss = ((uintptr_t)dst & 0x3F);
+	if (dstofss > 0) {
+		dstofss = 64 - dstofss;
+		n -= dstofss;
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 512-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 511;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 128-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 128) {
+		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+	}
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK63;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+#define ALIGNMENT_MASK 0x1F
+
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1, ymm2, ymm3;
+
+	while (n >= 128) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 0 * 32));
+		n -= 128;
+		ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 1 * 32));
+		ymm2 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 2 * 32));
+		ymm3 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 3 * 32));
+		src = (const uint8_t *)src + 128;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
+		dst = (uint8_t *)dst + 128;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 256 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 256) {
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK31:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+		if (n > 32) {
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 256 bytes
+	 */
+	dstofss = (uintptr_t)dst & 0x1F;
+	if (dstofss > 0) {
+		dstofss = 32 - dstofss;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 128-byte blocks
+	 */
+	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 127;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK31;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+#define ALIGNMENT_MASK 0x0F
+
+/**
+ * SSE & AVX implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
+	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
+	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
+	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
+	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
+	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
+	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
+	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
+}
+
+/**
+ * Macro for copying unaligned block from one location to another with constant load offset,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be an immediate value within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
+__extension__ ({                                                                                            \
+    size_t tmp;                                                                                             \
+    while (len >= 128 + 16 - offset) {                                                                      \
+        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
+        len -= 128;                                                                                         \
+        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
+        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
+        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
+        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
+        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
+        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
+        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
+        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
+        src = (const uint8_t *)src + 128;                                                                   \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
+        dst = (uint8_t *)dst + 128;                                                                         \
+    }                                                                                                       \
+    tmp = len;                                                                                              \
+    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
+    tmp -= len;                                                                                             \
+    src = (const uint8_t *)src + tmp;                                                                       \
+    dst = (uint8_t *)dst + tmp;                                                                             \
+    if (len >= 32 + 16 - offset) {                                                                          \
+        while (len >= 32 + 16 - offset) {                                                                   \
+            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
+            len -= 32;                                                                                      \
+            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
+            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
+            src = (const uint8_t *)src + 32;                                                                \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
+            dst = (uint8_t *)dst + 32;                                                                      \
+        }                                                                                                   \
+        tmp = len;                                                                                          \
+        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
+        tmp -= len;                                                                                         \
+        src = (const uint8_t *)src + tmp;                                                                   \
+        dst = (uint8_t *)dst + tmp;                                                                         \
+    }                                                                                                       \
+})
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
+__extension__ ({                                                      \
+    switch (offset) {                                                 \
+    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
+    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
+    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
+    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
+    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
+    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
+    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
+    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
+    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
+    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
+    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
+    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
+    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
+    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
+    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
+    default:;                                                         \
+    }                                                                 \
+})
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t srcofs;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 128)
+		goto COPY_BLOCK_128_BACK15;
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov128((uint8_t *)dst + 128,
+					(const uint8_t *)src + 128);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+COPY_BLOCK_255_BACK15:
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK15:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK15:
+		if (n >= 32) {
+			n -= 32;
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 32;
+			dst = (uint8_t *)dst + 32;
+		}
+		if (n > 16) {
+			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes,
+	 * and make sure the first 15 bytes are copied, because
+	 * unaligned copy functions require up to 15 bytes
+	 * backwards access.
+	 */
+	dstofss = (uintptr_t)dst & 0x0F;
+	if (dstofss > 0) {
+		dstofss = 16 - dstofss + 16;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+	srcofs = ((uintptr_t)src & 0x0F);
+
+	/**
+	 * For aligned copy
+	 */
+	if (srcofs == 0) {
+		/**
+		 * Copy 256-byte blocks
+		 */
+		for (; n >= 256; n -= 256) {
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			dst = (uint8_t *)dst + 256;
+			src = (const uint8_t *)src + 256;
+		}
+
+		/**
+		 * Copy whatever is left
+		 */
+		goto COPY_BLOCK_255_BACK15;
+	}
+
+	/**
+	 * For copy with unaligned load
+	 */
+	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_64_BACK15;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+static inline void *
+rte_memcpy_aligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+
+	/* Copy size < 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return ret;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return ret;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return ret;
+	}
+
+	/* Copy 64-byte blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever is left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+
+	return ret;
+}
+
+static inline void *
+rte_memcpy_internal(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+		return rte_memcpy_aligned(dst, src, n);
+	else
+		return rte_memcpy_generic(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCPY_INTERNAL_X86_64_H_ */
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
new file mode 100644
index 0000000..2532696
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
@@ -0,0 +1,585 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+/**
+ * Macro for copying unaligned block from one location to another with constant load offset,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be an immediate value within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
+__extension__ ({                                                                                            \
+    size_t tmp;                                                                                             \
+    while (len >= 128 + 16 - offset) {                                                                      \
+        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
+        len -= 128;                                                                                         \
+        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
+        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
+        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
+        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
+        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
+        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
+        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
+        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
+        src = (const uint8_t *)src + 128;                                                                   \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
+        dst = (uint8_t *)dst + 128;                                                                         \
+    }                                                                                                       \
+    tmp = len;                                                                                              \
+    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
+    tmp -= len;                                                                                             \
+    src = (const uint8_t *)src + tmp;                                                                       \
+    dst = (uint8_t *)dst + tmp;                                                                             \
+    if (len >= 32 + 16 - offset) {                                                                          \
+        while (len >= 32 + 16 - offset) {                                                                   \
+            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
+            len -= 32;                                                                                      \
+            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
+            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
+            src = (const uint8_t *)src + 32;                                                                \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
+            dst = (uint8_t *)dst + 32;                                                                      \
+        }                                                                                                   \
+        tmp = len;                                                                                          \
+        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
+        tmp -= len;                                                                                         \
+        src = (const uint8_t *)src + tmp;                                                                   \
+        dst = (uint8_t *)dst + tmp;                                                                         \
+    }                                                                                                       \
+})
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
+__extension__ ({                                                      \
+    switch (offset) {                                                 \
+    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
+    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
+    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
+    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
+    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
+    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
+    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
+    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
+    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
+    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
+    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
+    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
+    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
+    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
+    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
+    default:;                                                         \
+    }                                                                 \
+})
+
+void *
+rte_memcpy_sse(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x0F)) {
+		void *ret = dst;
+
+		/* Copy size < 16 bytes */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dst = *(const uint8_t *)src;
+				src = (const uint8_t *)src + 1;
+				dst = (uint8_t *)dst + 1;
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dst = *(const uint16_t *)src;
+				src = (const uint16_t *)src + 1;
+				dst = (uint16_t *)dst + 1;
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dst = *(const uint32_t *)src;
+				src = (const uint32_t *)src + 1;
+				dst = (uint32_t *)dst + 1;
+			}
+			if (n & 0x08)
+				*(uint64_t *)dst = *(const uint64_t *)src;
+
+			return ret;
+		}
+
+		/* Copy 16 <= size <= 32 bytes */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+
+			return ret;
+		}
+
+		/* Copy 32 < size <= 64 bytes */
+		if (n <= 64) {
+			__m128i xmm0, xmm1, xmm2, xmm3;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 32 + n));
+			xmm3 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 32 + n), xmm2);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm3);
+
+			return ret;
+		}
+
+		/* Copy 64 bytes blocks */
+		for (; n >= 64; n -= 64) {
+			__m128i xmm0, xmm1, xmm2, xmm3;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 2*16));
+			xmm3 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 3*16));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 2*16), xmm2);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 3*16), xmm3);
+			dst = (uint8_t *)dst + 64;
+			src = (const uint8_t *)src + 64;
+		}
+
+		/* Copy whatever left */
+		__m128i xmm0, xmm1, xmm2, xmm3;
+		xmm0 = _mm_loadu_si128((const __m128i *)
+			((const uint8_t *)src - 64 + n));
+		xmm1 = _mm_loadu_si128((const __m128i *)
+			((const uint8_t *)src - 48 + n));
+		xmm2 = _mm_loadu_si128((const __m128i *)
+			((const uint8_t *)src - 32 + n));
+		xmm3 = _mm_loadu_si128((const __m128i *)
+			((const uint8_t *)src - 16 + n));
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 64 + n), xmm0);
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 48 + n), xmm1);
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 32 + n), xmm2);
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 16 + n), xmm3);
+
+		return ret;
+	} else {
+		__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+		uintptr_t dstu = (uintptr_t)dst;
+		uintptr_t srcu = (uintptr_t)src;
+		void *ret = dst;
+		size_t dstofss;
+		size_t srcofs;
+
+		/**
+		 * Copy less than 16 bytes
+		 */
+		if (n < 16) {
+			if (n & 0x01) {
+				*(uint8_t *)dstu = *(const uint8_t *)srcu;
+				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+				dstu = (uintptr_t)((uint8_t *)dstu + 1);
+			}
+			if (n & 0x02) {
+				*(uint16_t *)dstu = *(const uint16_t *)srcu;
+				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+				dstu = (uintptr_t)((uint16_t *)dstu + 1);
+			}
+			if (n & 0x04) {
+				*(uint32_t *)dstu = *(const uint32_t *)srcu;
+				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+				dstu = (uintptr_t)((uint32_t *)dstu + 1);
+			}
+			if (n & 0x08)
+				*(uint64_t *)dstu = *(const uint64_t *)srcu;
+			return ret;
+		}
+
+		/**
+		 * Fast way when copy size doesn't exceed 512 bytes
+		 */
+		if (n <= 32) {
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm1);
+			return ret;
+		}
+		if (n <= 48) {
+			__m128i xmm0, xmm1, xmm2;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm2);
+			return ret;
+		}
+		if (n <= 64) {
+			__m128i xmm0, xmm1, xmm2, xmm3;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			xmm2 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 32));
+			xmm3 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src - 16 + n));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 32), xmm2);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst - 16 + n), xmm3);
+			return ret;
+		}
+		if (n <= 128)
+			goto COPY_BLOCK_128_BACK15;
+		if (n <= 512) {
+			if (n >= 256) {
+				n -= 256;
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 3*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 4*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 5*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 4*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 5*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 6*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 7*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 6*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 7*16), xmm1);
+
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 3*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 4*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 5*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 4*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 5*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 6*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 128 + 7*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 6*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 128 + 7*16), xmm1);
+				src = (const uint8_t *)src + 256;
+				dst = (uint8_t *)dst + 256;
+			}
+COPY_BLOCK_255_BACK15:
+			if (n >= 128) {
+				n -= 128;
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 3*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 4*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 5*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 4*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 5*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 6*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 7*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 6*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 7*16), xmm1);
+				src = (const uint8_t *)src + 128;
+				dst = (uint8_t *)dst + 128;
+			}
+COPY_BLOCK_128_BACK15:
+			if (n >= 64) {
+				n -= 64;
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 3*16), xmm1);
+				src = (const uint8_t *)src + 64;
+				dst = (uint8_t *)dst + 64;
+			}
+COPY_BLOCK_64_BACK15:
+			if (n >= 32) {
+				n -= 32;
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+				src = (const uint8_t *)src + 32;
+				dst = (uint8_t *)dst + 32;
+			}
+			if (n > 16) {
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src - 16 + n));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst - 16 + n), xmm1);
+				return ret;
+			}
+			if (n > 0) {
+				__m128i xmm0;
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src - 16 + n));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst - 16 + n), xmm0);
+			}
+			return ret;
+		}
+
+		/**
+		 * Make store aligned when copy size exceeds 512 bytes,
+		 * and make sure the first 15 bytes are copied, because
+		 * unaligned copy functions require up to 15 bytes
+		 * backwards access.
+		 */
+		dstofss = (uintptr_t)dst & 0x0F;
+		if (dstofss > 0) {
+			dstofss = 16 - dstofss + 16;
+			n -= dstofss;
+			__m128i xmm0, xmm1;
+			xmm0 = _mm_loadu_si128((const __m128i *)src);
+			xmm1 = _mm_loadu_si128((const __m128i *)
+				((const uint8_t *)src + 16));
+			_mm_storeu_si128((__m128i *)dst, xmm0);
+			_mm_storeu_si128((__m128i *)
+				((uint8_t *)dst + 16), xmm1);
+			src = (const uint8_t *)src + dstofss;
+			dst = (uint8_t *)dst + dstofss;
+		}
+		srcofs = ((uintptr_t)src & 0x0F);
+
+		/**
+		 * For aligned copy
+		 */
+		if (srcofs == 0) {
+			/**
+			 * Copy 256-byte blocks
+			 */
+			for (; n >= 256; n -= 256) {
+				__m128i xmm0, xmm1;
+				xmm0 = _mm_loadu_si128((const __m128i *)src);
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 16));
+				_mm_storeu_si128((__m128i *)dst, xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 2*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 3*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 2*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 3*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 4*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 5*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 4*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 5*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 6*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 7*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 6*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 7*16), xmm1);
+
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 8*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 9*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 8*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 9*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 10*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 11*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 10*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 11*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 12*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 13*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 12*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 13*16), xmm1);
+				xmm0 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 14*16));
+				xmm1 = _mm_loadu_si128((const __m128i *)
+					((const uint8_t *)src + 15*16));
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 14*16), xmm0);
+				_mm_storeu_si128((__m128i *)
+					((uint8_t *)dst + 15*16), xmm1);
+				dst = (uint8_t *)dst + 256;
+				src = (const uint8_t *)src + 256;
+			}
+
+			/**
+			 * Copy whatever left
+			 */
+			goto COPY_BLOCK_255_BACK15;
+		}
+
+		/**
+		 * For copy with unaligned load
+		 */
+		MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+		/**
+		 * Copy whatever left
+		 */
+		goto COPY_BLOCK_64_BACK15;
+	}
+}
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 90bca4d..88d3298 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -40,6 +40,7 @@ VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
 LIBABIVER := 5
 
 VPATH += $(RTE_SDK)/lib/librte_eal/common
+VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
 
 CFLAGS += -I$(SRCDIR)/include
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
@@ -105,6 +106,22 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index a813c91..8a7a1e7 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -134,6 +134,20 @@ endif
 
 MACHINE_CFLAGS += $(addprefix -DRTE_MACHINE_CPUFLAG_,$(CPUFLAGS))
 
+# Check if the compiler supports AVX512
+CC_SUPPORT_AVX512F := $(shell $(CC) -mavx512f -dM -E - < /dev/null 2>&1 | grep -q AVX512 && echo 1)
+ifeq ($(CC_SUPPORT_AVX512F),1)
+ifeq ($(CONFIG_RTE_ENABLE_AVX512),y)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX512F
+endif
+endif
+
+# Check if the compiler supports AVX2
+CC_SUPPORT_AVX2 := $(shell $(CC) -mavx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
+ifeq ($(CC_SUPPORT_AVX2),1)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX2
+endif
+
 # To strip whitespace
 comma:= ,
 empty:=
-- 
2.7.4
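
Note on the small-copy branches of rte_memcpy_sse() above: sizes between
16 and 32 bytes are handled with two 16-byte moves, the second one
anchored at the end of the buffer so that it may overlap the first. The
following is only a minimal sketch of that idea and is not part of the
patch. The names block16, load16(), store16() and copy_16_to_32() are
hypothetical, and the non-SSE2 fallback exists only so the sketch builds
on other targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i block16;

static inline block16
load16(const void *p)
{
	return _mm_loadu_si128((const __m128i *)p);
}

static inline void
store16(void *p, block16 v)
{
	_mm_storeu_si128((__m128i *)p, v);
}
#else
/* Portable fallback so the sketch also builds on non-x86 targets. */
typedef struct { uint8_t b[16]; } block16;

static inline block16
load16(const void *p)
{
	block16 v;

	memcpy(&v, p, sizeof(v));
	return v;
}

static inline void
store16(void *p, block16 v)
{
	memcpy(p, &v, sizeof(v));
}
#endif

/*
 * Copy n bytes, 16 <= n <= 32, with exactly two 16-byte moves.
 * The second move is anchored at the end of the buffer, so for n < 32
 * it overlaps the first one instead of needing a byte-wise tail loop.
 * Source and destination must not overlap each other.
 */
static void *
copy_16_to_32(void *dst, const void *src, size_t n)
{
	block16 head, tail;

	assert(n >= 16 && n <= 32);
	head = load16(src);
	tail = load16((const uint8_t *)src + n - 16);
	store16(dst, head);
	store16((uint8_t *)dst + n - 16, tail);
	return dst;
}
```

The same trick explains the "- 16 + n" address arithmetic in the 32..64
byte branches: every store stays inside [dst, dst + n), so nothing past
the end of the destination is written.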


* [dpdk-dev] [PATCH v4 2/3] app/test: run-time dispatch over memcpy perf test
  2017-10-02 16:13 ` [dpdk-dev] [PATCH v4 0/3] run-time Linking support Xiaoyun Li
  2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-02 16:13   ` Xiaoyun Li
  2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
  2017-10-03 14:59   ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Xiaoyun Li
  3 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-02 16:13 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch moves the selection of the alignment unit from build time
to run time, based on the CPU flags that the machine supports.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 test/test/test_memcpy_perf.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)
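
As an illustration of the run-time selection described in the commit
message above, here is a minimal sketch that is not part of the patch.
It assumes GCC or clang and uses the compiler builtin
__builtin_cpu_supports() as a stand-in for rte_cpu_get_flag_enabled().
The helper names pick_alignment_unit() and align_down() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the widest vector width, in bytes, that
 * the running CPU reports, falling back to 16 when nothing wider is
 * available or when not building for x86.
 */
static size_t
pick_alignment_unit(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f"))
		return 64;
	if (__builtin_cpu_supports("avx2"))
		return 32;
#endif
	return 16;
}

/* Round an offset down to a multiple of a power-of-two alignment. */
static size_t
align_down(size_t offset, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return offset & ~(align - 1);
}
```

Keeping the alignment in a size_t makes the mask in align_down() the same
width as the offset, so the expression does not depend on integer
promotion of a narrower type.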

diff --git a/test/test/test_memcpy_perf.c b/test/test/test_memcpy_perf.c
index ff3aaaa..33def3b 100644
--- a/test/test/test_memcpy_perf.c
+++ b/test/test/test_memcpy_perf.c
@@ -79,13 +79,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
-#define ALIGNMENT_UNIT          64
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-#define ALIGNMENT_UNIT          32
-#else /* RTE_MACHINE_CPUFLAG */
-#define ALIGNMENT_UNIT          16
-#endif /* RTE_MACHINE_CPUFLAG */
+static uint8_t alignment_unit = 16;
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
@@ -100,20 +94,39 @@ static int
 init_buffers(void)
 {
 	unsigned i;
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F))
+		alignment_unit = 64;
+	else
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		alignment_unit = 32;
+	else
+#endif
+		alignment_unit = 16;
 
-	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_read = rte_malloc("memcpy",
+				    LARGE_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (large_buf_read == NULL)
 		goto error_large_buf_read;
 
-	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_write = rte_malloc("memcpy",
+				     LARGE_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (large_buf_write == NULL)
 		goto error_large_buf_write;
 
-	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_read = rte_malloc("memcpy",
+				    SMALL_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (small_buf_read == NULL)
 		goto error_small_buf_read;
 
-	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_write = rte_malloc("memcpy",
+				     SMALL_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (small_buf_write == NULL)
 		goto error_small_buf_write;
 
@@ -153,7 +166,7 @@ static inline size_t
 get_rand_offset(size_t uoffset)
 {
 	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
-			~(ALIGNMENT_UNIT - 1)) + uoffset;
+			~(alignment_unit - 1)) + uoffset;
 }
 
 /* Fill in source and destination addresses. */
@@ -321,7 +334,8 @@ perf_test(void)
 		   "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
 		   "------- -------------- -------------- -------------- --------------");
 
-	printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT);
+	printf("\n========================= %2dB aligned ============================",
+		alignment_unit);
 	/* Do aligned tests where size is a variable */
 	perf_test_variable_aligned();
 	printf("\n------- -------------- -------------- -------------- --------------");
-- 
2.7.4


* [dpdk-dev] [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-02 16:13 ` [dpdk-dev] [PATCH v4 0/3] run-time Linking support Xiaoyun Li
  2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
  2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
@ 2017-10-02 16:13   ` Xiaoyun Li
  2017-10-02 16:52     ` Ananyev, Konstantin
  2017-10-03 14:59   ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Xiaoyun Li
  3 siblings, 1 reply; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-02 16:13 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch dynamically selects x86 EFD functions at run-time.
It uses a function pointer and binds it to the relevant
function, based on CPU flags, at constructor time.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_efd/rte_efd_x86.h | 41 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 38 insertions(+), 3 deletions(-)
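
The binding scheme described in the commit message above can be sketched
as follows. This is only an illustration under stated assumptions, not
the EFD code: it assumes GCC or clang, uses __builtin_cpu_supports() in
place of rte_cpu_get_flag_enabled(), and all names (sum_fn_t,
sum_default, sum_avx2, sum_ptr, sum_init, sum) are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t (*sum_fn_t)(const uint32_t *v, unsigned int n);

/* Baseline variant, built with the default target flags. */
static uint32_t
sum_default(const uint32_t *v, unsigned int n)
{
	uint32_t s = 0;
	unsigned int i;

	for (i = 0; i < n; i++)
		s += v[i];
	return s;
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same source, but the compiler is allowed to emit AVX2 code here. */
__attribute__((target("avx2")))
static uint32_t
sum_avx2(const uint32_t *v, unsigned int n)
{
	uint32_t s = 0;
	unsigned int i;

	for (i = 0; i < n; i++)
		s += v[i];
	return s;
}
#endif

/* Start with the baseline so calls made before the constructor are safe. */
static sum_fn_t sum_ptr = sum_default;

/* Runs before main(); rebinds the pointer if the CPU reports AVX2. */
static void __attribute__((constructor))
sum_init(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		sum_ptr = sum_avx2;
#endif
}

/* Public entry point: one indirect call through the bound pointer. */
static inline uint32_t
sum(const uint32_t *v, unsigned int n)
{
	return sum_ptr(v, n);
}
```

Callers only ever use sum(); which variant runs is decided once, at load
time, at the cost of an indirect call per invocation.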

diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
index 34f37d7..93b6743 100644
--- a/lib/librte_efd/rte_efd_x86.h
+++ b/lib/librte_efd/rte_efd_x86.h
@@ -43,12 +43,29 @@
 #define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
 #endif
 
+typedef efd_value_t
+(*efd_lookup_internal_avx2_t)(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b);
+
+static efd_lookup_internal_avx2_t efd_lookup_internal_avx2_ptr;
+
 static inline efd_value_t
 efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
 		const efd_lookuptbl_t *group_lookup_table,
 		const uint32_t hash_val_a, const uint32_t hash_val_b)
 {
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
+	return (*efd_lookup_internal_avx2_ptr)(group_hash_idx,
+					       group_lookup_table,
+					       hash_val_a, hash_val_b);
+}
+
+#ifdef CC_SUPPORT_AVX2
+static inline efd_value_t
+efd_lookup_internal_avx2_AVX2(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b)
+{
 	efd_value_t value = 0;
 	uint32_t i = 0;
 	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
@@ -74,13 +91,31 @@ efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
 	}
 
 	return value;
-#else
+}
+#endif
+
+static inline efd_value_t
+efd_lookup_internal_avx2_DEFAULT(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b)
+{
 	RTE_SET_USED(group_hash_idx);
 	RTE_SET_USED(group_lookup_table);
 	RTE_SET_USED(hash_val_a);
 	RTE_SET_USED(hash_val_b);
 	/* Return dummy value, only to avoid compilation breakage */
 	return 0;
-#endif
+}
 
+static void __attribute__((constructor))
+rte_efd_x86_init(void)
+{
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_AVX2;
+	else
+		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
+#else
+	efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
+#endif
 }
-- 
2.7.4


* Re: [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-02 16:39     ` Ananyev, Konstantin
  2017-10-02 23:10       ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-02 16:39 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev



> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Monday, October 2, 2017 5:13 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun <xiaoyun.li@intel.com>
> Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> 
> This patch dynamically selects functions of memcpy at run-time based
> on CPU flags that the current machine supports. This patch uses function
> pointers which are bound to the relevant functions at constructor time.
> In addition, the AVX512 instruction set is compiled only if users
> enable it in the config and the compiler supports it.
> 
> Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> ---
> v2
> * Use gcc function multi-versioning to avoid compilation issues.
> * Add macros for AVX512 and AVX2. The AVX512 code is compiled only if
> users enable AVX512 and the compiler supports it. The AVX2 code is
> compiled only if the compiler supports AVX2.
> 
> v3
> * Reduce function calls by keeping only rte_memcpy_xxx.
> * Use the inline code path when the copy size is small;
> otherwise, use the dynamic code path.
> * To support attribute target, clang version must be greater than 3.7.
> Otherwise, the SSE/AVX code path is chosen, the same as before.
> * Move two macro functions to the top of the code since they are
> used in both the inline and the dynamic SSE/AVX code.
> 
> v4
> * Split rte_memcpy.h into several .c files and modify makefiles to compile
> AVX2 and AVX512 files.

Could you explain to me why, instead of reusing the existing rte_memcpy() code
to generate the _sse/_avx2/_avx512f flavors, you keep pushing changes with 3 separate implementations?
Obviously that is much more expensive in terms of maintenance and doesn't look like
a feasible solution to me.
Is the existing rte_memcpy() implementation not good enough in terms of functionality and/or performance?
If so, can you outline these problems and try to fix them first?
Konstantin
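
One possible reading of the suggestion above, given purely as a sketch
and not taken from the thread: keep a single generic body and let the
compiler generate one flavor per ISA by inlining it into thin wrappers
that carry different target attributes. It assumes GCC or clang. The
byte loop is only a stand-in for the existing rte_memcpy() code, and the
names copy_generic, copy_baseline, copy_avx2, copy_avx512f and
copy_dispatch are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Single source of truth; always inlined into each flavor below. */
static inline __attribute__((always_inline)) void *
copy_generic(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t i;

	for (i = 0; i < n; i++)
		d[i] = s[i];
	return dst;
}

/* Flavor built with the default target flags. */
static void *
copy_baseline(void *dst, const void *src, size_t n)
{
	return copy_generic(dst, src, n);
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
/* Same body, but code generation may use AVX2 here. */
__attribute__((target("avx2")))
static void *
copy_avx2(void *dst, const void *src, size_t n)
{
	return copy_generic(dst, src, n);
}

/* Same body, but code generation may use AVX512F here. */
__attribute__((target("avx512f")))
static void *
copy_avx512f(void *dst, const void *src, size_t n)
{
	return copy_generic(dst, src, n);
}
#endif

/* Pick a flavor from what the running CPU reports. */
static void *
copy_dispatch(void *dst, const void *src, size_t n)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f"))
		return copy_avx512f(dst, src, n);
	if (__builtin_cpu_supports("avx2"))
		return copy_avx2(dst, src, n);
#endif
	return copy_baseline(dst, src, n);
}
```

With this layout a fix to copy_generic() reaches every flavor at once,
which is the maintenance concern raised in the reply.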

> 
>  lib/librte_eal/bsdapp/eal/Makefile                 |  17 +
>  .../common/include/arch/x86/rte_memcpy.c           |  59 ++
>  .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
>  .../common/include/arch/x86/rte_memcpy_avx2.c      | 291 +++++++
>  .../common/include/arch/x86/rte_memcpy_avx512f.c   | 316 +++++++
>  .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
>  .../common/include/arch/x86/rte_memcpy_sse.c       | 585 +++++++++++++
>  lib/librte_eal/linuxapp/eal/Makefile               |  17 +
>  mk/rte.cpuflags.mk                                 |  14 +
>  9 files changed, 2223 insertions(+), 846 deletions(-)
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
> 
> diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile
> index 005019e..27023c6 100644
> --- a/lib/librte_eal/bsdapp/eal/Makefile
> +++ b/lib/librte_eal/bsdapp/eal/Makefile
> @@ -36,6 +36,7 @@ LIB = librte_eal.a
>  ARCH_DIR ?= $(RTE_ARCH)
>  VPATH += $(RTE_SDK)/lib/librte_eal/common
>  VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
> +VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
> 
>  CFLAGS += -I$(SRCDIR)/include
>  CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
> @@ -93,6 +94,22 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_service.c
>  SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_cpuflags.c
>  SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
> 
> +# for run-time dispatch of memcpy
> +SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
> +SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
> +
> +# if the compiler supports AVX512, add avx512 file
> +ifneq ($(filter $(MACHINE_CFLAGS),CC_SUPPORT_AVX512F),)
> +SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
> +CFLAGS_rte_memcpy_avx512f.o += -mavx512f
> +endif
> +
> +# if the compiler supports AVX2, add avx2 file
> +ifneq ($(filter $(MACHINE_CFLAGS),CC_SUPPORT_AVX2),)
> +SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
> +CFLAGS_rte_memcpy_avx2.o += -mavx2
> +endif
> +
>  CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
> 
>  CFLAGS_eal.o := -D_GNU_SOURCE
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
> new file mode 100644
> index 0000000..74ae702
> --- /dev/null
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
> @@ -0,0 +1,59 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_memcpy.h>
> +#include <rte_cpuflags.h>
> +#include <rte_log.h>
> +
> +void *(*rte_memcpy_ptr)(void *dst, const void *src, size_t n) = NULL;
> +
> +static void __attribute__((constructor))
> +rte_memcpy_init(void)
> +{
> +#ifdef CC_SUPPORT_AVX512F
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
> +		rte_memcpy_ptr = rte_memcpy_avx512f;
> +		RTE_LOG(DEBUG, EAL, "AVX512 memcpy is using!\n");
> +		return;
> +	}
> +#endif
> +#ifdef CC_SUPPORT_AVX2
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
> +		rte_memcpy_ptr = rte_memcpy_avx2;
> +		RTE_LOG(DEBUG, EAL, "AVX2 memcpy is using!\n");
> +		return;
> +	}
> +#endif
> +	rte_memcpy_ptr = rte_memcpy_sse;
> +	RTE_LOG(DEBUG, EAL, "Default SSE/AVX memcpy is using!\n");
> +}
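For reference, the dispatch pattern in the quoted file can be reduced to a minimal sketch. It assumes GCC or Clang (for `__attribute__((constructor))` and `__builtin_cpu_supports`). Every name in it is hypothetical and not a DPDK symbol, and the three flavors are stand-ins that simply call libc `memcpy`. The small-copy cutoff follows the v3 changelog note; whether the patch compares with `<` or `<=` is not visible here, so `<` is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* All names here are hypothetical; none are DPDK symbols. */
#define SKETCH_THRESH 128 /* same value as RTE_X86_MEMCPY_THRESH in the patch */

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Records which flavor ran last, so the stand-ins are distinguishable. */
static const char *last_flavor = "none";

/* Stand-ins for the per-ISA builds; each simply defers to libc memcpy. */
static void *copy_sse(void *d, const void *s, size_t n)
{
	last_flavor = "sse";
	return memcpy(d, s, n);
}

static void *copy_avx2(void *d, const void *s, size_t n)
{
	last_flavor = "avx2";
	return memcpy(d, s, n);
}

static void *copy_avx512f(void *d, const void *s, size_t n)
{
	last_flavor = "avx512f";
	return memcpy(d, s, n);
}

/* Pick the widest flavor the CPU reports; SSE is the fallback. */
static copy_fn select_copy(int has_avx512f, int has_avx2)
{
	if (has_avx512f)
		return copy_avx512f;
	if (has_avx2)
		return copy_avx2;
	return copy_sse;
}

static copy_fn copy_ptr;

/* Runs before main(), like rte_memcpy_init() in the quoted file. */
static void __attribute__((constructor))
copy_init(void)
{
	int avx512f = 0, avx2 = 0;
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	avx512f = __builtin_cpu_supports("avx512f") != 0;
	avx2 = __builtin_cpu_supports("avx2") != 0;
#endif
	copy_ptr = select_copy(avx512f, avx2);
}

/* Small copies stay inline; larger ones take the indirect call. */
static inline void *copy_dispatch(void *dst, const void *src, size_t n)
{
	if (n < SKETCH_THRESH) { /* assumption: strict '<' */
		unsigned char *d = dst;
		const unsigned char *s = src;
		size_t i;

		for (i = 0; i < n; i++)
			d[i] = s[i];
		return dst;
	}
	assert(copy_ptr != NULL);
	return copy_ptr(dst, src, n);
}
```

Keeping the selection logic in `select_copy()` separate from the CPU query makes the priority order testable on any machine, independent of which flags the host actually reports.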
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> index 74c280c..460dcdb 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> @@ -1,7 +1,7 @@
>  /*-
>   *   BSD LICENSE
>   *
> - *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
>   *   All rights reserved.
>   *
>   *   Redistribution and use in source and binary forms, with or without
> @@ -34,867 +34,36 @@
>  #ifndef _RTE_MEMCPY_X86_64_H_
>  #define _RTE_MEMCPY_X86_64_H_
> 
> -/**
> - * @file
> - *
> - * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
> - */
> -
> -#include <stdio.h>
> -#include <stdint.h>
> -#include <string.h>
> -#include <rte_vect.h>
> -#include <rte_common.h>
> +#include <rte_memcpy_internal.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
>  #endif
> 
> -/**
> - * Copy bytes from one location to another. The locations must not overlap.
> - *
> - * @note This is implemented as a macro, so it's address should not be taken
> - * and care is needed as parameter expressions may be evaluated multiple times.
> - *
> - * @param dst
> - *   Pointer to the destination of the data.
> - * @param src
> - *   Pointer to the source data.
> - * @param n
> - *   Number of bytes to copy.
> - * @return
> - *   Pointer to the destination data.
> - */
> -static __rte_always_inline void *
> -rte_memcpy(void *dst, const void *src, size_t n);
> -
> -#ifdef RTE_MACHINE_CPUFLAG_AVX512F
> +#define RTE_X86_MEMCPY_THRESH 128
> 
> -#define ALIGNMENT_MASK 0x3F
> +extern void *
> +(*rte_memcpy_ptr)(void *dst, const void *src, size_t n);
> 
>  /**
> - * AVX512 implementation below
> + * Different implementations of memcpy.
>   */
> +extern void*
> +rte_memcpy_avx512f(void *dst, const void *src, size_t n);
> 
> -/**
> - * Copy 16 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov16(uint8_t *dst, const uint8_t *src)
> -{
> -	__m128i xmm0;
> -
> -	xmm0 = _mm_loadu_si128((const __m128i *)src);
> -	_mm_storeu_si128((__m128i *)dst, xmm0);
> -}
> -
> -/**
> - * Copy 32 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov32(uint8_t *dst, const uint8_t *src)
> -{
> -	__m256i ymm0;
> +extern void *
> +rte_memcpy_avx2(void *dst, const void *src, size_t n);
> 
> -	ymm0 = _mm256_loadu_si256((const __m256i *)src);
> -	_mm256_storeu_si256((__m256i *)dst, ymm0);
> -}
> -
> -/**
> - * Copy 64 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov64(uint8_t *dst, const uint8_t *src)
> -{
> -	__m512i zmm0;
> -
> -	zmm0 = _mm512_loadu_si512((const void *)src);
> -	_mm512_storeu_si512((void *)dst, zmm0);
> -}
> -
> -/**
> - * Copy 128 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov128(uint8_t *dst, const uint8_t *src)
> -{
> -	rte_mov64(dst + 0 * 64, src + 0 * 64);
> -	rte_mov64(dst + 1 * 64, src + 1 * 64);
> -}
> -
> -/**
> - * Copy 256 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov256(uint8_t *dst, const uint8_t *src)
> -{
> -	rte_mov64(dst + 0 * 64, src + 0 * 64);
> -	rte_mov64(dst + 1 * 64, src + 1 * 64);
> -	rte_mov64(dst + 2 * 64, src + 2 * 64);
> -	rte_mov64(dst + 3 * 64, src + 3 * 64);
> -}
> -
> -/**
> - * Copy 128-byte blocks from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
> -{
> -	__m512i zmm0, zmm1;
> -
> -	while (n >= 128) {
> -		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
> -		n -= 128;
> -		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
> -		src = src + 128;
> -		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
> -		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
> -		dst = dst + 128;
> -	}
> -}
> -
> -/**
> - * Copy 512-byte blocks from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
> -{
> -	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
> -
> -	while (n >= 512) {
> -		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
> -		n -= 512;
> -		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
> -		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
> -		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
> -		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
> -		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
> -		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
> -		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
> -		src = src + 512;
> -		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
> -		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
> -		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
> -		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
> -		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
> -		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
> -		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
> -		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
> -		dst = dst + 512;
> -	}
> -}
> -
> -static inline void *
> -rte_memcpy_generic(void *dst, const void *src, size_t n)
> -{
> -	uintptr_t dstu = (uintptr_t)dst;
> -	uintptr_t srcu = (uintptr_t)src;
> -	void *ret = dst;
> -	size_t dstofss;
> -	size_t bits;
> -
> -	/**
> -	 * Copy less than 16 bytes
> -	 */
> -	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> -			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> -			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> -			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -		}
> -		if (n & 0x08)
> -			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> -		return ret;
> -	}
> -
> -	/**
> -	 * Fast way when copy size doesn't exceed 512 bytes
> -	 */
> -	if (n <= 32) {
> -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov16((uint8_t *)dst - 16 + n,
> -				  (const uint8_t *)src - 16 + n);
> -		return ret;
> -	}
> -	if (n <= 64) {
> -		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov32((uint8_t *)dst - 32 + n,
> -				  (const uint8_t *)src - 32 + n);
> -		return ret;
> -	}
> -	if (n <= 512) {
> -		if (n >= 256) {
> -			n -= 256;
> -			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> -			src = (const uint8_t *)src + 256;
> -			dst = (uint8_t *)dst + 256;
> -		}
> -		if (n >= 128) {
> -			n -= 128;
> -			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> -			src = (const uint8_t *)src + 128;
> -			dst = (uint8_t *)dst + 128;
> -		}
> -COPY_BLOCK_128_BACK63:
> -		if (n > 64) {
> -			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> -			rte_mov64((uint8_t *)dst - 64 + n,
> -					  (const uint8_t *)src - 64 + n);
> -			return ret;
> -		}
> -		if (n > 0)
> -			rte_mov64((uint8_t *)dst - 64 + n,
> -					  (const uint8_t *)src - 64 + n);
> -		return ret;
> -	}
> -
> -	/**
> -	 * Make store aligned when copy size exceeds 512 bytes
> -	 */
> -	dstofss = ((uintptr_t)dst & 0x3F);
> -	if (dstofss > 0) {
> -		dstofss = 64 - dstofss;
> -		n -= dstofss;
> -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> -		src = (const uint8_t *)src + dstofss;
> -		dst = (uint8_t *)dst + dstofss;
> -	}
> -
> -	/**
> -	 * Copy 512-byte blocks.
> -	 * Use copy block function for better instruction order control,
> -	 * which is important when load is unaligned.
> -	 */
> -	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
> -	bits = n;
> -	n = n & 511;
> -	bits -= n;
> -	src = (const uint8_t *)src + bits;
> -	dst = (uint8_t *)dst + bits;
> -
> -	/**
> -	 * Copy 128-byte blocks.
> -	 * Use copy block function for better instruction order control,
> -	 * which is important when load is unaligned.
> -	 */
> -	if (n >= 128) {
> -		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
> -		bits = n;
> -		n = n & 127;
> -		bits -= n;
> -		src = (const uint8_t *)src + bits;
> -		dst = (uint8_t *)dst + bits;
> -	}
> -
> -	/**
> -	 * Copy whatever left
> -	 */
> -	goto COPY_BLOCK_128_BACK63;
> -}
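The two small-size tricks in the removed function above can be shown in portable C. This is a sketch only: the helper names are hypothetical, `memcpy` with a constant size stands in for the vector load/store pair, and only the `n < 16` and `16 <= n <= 32` cases are covered. For `n < 16`, each set bit of `n` selects one fixed-size move (1 + 2 + 4 + 8 covers 0..15). For `16 <= n <= 32`, two 16-byte moves are issued, one from the start and one ending at byte `n`, and they are allowed to overlap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size move; stands in for one unaligned 16-byte load/store pair. */
static inline void mov16(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 16);
}

/* Copy n < 16 bytes: each set bit of n selects one fixed-size move. */
static void copy_lt16(uint8_t *dst, const uint8_t *src, size_t n)
{
	assert(n < 16);
	if (n & 1) { memcpy(dst, src, 1); dst += 1; src += 1; }
	if (n & 2) { memcpy(dst, src, 2); dst += 2; src += 2; }
	if (n & 4) { memcpy(dst, src, 4); dst += 4; src += 4; }
	if (n & 8) { memcpy(dst, src, 8); }
}

/*
 * Copy 16 <= n <= 32 bytes with exactly two 16-byte moves.
 * The second move ends at byte n, so the two moves overlap by
 * 32 - n bytes; the overlapping bytes are just written twice.
 */
static void copy_16_to_32(uint8_t *dst, const uint8_t *src, size_t n)
{
	assert(n >= 16 && n <= 32);
	mov16(dst, src);
	mov16(dst + n - 16, src + n - 16);
}

/* Hypothetical wrapper tying the two cases together. */
static void *copy_upto32(void *dst, const void *src, size_t n)
{
	if (n < 16)
		copy_lt16(dst, src, n);
	else
		copy_16_to_32(dst, src, n);
	return dst;
}
```

The overlapping second move avoids a loop or a chain of branches for the tail, at the cost of storing a few bytes twice.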
> -
> -#elif defined RTE_MACHINE_CPUFLAG_AVX2
> -
> -#define ALIGNMENT_MASK 0x1F
> -
> -/**
> - * AVX2 implementation below
> - */
> -
> -/**
> - * Copy 16 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov16(uint8_t *dst, const uint8_t *src)
> -{
> -	__m128i xmm0;
> -
> -	xmm0 = _mm_loadu_si128((const __m128i *)src);
> -	_mm_storeu_si128((__m128i *)dst, xmm0);
> -}
> -
> -/**
> - * Copy 32 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov32(uint8_t *dst, const uint8_t *src)
> -{
> -	__m256i ymm0;
> -
> -	ymm0 = _mm256_loadu_si256((const __m256i *)src);
> -	_mm256_storeu_si256((__m256i *)dst, ymm0);
> -}
> -
> -/**
> - * Copy 64 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov64(uint8_t *dst, const uint8_t *src)
> -{
> -	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> -	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> -}
> -
> -/**
> - * Copy 128 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov128(uint8_t *dst, const uint8_t *src)
> -{
> -	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> -	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> -	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> -	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> -}
> -
> -/**
> - * Copy 128-byte blocks from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
> -{
> -	__m256i ymm0, ymm1, ymm2, ymm3;
> -
> -	while (n >= 128) {
> -		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
> -		n -= 128;
> -		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
> -		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
> -		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
> -		src = (const uint8_t *)src + 128;
> -		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
> -		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
> -		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
> -		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
> -		dst = (uint8_t *)dst + 128;
> -	}
> -}
> -
> -static inline void *
> -rte_memcpy_generic(void *dst, const void *src, size_t n)
> -{
> -	uintptr_t dstu = (uintptr_t)dst;
> -	uintptr_t srcu = (uintptr_t)src;
> -	void *ret = dst;
> -	size_t dstofss;
> -	size_t bits;
> -
> -	/**
> -	 * Copy less than 16 bytes
> -	 */
> -	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> -			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> -			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> -			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -		}
> -		if (n & 0x08) {
> -			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> -		}
> -		return ret;
> -	}
> -
> -	/**
> -	 * Fast way when copy size doesn't exceed 256 bytes
> -	 */
> -	if (n <= 32) {
> -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov16((uint8_t *)dst - 16 + n,
> -				(const uint8_t *)src - 16 + n);
> -		return ret;
> -	}
> -	if (n <= 48) {
> -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
> -		rte_mov16((uint8_t *)dst - 16 + n,
> -				(const uint8_t *)src - 16 + n);
> -		return ret;
> -	}
> -	if (n <= 64) {
> -		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov32((uint8_t *)dst - 32 + n,
> -				(const uint8_t *)src - 32 + n);
> -		return ret;
> -	}
> -	if (n <= 256) {
> -		if (n >= 128) {
> -			n -= 128;
> -			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> -			src = (const uint8_t *)src + 128;
> -			dst = (uint8_t *)dst + 128;
> -		}
> -COPY_BLOCK_128_BACK31:
> -		if (n >= 64) {
> -			n -= 64;
> -			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> -			src = (const uint8_t *)src + 64;
> -			dst = (uint8_t *)dst + 64;
> -		}
> -		if (n > 32) {
> -			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> -			rte_mov32((uint8_t *)dst - 32 + n,
> -					(const uint8_t *)src - 32 + n);
> -			return ret;
> -		}
> -		if (n > 0) {
> -			rte_mov32((uint8_t *)dst - 32 + n,
> -					(const uint8_t *)src - 32 + n);
> -		}
> -		return ret;
> -	}
> -
> -	/**
> -	 * Make store aligned when copy size exceeds 256 bytes
> -	 */
> -	dstofss = (uintptr_t)dst & 0x1F;
> -	if (dstofss > 0) {
> -		dstofss = 32 - dstofss;
> -		n -= dstofss;
> -		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> -		src = (const uint8_t *)src + dstofss;
> -		dst = (uint8_t *)dst + dstofss;
> -	}
> -
> -	/**
> -	 * Copy 128-byte blocks
> -	 */
> -	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
> -	bits = n;
> -	n = n & 127;
> -	bits -= n;
> -	src = (const uint8_t *)src + bits;
> -	dst = (uint8_t *)dst + bits;
> -
> -	/**
> -	 * Copy whatever left
> -	 */
> -	goto COPY_BLOCK_128_BACK31;
> -}
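The "make store aligned" step in the removed AVX2 function can be sketched in portable C as follows. The names are hypothetical, `memcpy` stands in for the vector moves, and the tail is a plain `memcpy` rather than the `COPY_BLOCK_128_BACK31` sequence. One unaligned 32-byte move covers the head, then both pointers advance only by the distance to the next 32-byte boundary of the destination, so every later store is aligned and a few bytes are written twice.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLK 32 /* store width in bytes, as in the AVX2 path */

/* Hypothetical helper; only meant for the large-copy case (n > 256). */
static void *copy_align_dst(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t head;

	assert(n > 256);

	/*
	 * Copy one full block from the unaligned start, then step forward
	 * only as far as the next 32-byte boundary of the destination.
	 * Bytes between that boundary and d + 32 get rewritten by the
	 * block loop with the same values.
	 */
	head = (uintptr_t)d & (BLK - 1);
	if (head > 0) {
		head = BLK - head;
		memcpy(d, s, BLK);
		d += head;
		s += head;
		n -= head;
	}
	assert(((uintptr_t)d & (BLK - 1)) == 0);

	/* Bulk: 128-byte blocks, all stores now 32-byte aligned. */
	while (n >= 128) {
		memcpy(d, s, 128);
		d += 128;
		s += 128;
		n -= 128;
	}

	/* Tail: fewer than 128 bytes remain. */
	memcpy(d, s, n);
	return dst;
}
```

Only the destination is aligned here; the source may remain unaligned, which is why the original code still uses unaligned loads in the block loop.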
> -
> -#else /* RTE_MACHINE_CPUFLAG */
> -
> -#define ALIGNMENT_MASK 0x0F
> -
> -/**
> - * SSE & AVX implementation below
> - */
> -
> -/**
> - * Copy 16 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov16(uint8_t *dst, const uint8_t *src)
> -{
> -	__m128i xmm0;
> -
> -	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
> -	_mm_storeu_si128((__m128i *)dst, xmm0);
> -}
> -
> -/**
> - * Copy 32 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov32(uint8_t *dst, const uint8_t *src)
> -{
> -	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> -	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> -}
> -
> -/**
> - * Copy 64 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov64(uint8_t *dst, const uint8_t *src)
> -{
> -	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> -	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> -	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> -	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> -}
> -
> -/**
> - * Copy 128 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov128(uint8_t *dst, const uint8_t *src)
> -{
> -	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> -	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> -	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> -	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> -	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> -	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> -	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> -	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> -}
> -
> -/**
> - * Copy 256 bytes from one location to another,
> - * locations should not overlap.
> - */
> -static inline void
> -rte_mov256(uint8_t *dst, const uint8_t *src)
> -{
> -	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> -	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> -	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> -	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> -	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> -	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> -	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> -	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> -	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
> -	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
> -	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
> -	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
> -	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
> -	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
> -	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
> -	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
> -}
> -
> -/**
> - * Macro for copying unaligned block from one location to another with constant load offset,
> - * 47 bytes leftover maximum,
> - * locations should not overlap.
> - * Requirements:
> - * - Store is aligned
> - * - Load offset is <offset>, which must be immediate value within [1, 15]
> - * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> - * - <dst>, <src>, <len> must be variables
> - * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> - */
> -#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
> -__extension__ ({                                                                                            \
> -    int tmp;                                                                                                \
> -    while (len >= 128 + 16 - offset) {                                                                      \
> -        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
> -        len -= 128;                                                                                         \
> -        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
> -        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
> -        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
> -        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
> -        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
> -        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
> -        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
> -        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
> -        src = (const uint8_t *)src + 128;                                                                   \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
> -        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
> -        dst = (uint8_t *)dst + 128;                                                                         \
> -    }                                                                                                       \
> -    tmp = len;                                                                                              \
> -    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
> -    tmp -= len;                                                                                             \
> -    src = (const uint8_t *)src + tmp;                                                                       \
> -    dst = (uint8_t *)dst + tmp;                                                                             \
> -    if (len >= 32 + 16 - offset) {                                                                          \
> -        while (len >= 32 + 16 - offset) {                                                                   \
> -            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
> -            len -= 32;                                                                                      \
> -            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
> -            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
> -            src = (const uint8_t *)src + 32;                                                                \
> -            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
> -            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
> -            dst = (uint8_t *)dst + 32;                                                                      \
> -        }                                                                                                   \
> -        tmp = len;                                                                                          \
> -        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
> -        tmp -= len;                                                                                         \
> -        src = (const uint8_t *)src + tmp;                                                                   \
> -        dst = (uint8_t *)dst + tmp;                                                                         \
> -    }                                                                                                       \
> -})
> -
> -/**
> - * Macro for copying unaligned block from one location to another,
> - * 47 bytes leftover maximum,
> - * locations should not overlap.
> - * Use switch here because the aligning instruction requires immediate value for shift count.
> - * Requirements:
> - * - Store is aligned
> - * - Load offset is <offset>, which must be within [1, 15]
> - * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> - * - <dst>, <src>, <len> must be variables
> - * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
> - */
> -#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
> -__extension__ ({                                                      \
> -    switch (offset) {                                                 \
> -    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
> -    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
> -    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
> -    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
> -    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
> -    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
> -    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
> -    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
> -    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
> -    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
> -    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
> -    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
> -    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
> -    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
> -    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
> -    default:;                                                         \
> -    }                                                                 \
> -})
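What the two macros above compute can be modelled in scalar C. This is an illustration, not the SSE code: `alignr_model` is a hypothetical helper that mimics `_mm_alignr_epi8(hi, lo, offset)` by concatenating two 16-byte blocks (low block first), dropping the first `offset` bytes and keeping the next 16. The scalar model accepts a run-time offset; the real intrinsic needs a compile-time immediate, which is why the original code switches over all 15 values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of _mm_alignr_epi8(hi, lo, offset). */
static void alignr_model(uint8_t out[16], const uint8_t hi[16],
			 const uint8_t lo[16], unsigned offset)
{
	uint8_t cat[32];

	assert(offset <= 16);
	memcpy(cat, lo, 16);       /* low half comes first in memory order */
	memcpy(cat + 16, hi, 16);
	memcpy(out, cat + offset, 16);
}

/*
 * Copy `blocks` 16-byte blocks from src, where src sits `offset` bytes
 * past a 16-byte boundary. Loads start at src - offset, so they read
 * `offset` bytes before src and 16 - offset bytes past the end of the
 * copied range; both must be readable, as the macro comment requires.
 */
static void copy_shifted(uint8_t *dst, const uint8_t *src,
			 size_t blocks, unsigned offset)
{
	const uint8_t *base = src - offset;
	size_t i;

	for (i = 0; i < blocks; i++)
		alignr_model(dst + 16 * i,
			     base + 16 * (i + 1), /* hi: next block */
			     base + 16 * i,       /* lo: current block */
			     offset);
}
```

Each output block is stitched from two neighbouring loads at boundary-aligned addresses, which is how the SSE path keeps both loads and stores on 16-byte boundaries when source and destination are misaligned relative to each other.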
> -
> -static inline void *
> -rte_memcpy_generic(void *dst, const void *src, size_t n)
> -{
> -	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> -	uintptr_t dstu = (uintptr_t)dst;
> -	uintptr_t srcu = (uintptr_t)src;
> -	void *ret = dst;
> -	size_t dstofss;
> -	size_t srcofs;
> -
> -	/**
> -	 * Copy less than 16 bytes
> -	 */
> -	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> -			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> -			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> -			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> -			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> -		}
> -		if (n & 0x08) {
> -			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> -		}
> -		return ret;
> -	}
> -
> -	/**
> -	 * Fast way when copy size doesn't exceed 512 bytes
> -	 */
> -	if (n <= 32) {
> -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> -		return ret;
> -	}
> -	if (n <= 48) {
> -		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> -		return ret;
> -	}
> -	if (n <= 64) {
> -		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
> -		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> -		return ret;
> -	}
> -	if (n <= 128) {
> -		goto COPY_BLOCK_128_BACK15;
> -	}
> -	if (n <= 512) {
> -		if (n >= 256) {
> -			n -= 256;
> -			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> -			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
> -			src = (const uint8_t *)src + 256;
> -			dst = (uint8_t *)dst + 256;
> -		}
> -COPY_BLOCK_255_BACK15:
> -		if (n >= 128) {
> -			n -= 128;
> -			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> -			src = (const uint8_t *)src + 128;
> -			dst = (uint8_t *)dst + 128;
> -		}
> -COPY_BLOCK_128_BACK15:
> -		if (n >= 64) {
> -			n -= 64;
> -			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> -			src = (const uint8_t *)src + 64;
> -			dst = (uint8_t *)dst + 64;
> -		}
> -COPY_BLOCK_64_BACK15:
> -		if (n >= 32) {
> -			n -= 32;
> -			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> -			src = (const uint8_t *)src + 32;
> -			dst = (uint8_t *)dst + 32;
> -		}
> -		if (n > 16) {
> -			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> -			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> -			return ret;
> -		}
> -		if (n > 0) {
> -			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> -		}
> -		return ret;
> -	}
> -
> -	/**
> -	 * Make store aligned when copy size exceeds 512 bytes,
> -	 * and make sure the first 15 bytes are copied, because
> -	 * unaligned copy functions require up to 15 bytes
> -	 * backwards access.
> -	 */
> -	dstofss = (uintptr_t)dst & 0x0F;
> -	if (dstofss > 0) {
> -		dstofss = 16 - dstofss + 16;
> -		n -= dstofss;
> -		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> -		src = (const uint8_t *)src + dstofss;
> -		dst = (uint8_t *)dst + dstofss;
> -	}
> -	srcofs = ((uintptr_t)src & 0x0F);
> -
> -	/**
> -	 * For aligned copy
> -	 */
> -	if (srcofs == 0) {
> -		/**
> -		 * Copy 256-byte blocks
> -		 */
> -		for (; n >= 256; n -= 256) {
> -			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> -			dst = (uint8_t *)dst + 256;
> -			src = (const uint8_t *)src + 256;
> -		}
> -
> -		/**
> -		 * Copy whatever left
> -		 */
> -		goto COPY_BLOCK_255_BACK15;
> -	}
> -
> -	/**
> -	 * For copy with unaligned load
> -	 */
> -	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> -
> -	/**
> -	 * Copy whatever left
> -	 */
> -	goto COPY_BLOCK_64_BACK15;
> -}
> -
> -#endif /* RTE_MACHINE_CPUFLAG */
> -
> -static inline void *
> -rte_memcpy_aligned(void *dst, const void *src, size_t n)
> -{
> -	void *ret = dst;
> -
> -	/* Copy size <= 16 bytes */
> -	if (n < 16) {
> -		if (n & 0x01) {
> -			*(uint8_t *)dst = *(const uint8_t *)src;
> -			src = (const uint8_t *)src + 1;
> -			dst = (uint8_t *)dst + 1;
> -		}
> -		if (n & 0x02) {
> -			*(uint16_t *)dst = *(const uint16_t *)src;
> -			src = (const uint16_t *)src + 1;
> -			dst = (uint16_t *)dst + 1;
> -		}
> -		if (n & 0x04) {
> -			*(uint32_t *)dst = *(const uint32_t *)src;
> -			src = (const uint32_t *)src + 1;
> -			dst = (uint32_t *)dst + 1;
> -		}
> -		if (n & 0x08)
> -			*(uint64_t *)dst = *(const uint64_t *)src;
> -
> -		return ret;
> -	}
> -
> -	/* Copy 16 <= size <= 32 bytes */
> -	if (n <= 32) {
> -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov16((uint8_t *)dst - 16 + n,
> -				(const uint8_t *)src - 16 + n);
> -
> -		return ret;
> -	}
> -
> -	/* Copy 32 < size <= 64 bytes */
> -	if (n <= 64) {
> -		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov32((uint8_t *)dst - 32 + n,
> -				(const uint8_t *)src - 32 + n);
> -
> -		return ret;
> -	}
> -
> -	/* Copy 64 bytes blocks */
> -	for (; n >= 64; n -= 64) {
> -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> -		dst = (uint8_t *)dst + 64;
> -		src = (const uint8_t *)src + 64;
> -	}
> -
> -	/* Copy whatever left */
> -	rte_mov64((uint8_t *)dst - 64 + n,
> -			(const uint8_t *)src - 64 + n);
> -
> -	return ret;
> -}
> +extern void *
> +rte_memcpy_sse(void *dst, const void *src, size_t n);
> 
>  static inline void *
>  rte_memcpy(void *dst, const void *src, size_t n)
>  {
> -	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> -		return rte_memcpy_aligned(dst, src, n);
> +	if (n <= RTE_X86_MEMCPY_THRESH)
> +		return rte_memcpy_internal(dst, src, n);
>  	else
> -		return rte_memcpy_generic(dst, src, n);
> +		return (*rte_memcpy_ptr)(dst, src, n);
>  }
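
The size-based dispatch in rte_memcpy() above can be sketched outside DPDK roughly as follows. This is a minimal illustration under stated assumptions, not the patch's code: copy_fn_t, copy_ptr, copy_baseline, copy_wide, select_copy, dispatch_copy and SMALL_COPY_THRESH are hypothetical names, the threshold value is invented, and both "implementations" simply call memcpy(). It relies on the GCC/clang constructor attribute and __builtin_cpu_supports().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of these come from the patch. */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

#define SMALL_COPY_THRESH 128 /* assumed value, for illustration only */

/* Stand-ins for ISA-specific implementations. */
static void *copy_baseline(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_wide(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Safe default in case the constructor has not run yet. */
static copy_fn_t copy_ptr = copy_baseline;

/* Runs before main() on GCC and clang; picks an implementation once. */
__attribute__((constructor))
static void select_copy(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		copy_ptr = copy_wide;
#endif
}

static inline void *dispatch_copy(void *dst, const void *src, size_t n)
{
	if (n <= SMALL_COPY_THRESH)
		return copy_baseline(dst, src, n); /* inlined small-size path */
	return (*copy_ptr)(dst, src, n);       /* run-time selected path */
}
```

Small copies stay on the inlined path, so they avoid the indirect call; only larger copies pay for the pointer dereference.
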
> 
>  #ifdef __cplusplus
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
> new file mode 100644
> index 0000000..c83351a
> --- /dev/null
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
> @@ -0,0 +1,291 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_memcpy.h>
> +
> +#ifndef CC_SUPPORT_AVX2
> +#error CC_SUPPORT_AVX2 not defined
> +#endif
> +
> +void *
> +rte_memcpy_avx2(void *dst, const void *src, size_t n)
> +{
> +	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x1F)) {
> +		void *ret = dst;
> +
> +		/* Copy size < 16 bytes */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dst = *(const uint8_t *)src;
> +				src = (const uint8_t *)src + 1;
> +				dst = (uint8_t *)dst + 1;
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dst = *(const uint16_t *)src;
> +				src = (const uint16_t *)src + 1;
> +				dst = (uint16_t *)dst + 1;
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dst = *(const uint32_t *)src;
> +				src = (const uint32_t *)src + 1;
> +				dst = (uint32_t *)dst + 1;
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +			return ret;
> +		}
> +
> +		/* Copy 16 <= size <= 32 bytes */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 32 < size <= 64 bytes */
> +		if (n <= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src - 32 + n));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst - 32 + n), ymm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 64-byte blocks */
> +		for (; n >= 64; n -= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 32));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 32), ymm1);
> +			dst = (uint8_t *)dst + 64;
> +			src = (const uint8_t *)src + 64;
> +		}
> +
> +		/* Copy whatever is left */
> +		__m256i ymm0, ymm1;
> +		ymm0 = _mm256_loadu_si256((const __m256i *)
> +			((const uint8_t *)src - 64 + n));
> +		ymm1 = _mm256_loadu_si256((const __m256i *)
> +			((const uint8_t *)src - 32 + n));
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst - 64 + n), ymm0);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst - 32 + n), ymm1);
> +
> +		return ret;
> +	} else {
> +		uintptr_t dstu = (uintptr_t)dst;
> +		uintptr_t srcu = (uintptr_t)src;
> +		void *ret = dst;
> +		size_t dstofss;
> +		size_t bits;
> +
> +		/**
> +		 * Copy less than 16 bytes
> +		 */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dstu = *(const uint8_t *)srcu;
> +				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint8_t *)dstu + 1);
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dstu = *(const uint16_t *)srcu;
> +				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint16_t *)dstu + 1);
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dstu = *(const uint32_t *)srcu;
> +				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint32_t *)dstu + 1);
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dstu = *(const uint64_t *)srcu;
> +			return ret;
> +		}
> +
> +		/**
> +		 * Fast way when copy size doesn't exceed 256 bytes
> +		 */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +			return ret;
> +		}
> +		if (n <= 48) {
> +			__m128i xmm0, xmm1, xmm2;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm2);
> +			return ret;
> +		}
> +		if (n <= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src - 32 + n));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst - 32 + n), ymm1);
> +			return ret;
> +		}
> +		if (n <= 256) {
> +			if (n >= 128) {
> +				n -= 128;
> +				__m256i ymm0, ymm1, ymm2, ymm3;
> +				ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +				ymm1 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src + 32));
> +				ymm2 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src + 2*32));
> +				ymm3 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src + 3*32));
> +				_mm256_storeu_si256((__m256i *)dst, ymm0);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst + 32), ymm1);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst + 2*32), ymm2);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst + 3*32), ymm3);
> +				src = (const uint8_t *)src + 128;
> +				dst = (uint8_t *)dst + 128;
> +			}
> +COPY_BLOCK_128_BACK31:
> +			if (n >= 64) {
> +				n -= 64;
> +				__m256i ymm0, ymm1;
> +				ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +				ymm1 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src + 32));
> +				_mm256_storeu_si256((__m256i *)dst, ymm0);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst + 32), ymm1);
> +				src = (const uint8_t *)src + 64;
> +				dst = (uint8_t *)dst + 64;
> +			}
> +			if (n > 32) {
> +				__m256i ymm0, ymm1;
> +				ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +				ymm1 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src - 32 + n));
> +				_mm256_storeu_si256((__m256i *)dst, ymm0);
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst - 32 + n), ymm1);
> +				return ret;
> +			}
> +			if (n > 0) {
> +				__m256i ymm0;
> +				ymm0 = _mm256_loadu_si256((const __m256i *)
> +					((const uint8_t *)src - 32 + n));
> +				_mm256_storeu_si256((__m256i *)
> +					((uint8_t *)dst - 32 + n), ymm0);
> +			}
> +			return ret;
> +		}
> +
> +		/**
> +		 * Make store aligned when copy size exceeds 256 bytes
> +		 */
> +		dstofss = (uintptr_t)dst & 0x1F;
> +		if (dstofss > 0) {
> +			dstofss = 32 - dstofss;
> +			n -= dstofss;
> +			__m256i ymm0;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			src = (const uint8_t *)src + dstofss;
> +			dst = (uint8_t *)dst + dstofss;
> +		}
> +
> +		/**
> +		 * Copy 128-byte blocks
> +		 */
> +		__m256i ymm0, ymm1, ymm2, ymm3;
> +
> +		while (n >= 128) {
> +			ymm0 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 0 * 32));
> +			n -= 128;
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 1 * 32));
> +			ymm2 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 2 * 32));
> +			ymm3 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 3 * 32));
> +			src = (const uint8_t *)src + 128;
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 0 * 32), ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 1 * 32), ymm1);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 2 * 32), ymm2);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst + 3 * 32), ymm3);
> +			dst = (uint8_t *)dst + 128;
> +		}
> +		bits = n;
> +		n = n & 127;
> +		bits -= n;
> +		src = (const uint8_t *)src + bits;
> +		dst = (uint8_t *)dst + bits;
> +
> +		/**
> +		 * Copy whatever is left
> +		 */
> +		goto COPY_BLOCK_128_BACK31;
> +	}
> +}
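
The "16 <= size <= 32" branches above copy the first 16 bytes and the last 16 bytes, letting the two blocks overlap in the middle instead of looping over a tail. A portable sketch of that idea, assuming non-overlapping buffers; mov16 and copy_16_to_32 are hypothetical names, and fixed-size memcpy() calls stand in for the unaligned SSE load/store pairs:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy exactly 16 bytes; memcpy with a constant size stands in for
 * the unaligned SSE load/store pair used in the patch. */
static void mov16(uint8_t *dst, const uint8_t *src)
{
	uint8_t tmp[16];

	memcpy(tmp, src, 16);
	memcpy(dst, tmp, 16);
}

/* Valid only for 16 <= n <= 32 and non-overlapping buffers. */
static void *copy_16_to_32(void *dst, const void *src, size_t n)
{
	/* Head: bytes [0, 16). */
	mov16((uint8_t *)dst, (const uint8_t *)src);
	/* Tail: bytes [n - 16, n); overlaps the head when n < 32. */
	mov16((uint8_t *)dst + n - 16, (const uint8_t *)src + n - 16);
	return dst;
}
```

The overlapping bytes are written twice with the same values, which is why the trick needs no branch on the exact size.
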
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
> new file mode 100644
> index 0000000..c8a9d20
> --- /dev/null
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
> @@ -0,0 +1,316 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_memcpy.h>
> +
> +#ifndef CC_SUPPORT_AVX512F
> +#error CC_SUPPORT_AVX512F not defined
> +#endif
> +
> +void *
> +rte_memcpy_avx512f(void *dst, const void *src, size_t n)
> +{
> +	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x3F)) {
> +		void *ret = dst;
> +
> +		/* Copy size < 16 bytes */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dst = *(const uint8_t *)src;
> +				src = (const uint8_t *)src + 1;
> +				dst = (uint8_t *)dst + 1;
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dst = *(const uint16_t *)src;
> +				src = (const uint16_t *)src + 1;
> +				dst = (uint16_t *)dst + 1;
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dst = *(const uint32_t *)src;
> +				src = (const uint32_t *)src + 1;
> +				dst = (uint32_t *)dst + 1;
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +			return ret;
> +		}
> +
> +		/* Copy 16 <= size <= 32 bytes */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 32 < size <= 64 bytes */
> +		if (n <= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src - 32 + n));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst - 32 + n), ymm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 64-byte blocks */
> +		for (; n >= 64; n -= 64) {
> +			__m512i zmm0;
> +			zmm0 = _mm512_loadu_si512((const void *)src);
> +			_mm512_storeu_si512((void *)dst, zmm0);
> +			dst = (uint8_t *)dst + 64;
> +			src = (const uint8_t *)src + 64;
> +		}
> +
> +		/* Copy whatever is left */
> +		__m512i zmm0;
> +		zmm0 = _mm512_loadu_si512((const void *)
> +			((const uint8_t *)src - 64 + n));
> +		_mm512_storeu_si512((void *)((uint8_t *)dst - 64 + n), zmm0);
> +
> +		return ret;
> +	} else {
> +		uintptr_t dstu = (uintptr_t)dst;
> +		uintptr_t srcu = (uintptr_t)src;
> +		void *ret = dst;
> +		size_t dstofss;
> +		size_t bits;
> +
> +		/**
> +		 * Copy less than 16 bytes
> +		 */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dstu = *(const uint8_t *)srcu;
> +				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint8_t *)dstu + 1);
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dstu = *(const uint16_t *)srcu;
> +				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint16_t *)dstu + 1);
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dstu = *(const uint32_t *)srcu;
> +				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint32_t *)dstu + 1);
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dstu = *(const uint64_t *)srcu;
> +			return ret;
> +		}
> +
> +		/**
> +		 * Fast way when copy size doesn't exceed 512 bytes
> +		 */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +			return ret;
> +		}
> +		if (n <= 64) {
> +			__m256i ymm0, ymm1;
> +			ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +			ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src - 32 + n));
> +			_mm256_storeu_si256((__m256i *)dst, ymm0);
> +			_mm256_storeu_si256((__m256i *)
> +				((uint8_t *)dst - 32 + n), ymm1);
> +			return ret;
> +		}
> +		if (n <= 512) {
> +			if (n >= 256) {
> +				n -= 256;
> +				__m512i zmm0, zmm1, zmm2, zmm3;
> +				zmm0 = _mm512_loadu_si512((const void *)src);
> +				zmm1 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 64));
> +				zmm2 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 2*64));
> +				zmm3 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 3*64));
> +				_mm512_storeu_si512((void *)dst, zmm0);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 64), zmm1);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 2*64), zmm2);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 3*64), zmm3);
> +				src = (const uint8_t *)src + 256;
> +				dst = (uint8_t *)dst + 256;
> +			}
> +			if (n >= 128) {
> +				n -= 128;
> +				__m512i zmm0, zmm1;
> +				zmm0 = _mm512_loadu_si512((const void *)src);
> +				zmm1 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 64));
> +				_mm512_storeu_si512((void *)dst, zmm0);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 64), zmm1);
> +				src = (const uint8_t *)src + 128;
> +				dst = (uint8_t *)dst + 128;
> +			}
> +COPY_BLOCK_128_BACK63:
> +			if (n > 64) {
> +				__m512i zmm0, zmm1;
> +				zmm0 = _mm512_loadu_si512((const void *)src);
> +				zmm1 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src - 64 + n));
> +				_mm512_storeu_si512((void *)dst, zmm0);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst - 64 + n), zmm1);
> +				return ret;
> +			}
> +			if (n > 0) {
> +				__m512i zmm0;
> +				zmm0 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src - 64 + n));
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst - 64 + n), zmm0);
> +			}
> +			return ret;
> +		}
> +
> +		/**
> +		 * Make store aligned when copy size exceeds 512 bytes
> +		 */
> +		dstofss = ((uintptr_t)dst & 0x3F);
> +		if (dstofss > 0) {
> +			dstofss = 64 - dstofss;
> +			n -= dstofss;
> +			__m512i zmm0;
> +			zmm0 = _mm512_loadu_si512((const void *)src);
> +			_mm512_storeu_si512((void *)dst, zmm0);
> +			src = (const uint8_t *)src + dstofss;
> +			dst = (uint8_t *)dst + dstofss;
> +		}
> +
> +		/**
> +		 * Copy 512-byte blocks.
> +		 * Use copy block function for better instruction order control,
> +		 * which is important when load is unaligned.
> +		 */
> +		__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
> +
> +		while (n >= 512) {
> +			zmm0 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 0 * 64));
> +			n -= 512;
> +			zmm1 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 1 * 64));
> +			zmm2 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 2 * 64));
> +			zmm3 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 3 * 64));
> +			zmm4 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 4 * 64));
> +			zmm5 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 5 * 64));
> +			zmm6 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 6 * 64));
> +			zmm7 = _mm512_loadu_si512((const void *)
> +				((const uint8_t *)src + 7 * 64));
> +			src = (const uint8_t *)src + 512;
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 0 * 64), zmm0);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 1 * 64), zmm1);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 2 * 64), zmm2);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 3 * 64), zmm3);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 4 * 64), zmm4);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 5 * 64), zmm5);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 6 * 64), zmm6);
> +			_mm512_storeu_si512((void *)
> +				((uint8_t *)dst + 7 * 64), zmm7);
> +			dst = (uint8_t *)dst + 512;
> +		}
> +		bits = n;
> +		n = n & 511;
> +		bits -= n;
> +		src = (const uint8_t *)src + bits;
> +		dst = (uint8_t *)dst + bits;
> +
> +		/**
> +		 * Copy 128-byte blocks.
> +		 * Use copy block function for better instruction order control,
> +		 * which is important when load is unaligned.
> +		 */
> +		if (n >= 128) {
> +			__m512i zmm0, zmm1;
> +
> +			while (n >= 128) {
> +				zmm0 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 0 * 64));
> +				n -= 128;
> +				zmm1 = _mm512_loadu_si512((const void *)
> +					((const uint8_t *)src + 1 * 64));
> +				src = (const uint8_t *)src + 128;
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 0 * 64), zmm0);
> +				_mm512_storeu_si512((void *)
> +					((uint8_t *)dst + 1 * 64), zmm1);
> +				dst = (uint8_t *)dst + 128;
> +			}
> +			bits = n;
> +			n = n & 127;
> +			bits -= n;
> +			src = (const uint8_t *)src + bits;
> +			dst = (uint8_t *)dst + bits;
> +		}
> +
> +		/**
> +		 * Copy whatever is left
> +		 */
> +		goto COPY_BLOCK_128_BACK63;
> +	}
> +}
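
The "make store aligned" step above copies one full unaligned 64-byte block, then advances both pointers by just enough to bring dst onto a 64-byte boundary. A minimal sketch of that arithmetic, with hypothetical names (bytes_to_align64, copy_align_dst64) and memcpy() standing in for the vector loads and stores; it assumes n >= 64 and non-overlapping buffers:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Bytes to advance so that p becomes 64-byte aligned; 0 if it already is. */
static size_t bytes_to_align64(const void *p)
{
	size_t ofs = (uintptr_t)p & 0x3F;

	return ofs ? 64 - ofs : 0;
}

/* Requires n >= 64 and non-overlapping buffers. */
static void *copy_align_dst64(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t head = bytes_to_align64(d);

	if (head > 0) {
		memcpy(d, s, 64); /* one full unaligned block */
		d += head;        /* d is now 64-byte aligned */
		s += head;
		n -= head;
	}
	memcpy(d, s, n);      /* stand-in for the aligned-store block loop */
	return dst;
}
```

Some bytes just past the head are written twice, which trades a little redundant work for aligned stores in the main loop.
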
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
> new file mode 100644
> index 0000000..d17fb5b
> --- /dev/null
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
> @@ -0,0 +1,909 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#ifndef _RTE_MEMCPY_INTERNAL_X86_64_H_
> +#define _RTE_MEMCPY_INTERNAL_X86_64_H_
> +
> +/**
> + * @file
> + *
> + * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
> + */
> +
> +#include <stdio.h>
> +#include <stdint.h>
> +#include <string.h>
> +#include <rte_vect.h>
> +#include <rte_common.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * Copy bytes from one location to another. The locations must not overlap.
> + *
> + * @note This is implemented as a macro, so its address should not be taken
> + * and care is needed as parameter expressions may be evaluated multiple times.
> + *
> + * @param dst
> + *   Pointer to the destination of the data.
> + * @param src
> + *   Pointer to the source data.
> + * @param n
> + *   Number of bytes to copy.
> + * @return
> + *   Pointer to the destination data.
> + */
> +
> +#ifdef RTE_MACHINE_CPUFLAG_AVX512F
> +
> +#define ALIGNMENT_MASK 0x3F
> +
> +/**
> + * AVX512 implementation below
> + */
> +
> +/**
> + * Copy 16 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov16(uint8_t *dst, const uint8_t *src)
> +{
> +	__m128i xmm0;
> +
> +	xmm0 = _mm_loadu_si128((const __m128i *)src);
> +	_mm_storeu_si128((__m128i *)dst, xmm0);
> +}
> +
> +/**
> + * Copy 32 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov32(uint8_t *dst, const uint8_t *src)
> +{
> +	__m256i ymm0;
> +
> +	ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +	_mm256_storeu_si256((__m256i *)dst, ymm0);
> +}
> +
> +/**
> + * Copy 64 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov64(uint8_t *dst, const uint8_t *src)
> +{
> +	__m512i zmm0;
> +
> +	zmm0 = _mm512_loadu_si512((const void *)src);
> +	_mm512_storeu_si512((void *)dst, zmm0);
> +}
> +
> +/**
> + * Copy 128 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov128(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov64(dst + 0 * 64, src + 0 * 64);
> +	rte_mov64(dst + 1 * 64, src + 1 * 64);
> +}
> +
> +/**
> + * Copy 256 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov256(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov64(dst + 0 * 64, src + 0 * 64);
> +	rte_mov64(dst + 1 * 64, src + 1 * 64);
> +	rte_mov64(dst + 2 * 64, src + 2 * 64);
> +	rte_mov64(dst + 3 * 64, src + 3 * 64);
> +}
> +
> +/**
> + * Copy 128-byte blocks from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
> +{
> +	__m512i zmm0, zmm1;
> +
> +	while (n >= 128) {
> +		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
> +		n -= 128;
> +		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
> +		src = src + 128;
> +		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
> +		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
> +		dst = dst + 128;
> +	}
> +}
> +
> +/**
> + * Copy 512-byte blocks from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
> +{
> +	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
> +
> +	while (n >= 512) {
> +		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
> +		n -= 512;
> +		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
> +		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
> +		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
> +		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
> +		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
> +		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
> +		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
> +		src = src + 512;
> +		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
> +		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
> +		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
> +		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
> +		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
> +		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
> +		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
> +		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
> +		dst = dst + 512;
> +	}
> +}
> +
> +static inline void *
> +rte_memcpy_generic(void *dst, const void *src, size_t n)
> +{
> +	uintptr_t dstu = (uintptr_t)dst;
> +	uintptr_t srcu = (uintptr_t)src;
> +	void *ret = dst;
> +	size_t dstofss;
> +	size_t bits;
> +
> +	/**
> +	 * Copy less than 16 bytes
> +	 */
> +	if (n < 16) {
> +		if (n & 0x01) {
> +			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> +			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> +			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> +		}
> +		if (n & 0x02) {
> +			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> +			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> +			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> +		}
> +		if (n & 0x04) {
> +			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> +			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> +			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> +		}
> +		if (n & 0x08)
> +			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> +		return ret;
> +	}
> +
> +	/**
> +	 * Fast way when copy size doesn't exceed 512 bytes
> +	 */
> +	if (n <= 32) {
> +		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				  (const uint8_t *)src - 16 + n);
> +		return ret;
> +	}
> +	if (n <= 64) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov32((uint8_t *)dst - 32 + n,
> +				  (const uint8_t *)src - 32 + n);
> +		return ret;
> +	}
> +	if (n <= 512) {
> +		if (n >= 256) {
> +			n -= 256;
> +			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 256;
> +			dst = (uint8_t *)dst + 256;
> +		}
> +		if (n >= 128) {
> +			n -= 128;
> +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 128;
> +			dst = (uint8_t *)dst + 128;
> +		}
> +COPY_BLOCK_128_BACK63:
> +		if (n > 64) {
> +			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +			rte_mov64((uint8_t *)dst - 64 + n,
> +					  (const uint8_t *)src - 64 + n);
> +			return ret;
> +		}
> +		if (n > 0)
> +			rte_mov64((uint8_t *)dst - 64 + n,
> +					  (const uint8_t *)src - 64 + n);
> +		return ret;
> +	}
> +
> +	/**
> +	 * Make store aligned when copy size exceeds 512 bytes
> +	 */
> +	dstofss = ((uintptr_t)dst & 0x3F);
> +	if (dstofss > 0) {
> +		dstofss = 64 - dstofss;
> +		n -= dstofss;
> +		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +		src = (const uint8_t *)src + dstofss;
> +		dst = (uint8_t *)dst + dstofss;
> +	}
> +
> +	/**
> +	 * Copy 512-byte blocks.
> +	 * Use copy block function for better instruction order control,
> +	 * which is important when load is unaligned.
> +	 */
> +	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
> +	bits = n;
> +	n = n & 511;
> +	bits -= n;
> +	src = (const uint8_t *)src + bits;
> +	dst = (uint8_t *)dst + bits;
> +
> +	/**
> +	 * Copy 128-byte blocks.
> +	 * Use copy block function for better instruction order control,
> +	 * which is important when load is unaligned.
> +	 */
> +	if (n >= 128) {
> +		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
> +		bits = n;
> +		n = n & 127;
> +		bits -= n;
> +		src = (const uint8_t *)src + bits;
> +		dst = (uint8_t *)dst + bits;
> +	}
> +
> +	/**
> +	 * Copy whatever left
> +	 */
> +	goto COPY_BLOCK_128_BACK63;
> +}
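The bits/n bookkeeping after rte_mov512blocks() is easy to misread, so here is a minimal portable sketch of the same arithmetic. It assumes plain memcpy() in place of the eight 64-byte vector moves, and the names copy_512_blocks and copy_blocks_and_split are hypothetical, not part of the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rte_mov512blocks(): copy whole 512-byte blocks. */
static void copy_512_blocks(uint8_t *dst, const uint8_t *src, size_t n)
{
	while (n >= 512) {
		memcpy(dst, src, 512);	/* stands in for the vector loads/stores */
		n -= 512;
		src += 512;
		dst += 512;
	}
}

/*
 * Copy the whole-block part of n bytes. Returns the number of bytes the
 * block loop consumed and stores the remainder in *left.
 */
static size_t copy_blocks_and_split(uint8_t *dst, const uint8_t *src,
				    size_t n, size_t *left)
{
	assert(left != NULL);
	copy_512_blocks(dst, src, n);
	*left = n & 511;	/* n mod 512, valid because 512 is a power of two */
	return n - *left;	/* what the patch calls "bits" */
}
```

The caller would then advance both pointers by the returned value and hand the remainder to the smaller-block path, which is what the quoted code does with "bits".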
> +
> +#elif defined RTE_MACHINE_CPUFLAG_AVX2
> +
> +#define ALIGNMENT_MASK 0x1F
> +
> +/**
> + * AVX2 implementation below
> + */
> +
> +/**
> + * Copy 16 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov16(uint8_t *dst, const uint8_t *src)
> +{
> +	__m128i xmm0;
> +
> +	xmm0 = _mm_loadu_si128((const __m128i *)src);
> +	_mm_storeu_si128((__m128i *)dst, xmm0);
> +}
> +
> +/**
> + * Copy 32 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov32(uint8_t *dst, const uint8_t *src)
> +{
> +	__m256i ymm0;
> +
> +	ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +	_mm256_storeu_si256((__m256i *)dst, ymm0);
> +}
> +
> +/**
> + * Copy 64 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov64(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> +}
> +
> +/**
> + * Copy 128 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov128(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> +	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> +	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> +}
> +
> +/**
> + * Copy 128-byte blocks from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
> +{
> +	__m256i ymm0, ymm1, ymm2, ymm3;
> +
> +	while (n >= 128) {
> +		ymm0 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 0 * 32));
> +		n -= 128;
> +		ymm1 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 1 * 32));
> +		ymm2 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 2 * 32));
> +		ymm3 = _mm256_loadu_si256((const __m256i *)
> +				((const uint8_t *)src + 3 * 32));
> +		src = (const uint8_t *)src + 128;
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
> +		dst = (uint8_t *)dst + 128;
> +	}
> +}
> +
> +static inline void *
> +rte_memcpy_generic(void *dst, const void *src, size_t n)
> +{
> +	uintptr_t dstu = (uintptr_t)dst;
> +	uintptr_t srcu = (uintptr_t)src;
> +	void *ret = dst;
> +	size_t dstofss;
> +	size_t bits;
> +
> +	/**
> +	 * Copy less than 16 bytes
> +	 */
> +	if (n < 16) {
> +		if (n & 0x01) {
> +			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> +			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> +			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> +		}
> +		if (n & 0x02) {
> +			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> +			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> +			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> +		}
> +		if (n & 0x04) {
> +			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> +			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> +			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> +		}
> +		if (n & 0x08)
> +			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> +		return ret;
> +	}
> +
> +	/**
> +	 * Fast way when copy size doesn't exceed 256 bytes
> +	 */
> +	if (n <= 32) {
> +		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				(const uint8_t *)src - 16 + n);
> +		return ret;
> +	}
> +	if (n <= 48) {
> +		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				(const uint8_t *)src - 16 + n);
> +		return ret;
> +	}
> +	if (n <= 64) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov32((uint8_t *)dst - 32 + n,
> +				(const uint8_t *)src - 32 + n);
> +		return ret;
> +	}
> +	if (n <= 256) {
> +		if (n >= 128) {
> +			n -= 128;
> +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 128;
> +			dst = (uint8_t *)dst + 128;
> +		}
> +COPY_BLOCK_128_BACK31:
> +		if (n >= 64) {
> +			n -= 64;
> +			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 64;
> +			dst = (uint8_t *)dst + 64;
> +		}
> +		if (n > 32) {
> +			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +			rte_mov32((uint8_t *)dst - 32 + n,
> +					(const uint8_t *)src - 32 + n);
> +			return ret;
> +		}
> +		if (n > 0) {
> +			rte_mov32((uint8_t *)dst - 32 + n,
> +					(const uint8_t *)src - 32 + n);
> +		}
> +		return ret;
> +	}
> +
> +	/**
> +	 * Make store aligned when copy size exceeds 256 bytes
> +	 */
> +	dstofss = (uintptr_t)dst & 0x1F;
> +	if (dstofss > 0) {
> +		dstofss = 32 - dstofss;
> +		n -= dstofss;
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		src = (const uint8_t *)src + dstofss;
> +		dst = (uint8_t *)dst + dstofss;
> +	}
> +
> +	/**
> +	 * Copy 128-byte blocks
> +	 */
> +	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
> +	bits = n;
> +	n = n & 127;
> +	bits -= n;
> +	src = (const uint8_t *)src + bits;
> +	dst = (uint8_t *)dst + bits;
> +
> +	/**
> +	 * Copy whatever left
> +	 */
> +	goto COPY_BLOCK_128_BACK31;
> +}
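The "n <= 32" and "n <= 64" branches rely on two fixed-size moves that may overlap in the middle. A minimal sketch of that idea, assuming memcpy() of a constant size in place of rte_mov16(); copy_16_to_32 is a hypothetical name used only for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes, 16 <= n <= 32, with two fixed 16-byte moves. */
static void copy_16_to_32(uint8_t *dst, const uint8_t *src, size_t n)
{
	assert(n >= 16 && n <= 32);
	memcpy(dst, src, 16);			/* head: bytes [0, 16) */
	memcpy(dst + n - 16, src + n - 16, 16);	/* tail: bytes [n - 16, n) */
}
```

When n < 32 the two moves write some bytes twice with the same values, which is harmless as long as source and destination do not overlap, and it avoids a length-dependent loop.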
> +
> +#else /* RTE_MACHINE_CPUFLAG */
> +
> +#define ALIGNMENT_MASK 0x0F
> +
> +/**
> + * SSE & AVX implementation below
> + */
> +
> +/**
> + * Copy 16 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov16(uint8_t *dst, const uint8_t *src)
> +{
> +	__m128i xmm0;
> +
> +	xmm0 = _mm_loadu_si128((const __m128i *)src);
> +	_mm_storeu_si128((__m128i *)dst, xmm0);
> +}
> +
> +/**
> + * Copy 32 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov32(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +}
> +
> +/**
> + * Copy 64 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov64(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> +}
> +
> +/**
> + * Copy 128 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov128(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> +	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> +	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> +	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> +	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> +}
> +
> +/**
> + * Copy 256 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov256(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> +	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> +	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> +	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> +	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> +	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
> +	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
> +	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
> +	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
> +	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
> +	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
> +	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
> +	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
> +}
> +
> +/**
> + * Macro for copying unaligned block from one location to another with constant load offset,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
> +__extension__ ({                                                                                            \
> +    size_t tmp;                                                                                             \
> +    while (len >= 128 + 16 - offset) {                                                                      \
> +        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
> +        len -= 128;                                                                                         \
> +        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
> +        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
> +        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
> +        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
> +        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
> +        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
> +        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
> +        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
> +        src = (const uint8_t *)src + 128;                                                                   \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
> +        dst = (uint8_t *)dst + 128;                                                                         \
> +    }                                                                                                       \
> +    tmp = len;                                                                                              \
> +    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
> +    tmp -= len;                                                                                             \
> +    src = (const uint8_t *)src + tmp;                                                                       \
> +    dst = (uint8_t *)dst + tmp;                                                                             \
> +    if (len >= 32 + 16 - offset) {                                                                          \
> +        while (len >= 32 + 16 - offset) {                                                                   \
> +            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
> +            len -= 32;                                                                                      \
> +            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
> +            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
> +            src = (const uint8_t *)src + 32;                                                                \
> +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
> +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
> +            dst = (uint8_t *)dst + 32;                                                                      \
> +        }                                                                                                   \
> +        tmp = len;                                                                                          \
> +        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
> +        tmp -= len;                                                                                         \
> +        src = (const uint8_t *)src + tmp;                                                                   \
> +        dst = (uint8_t *)dst + tmp;                                                                         \
> +    }                                                                                                       \
> +})
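To make the macro's load pattern easier to check: each store is built from two neighbouring 16-byte loads taken at (src - offset) and stitched together with _mm_alignr_epi8(). Below is a scalar model of that step, not the intrinsic itself. alignr_model and load_with_offset are hypothetical names, and the model assumes a shift count between 0 and 16:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Scalar model of _mm_alignr_epi8(hi, lo, shift): concatenate hi:lo
 * (lo in the low 16 bytes), shift right by `shift` bytes, keep 16 bytes.
 */
static void alignr_model(uint8_t out[16], const uint8_t hi[16],
			 const uint8_t lo[16], unsigned shift)
{
	uint8_t cat[32];

	assert(shift <= 16);
	memcpy(cat, lo, 16);
	memcpy(cat + 16, hi, 16);
	memcpy(out, cat + shift, 16);
}

/* Rebuild the 16 bytes at base + offset from loads at base and base + 16. */
static void load_with_offset(uint8_t out[16], const uint8_t *base,
			     unsigned offset)
{
	assert(offset >= 1 && offset <= 15);
	alignr_model(out, base + 16, base, offset);
}
```

Here "base" plays the role of (src - offset) in the macro, which is why the requirements list asks for <offset> bytes before src to be readable.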
> +
> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Use switch here because the aligning instruction requires immediate value for shift count.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be within [1, 15]
> + * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
> +__extension__ ({                                                      \
> +    switch (offset) {                                                 \
> +    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
> +    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
> +    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
> +    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
> +    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
> +    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
> +    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
> +    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
> +    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
> +    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
> +    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
> +    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
> +    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
> +    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
> +    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
> +    default:;                                                         \
> +    }                                                                 \
> +})
> +
> +static inline void *
> +rte_memcpy_generic(void *dst, const void *src, size_t n)
> +{
> +	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> +	uintptr_t dstu = (uintptr_t)dst;
> +	uintptr_t srcu = (uintptr_t)src;
> +	void *ret = dst;
> +	size_t dstofss;
> +	size_t srcofs;
> +
> +	/**
> +	 * Copy less than 16 bytes
> +	 */
> +	if (n < 16) {
> +		if (n & 0x01) {
> +			*(uint8_t *)dstu = *(const uint8_t *)srcu;
> +			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> +			dstu = (uintptr_t)((uint8_t *)dstu + 1);
> +		}
> +		if (n & 0x02) {
> +			*(uint16_t *)dstu = *(const uint16_t *)srcu;
> +			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> +			dstu = (uintptr_t)((uint16_t *)dstu + 1);
> +		}
> +		if (n & 0x04) {
> +			*(uint32_t *)dstu = *(const uint32_t *)srcu;
> +			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> +			dstu = (uintptr_t)((uint32_t *)dstu + 1);
> +		}
> +		if (n & 0x08)
> +			*(uint64_t *)dstu = *(const uint64_t *)srcu;
> +		return ret;
> +	}
> +
> +	/**
> +	 * Fast way when copy size doesn't exceed 512 bytes
> +	 */
> +	if (n <= 32) {
> +		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				(const uint8_t *)src - 16 + n);
> +		return ret;
> +	}
> +	if (n <= 48) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				(const uint8_t *)src - 16 + n);
> +		return ret;
> +	}
> +	if (n <= 64) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				(const uint8_t *)src - 16 + n);
> +		return ret;
> +	}
> +	if (n <= 128)
> +		goto COPY_BLOCK_128_BACK15;
> +	if (n <= 512) {
> +		if (n >= 256) {
> +			n -= 256;
> +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> +			rte_mov128((uint8_t *)dst + 128,
> +					(const uint8_t *)src + 128);
> +			src = (const uint8_t *)src + 256;
> +			dst = (uint8_t *)dst + 256;
> +		}
> +COPY_BLOCK_255_BACK15:
> +		if (n >= 128) {
> +			n -= 128;
> +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 128;
> +			dst = (uint8_t *)dst + 128;
> +		}
> +COPY_BLOCK_128_BACK15:
> +		if (n >= 64) {
> +			n -= 64;
> +			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 64;
> +			dst = (uint8_t *)dst + 64;
> +		}
> +COPY_BLOCK_64_BACK15:
> +		if (n >= 32) {
> +			n -= 32;
> +			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 32;
> +			dst = (uint8_t *)dst + 32;
> +		}
> +		if (n > 16) {
> +			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +			rte_mov16((uint8_t *)dst - 16 + n,
> +					(const uint8_t *)src - 16 + n);
> +			return ret;
> +		}
> +		if (n > 0) {
> +			rte_mov16((uint8_t *)dst - 16 + n,
> +					(const uint8_t *)src - 16 + n);
> +		}
> +		return ret;
> +	}
> +
> +	/**
> +	 * Make store aligned when copy size exceeds 512 bytes,
> +	 * and make sure the first 15 bytes are copied, because
> +	 * unaligned copy functions require up to 15 bytes
> +	 * backwards access.
> +	 */
> +	dstofss = (uintptr_t)dst & 0x0F;
> +	if (dstofss > 0) {
> +		dstofss = 16 - dstofss + 16;
> +		n -= dstofss;
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		src = (const uint8_t *)src + dstofss;
> +		dst = (uint8_t *)dst + dstofss;
> +	}
> +	srcofs = ((uintptr_t)src & 0x0F);
> +
> +	/**
> +	 * For aligned copy
> +	 */
> +	if (srcofs == 0) {
> +		/**
> +		 * Copy 256-byte blocks
> +		 */
> +		for (; n >= 256; n -= 256) {
> +			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> +			dst = (uint8_t *)dst + 256;
> +			src = (const uint8_t *)src + 256;
> +		}
> +
> +		/**
> +		 * Copy whatever left
> +		 */
> +		goto COPY_BLOCK_255_BACK15;
> +	}
> +
> +	/**
> +	 * For copy with unaligned load
> +	 */
> +	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> +
> +	/**
> +	 * Copy whatever left
> +	 */
> +	goto COPY_BLOCK_64_BACK15;
> +}
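The "make store aligned" prologue copies one full unaligned block at the head and then advances by only the distance to the next boundary, so the following aligned stores rewrite a few bytes. A minimal sketch under the assumption that memcpy() replaces the vector move; bytes_to_align and align_dst_prologue are hypothetical names. The SSE path above adds a further 16 bytes to the skip, which this sketch does not model:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Distance from p to the next multiple of align (a power of two). */
static size_t bytes_to_align(const void *p, size_t align)
{
	size_t off = (uintptr_t)p & (align - 1);

	return off ? align - off : 0;
}

/*
 * If *dst is unaligned, copy one full block of `align` bytes and advance
 * both pointers to the first aligned destination address.
 * Returns the number of bytes still to be copied.
 */
static size_t align_dst_prologue(uint8_t **dst, const uint8_t **src,
				 size_t n, size_t align)
{
	size_t skip = bytes_to_align(*dst, align);

	assert(n >= align);	/* the head move always writes `align` bytes */
	if (skip > 0) {
		memcpy(*dst, *src, align);
		*dst += skip;
		*src += skip;
		n -= skip;
	}
	return n;
}
```

Only the destination ends up aligned; the source may still be misaligned, which is why the SSE path goes on to check srcofs.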
> +
> +#endif /* RTE_MACHINE_CPUFLAG */
> +
> +static inline void *
> +rte_memcpy_aligned(void *dst, const void *src, size_t n)
> +{
> +	void *ret = dst;
> +
> +	/* Copy size < 16 bytes */
> +	if (n < 16) {
> +		if (n & 0x01) {
> +			*(uint8_t *)dst = *(const uint8_t *)src;
> +			src = (const uint8_t *)src + 1;
> +			dst = (uint8_t *)dst + 1;
> +		}
> +		if (n & 0x02) {
> +			*(uint16_t *)dst = *(const uint16_t *)src;
> +			src = (const uint16_t *)src + 1;
> +			dst = (uint16_t *)dst + 1;
> +		}
> +		if (n & 0x04) {
> +			*(uint32_t *)dst = *(const uint32_t *)src;
> +			src = (const uint32_t *)src + 1;
> +			dst = (uint32_t *)dst + 1;
> +		}
> +		if (n & 0x08)
> +			*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +		return ret;
> +	}
> +
> +	/* Copy 16 <= size <= 32 bytes */
> +	if (n <= 32) {
> +		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n,
> +				(const uint8_t *)src - 16 + n);
> +
> +		return ret;
> +	}
> +
> +	/* Copy 32 < size <= 64 bytes */
> +	if (n <= 64) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov32((uint8_t *)dst - 32 + n,
> +				(const uint8_t *)src - 32 + n);
> +
> +		return ret;
> +	}
> +
> +	/* Copy 64 bytes blocks */
> +	for (; n >= 64; n -= 64) {
> +		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +		dst = (uint8_t *)dst + 64;
> +		src = (const uint8_t *)src + 64;
> +	}
> +
> +	/* Copy whatever left */
> +	rte_mov64((uint8_t *)dst - 64 + n,
> +			(const uint8_t *)src - 64 + n);
> +
> +	return ret;
> +}
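The sub-16-byte path appears four times in this file, so a compact model may help when reviewing it. It tests the bits of n and copies 1, 2, 4 and 8 bytes in turn. This sketch swaps the casted integer loads and stores for fixed-size memcpy() calls, which is my substitution to sidestep alignment and aliasing questions, not what the patch does; copy_lt16 is a hypothetical name:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n < 16 bytes by decomposing n into its binary digits. */
static void copy_lt16(uint8_t *dst, const uint8_t *src, size_t n)
{
	assert(n < 16);
	if (n & 0x01) {
		memcpy(dst, src, 1);
		src += 1;
		dst += 1;
	}
	if (n & 0x02) {
		memcpy(dst, src, 2);
		src += 2;
		dst += 2;
	}
	if (n & 0x04) {
		memcpy(dst, src, 4);
		src += 4;
		dst += 4;
	}
	if (n & 0x08)
		memcpy(dst, src, 8);
}
```

At most four moves are issued for any n in [0, 15], and n == 0 falls through without touching memory.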
> +
> +static inline void *
> +rte_memcpy_internal(void *dst, const void *src, size_t n)
> +{
> +	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> +		return rte_memcpy_aligned(dst, src, n);
> +	else
> +		return rte_memcpy_generic(dst, src, n);
> +}
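For reference, the dispatch test in rte_memcpy_internal() ORs the two addresses so that a single mask check covers both pointers. A minimal sketch of that predicate, with both_aligned as a hypothetical name and the mask passed in rather than taken from ALIGNMENT_MASK:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Return 1 when both pointers are multiples of (mask + 1), where
 * mask + 1 is a power of two (for example 0x1F for 32-byte alignment).
 */
static int both_aligned(const void *dst, const void *src, uintptr_t mask)
{
	/* A low bit set in either address survives the OR. */
	return (((uintptr_t)dst | (uintptr_t)src) & mask) == 0;
}
```

With the masks defined in this file (0x0F for SSE/AVX and 0x1F for AVX2, as quoted above), a pointer pair that qualifies for the SSE aligned path may still take the generic path in the AVX2 build.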
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_MEMCPY_INTERNAL_X86_64_H_ */
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
> new file mode 100644
> index 0000000..2532696
> --- /dev/null
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
> @@ -0,0 +1,585 @@
> +/*-
> + *   BSD LICENSE
> + *
> + *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
> + *   All rights reserved.
> + *
> + *   Redistribution and use in source and binary forms, with or without
> + *   modification, are permitted provided that the following conditions
> + *   are met:
> + *
> + *     * Redistributions of source code must retain the above copyright
> + *       notice, this list of conditions and the following disclaimer.
> + *     * Redistributions in binary form must reproduce the above copyright
> + *       notice, this list of conditions and the following disclaimer in
> + *       the documentation and/or other materials provided with the
> + *       distribution.
> + *     * Neither the name of Intel Corporation nor the names of its
> + *       contributors may be used to endorse or promote products derived
> + *       from this software without specific prior written permission.
> + *
> + *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
> + *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
> + *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
> + *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
> + *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
> + *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
> + *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
> + *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
> + *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
> + *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
> + *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
> + */
> +
> +#include <rte_memcpy.h>
> +
> +/**
> + * Macro for copying unaligned block from one location to another with constant load offset,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
> +__extension__ ({                                                                                            \
> +    size_t tmp;                                                                                             \
> +    while (len >= 128 + 16 - offset) {                                                                      \
> +        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
> +        len -= 128;                                                                                         \
> +        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
> +        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
> +        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
> +        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
> +        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
> +        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
> +        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
> +        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
> +        src = (const uint8_t *)src + 128;                                                                   \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
> +        dst = (uint8_t *)dst + 128;                                                                         \
> +    }                                                                                                       \
> +    tmp = len;                                                                                              \
> +    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
> +    tmp -= len;                                                                                             \
> +    src = (const uint8_t *)src + tmp;                                                                       \
> +    dst = (uint8_t *)dst + tmp;                                                                             \
> +    if (len >= 32 + 16 - offset) {                                                                          \
> +        while (len >= 32 + 16 - offset) {                                                                   \
> +            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
> +            len -= 32;                                                                                      \
> +            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
> +            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
> +            src = (const uint8_t *)src + 32;                                                                \
> +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
> +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
> +            dst = (uint8_t *)dst + 32;                                                                      \
> +        }                                                                                                   \
> +        tmp = len;                                                                                          \
> +        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
> +        tmp -= len;                                                                                         \
> +        src = (const uint8_t *)src + tmp;                                                                   \
> +        dst = (uint8_t *)dst + tmp;                                                                         \
> +    }                                                                                                       \
> +})
> +
> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Use switch here because the aligning instruction requires immediate value for shift count.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be within [1, 15]
> + * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
> +__extension__ ({                                                      \
> +    switch (offset) {                                                 \
> +    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
> +    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
> +    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
> +    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
> +    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
> +    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
> +    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
> +    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
> +    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
> +    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
> +    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
> +    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
> +    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
> +    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
> +    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
> +    default:;                                                         \
> +    }                                                                 \
> +})
> +
> +void *
> +rte_memcpy_sse(void *dst, const void *src, size_t n)
> +{
> +	if (!(((uintptr_t)dst | (uintptr_t)src) & 0x0F)) {
> +		void *ret = dst;
> +
> +		/* Copy size < 16 bytes */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dst = *(const uint8_t *)src;
> +				src = (const uint8_t *)src + 1;
> +				dst = (uint8_t *)dst + 1;
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dst = *(const uint16_t *)src;
> +				src = (const uint16_t *)src + 1;
> +				dst = (uint16_t *)dst + 1;
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dst = *(const uint32_t *)src;
> +				src = (const uint32_t *)src + 1;
> +				dst = (uint32_t *)dst + 1;
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dst = *(const uint64_t *)src;
> +
> +			return ret;
> +		}
> +
> +		/* Copy 16 <= size <= 32 bytes */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 32 < size <= 64 bytes */
> +		if (n <= 64) {
> +			__m128i xmm0, xmm1, xmm2, xmm3;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 32 + n));
> +			xmm3 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 32 + n), xmm2);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm3);
> +
> +			return ret;
> +		}
> +
> +		/* Copy 64 bytes blocks */
> +		for (; n >= 64; n -= 64) {
> +			__m128i xmm0, xmm1, xmm2, xmm3;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 2*16));
> +			xmm3 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 3*16));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 2*16), xmm2);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 3*16), xmm3);
> +			dst = (uint8_t *)dst + 64;
> +			src = (const uint8_t *)src + 64;
> +		}
> +
> +		/* Copy whatever left */
> +		__m128i xmm0, xmm1, xmm2, xmm3;
> +		xmm0 = _mm_loadu_si128((const __m128i *)
> +			((const uint8_t *)src - 64 + n));
> +		xmm1 = _mm_loadu_si128((const __m128i *)
> +			((const uint8_t *)src - 48 + n));
> +		xmm2 = _mm_loadu_si128((const __m128i *)
> +			((const uint8_t *)src - 32 + n));
> +		xmm3 = _mm_loadu_si128((const __m128i *)
> +			((const uint8_t *)src - 16 + n));
> +		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 64 + n), xmm0);
> +		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 48 + n), xmm1);
> +		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 32 + n), xmm2);
> +		_mm_storeu_si128((__m128i *)((uint8_t *)dst - 16 + n), xmm3);
> +
> +		return ret;
> +	} else {
> +		__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> +		uintptr_t dstu = (uintptr_t)dst;
> +		uintptr_t srcu = (uintptr_t)src;
> +		void *ret = dst;
> +		size_t dstofss;
> +		size_t srcofs;
> +
> +		/**
> +		 * Copy less than 16 bytes
> +		 */
> +		if (n < 16) {
> +			if (n & 0x01) {
> +				*(uint8_t *)dstu = *(const uint8_t *)srcu;
> +				srcu = (uintptr_t)((const uint8_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint8_t *)dstu + 1);
> +			}
> +			if (n & 0x02) {
> +				*(uint16_t *)dstu = *(const uint16_t *)srcu;
> +				srcu = (uintptr_t)((const uint16_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint16_t *)dstu + 1);
> +			}
> +			if (n & 0x04) {
> +				*(uint32_t *)dstu = *(const uint32_t *)srcu;
> +				srcu = (uintptr_t)((const uint32_t *)srcu + 1);
> +				dstu = (uintptr_t)((uint32_t *)dstu + 1);
> +			}
> +			if (n & 0x08)
> +				*(uint64_t *)dstu = *(const uint64_t *)srcu;
> +			return ret;
> +		}
> +
> +		/**
> +		 * Fast way when copy size doesn't exceed 512 bytes
> +		 */
> +		if (n <= 32) {
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm1);
> +			return ret;
> +		}
> +		if (n <= 48) {
> +			__m128i xmm0, xmm1, xmm2;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm2);
> +			return ret;
> +		}
> +		if (n <= 64) {
> +			__m128i xmm0, xmm1, xmm2, xmm3;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			xmm2 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 32));
> +			xmm3 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src - 16 + n));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 32), xmm2);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst - 16 + n), xmm3);
> +			return ret;
> +		}
> +		if (n <= 128)
> +			goto COPY_BLOCK_128_BACK15;
> +		if (n <= 512) {
> +			if (n >= 256) {
> +				n -= 256;
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 3*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 4*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 5*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 4*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 5*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 6*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 7*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 6*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 7*16), xmm1);
> +
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 3*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 4*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 5*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 4*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 5*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 6*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 128 + 7*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 6*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 128 + 7*16), xmm1);
> +				src = (const uint8_t *)src + 256;
> +				dst = (uint8_t *)dst + 256;
> +			}
> +COPY_BLOCK_255_BACK15:
> +			if (n >= 128) {
> +				n -= 128;
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 3*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 4*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 5*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 4*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 5*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 6*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 7*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 6*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 7*16), xmm1);
> +				src = (const uint8_t *)src + 128;
> +				dst = (uint8_t *)dst + 128;
> +			}
> +COPY_BLOCK_128_BACK15:
> +			if (n >= 64) {
> +				n -= 64;
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 3*16), xmm1);
> +				src = (const uint8_t *)src + 64;
> +				dst = (uint8_t *)dst + 64;
> +			}
> +COPY_BLOCK_64_BACK15:
> +			if (n >= 32) {
> +				n -= 32;
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +				src = (const uint8_t *)src + 32;
> +				dst = (uint8_t *)dst + 32;
> +			}
> +			if (n > 16) {
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src - 16 + n));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst - 16 + n), xmm1);
> +				return ret;
> +			}
> +			if (n > 0) {
> +				__m128i xmm0;
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src - 16 + n));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst - 16 + n), xmm0);
> +			}
> +			return ret;
> +		}
> +
> +		/**
> +		 * Make store aligned when copy size exceeds 512 bytes,
> +		 * and make sure the first 15 bytes are copied, because
> +		 * unaligned copy functions require up to 15 bytes
> +		 * backwards access.
> +		 */
> +		dstofss = (uintptr_t)dst & 0x0F;
> +		if (dstofss > 0) {
> +			dstofss = 16 - dstofss + 16;
> +			n -= dstofss;
> +			__m128i xmm0, xmm1;
> +			xmm0 = _mm_loadu_si128((const __m128i *)src);
> +			xmm1 = _mm_loadu_si128((const __m128i *)
> +				((const uint8_t *)src + 16));
> +			_mm_storeu_si128((__m128i *)dst, xmm0);
> +			_mm_storeu_si128((__m128i *)
> +				((uint8_t *)dst + 16), xmm1);
> +			src = (const uint8_t *)src + dstofss;
> +			dst = (uint8_t *)dst + dstofss;
> +		}
> +		srcofs = ((uintptr_t)src & 0x0F);
> +
> +		/**
> +		 * For aligned copy
> +		 */
> +		if (srcofs == 0) {
> +			/**
> +			 * Copy 256-byte blocks
> +			 */
> +			for (; n >= 256; n -= 256) {
> +				__m128i xmm0, xmm1;
> +				xmm0 = _mm_loadu_si128((const __m128i *)src);
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 16));
> +				_mm_storeu_si128((__m128i *)dst, xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 2*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 3*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 2*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 3*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 4*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 5*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 4*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 5*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 6*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 7*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 6*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 7*16), xmm1);
> +
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 8*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 9*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 8*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 9*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 10*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 11*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 10*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 11*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 12*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 13*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 12*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 13*16), xmm1);
> +				xmm0 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 14*16));
> +				xmm1 = _mm_loadu_si128((const __m128i *)
> +					((const uint8_t *)src + 15*16));
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 14*16), xmm0);
> +				_mm_storeu_si128((__m128i *)
> +					((uint8_t *)dst + 15*16), xmm1);
> +				dst = (uint8_t *)dst + 256;
> +				src = (const uint8_t *)src + 256;
> +			}
> +
> +			/**
> +			 * Copy whatever left
> +			 */
> +			goto COPY_BLOCK_255_BACK15;
> +		}
> +
> +		/**
> +		 * For copy with unaligned load
> +		 */
> +		MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> +
> +		/**
> +		 * Copy whatever left
> +		 */
> +		goto COPY_BLOCK_64_BACK15;
> +	}
> +}
> diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
> index 90bca4d..88d3298 100644
> --- a/lib/librte_eal/linuxapp/eal/Makefile
> +++ b/lib/librte_eal/linuxapp/eal/Makefile
> @@ -40,6 +40,7 @@ VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
>  LIBABIVER := 5
> 
>  VPATH += $(RTE_SDK)/lib/librte_eal/common
> +VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
> 
>  CFLAGS += -I$(SRCDIR)/include
>  CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
> @@ -105,6 +106,22 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_service.c
>  SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_cpuflags.c
>  SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
> 
> +# for run-time dispatch of memcpy
> +SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
> +SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
> +
> +# if the compiler supports AVX512, add avx512 file
> +ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
> +SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
> +CFLAGS_rte_memcpy_avx512f.o += -mavx512f
> +endif
> +
> +# if the compiler supports AVX2, add avx2 file
> +ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
> +SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
> +CFLAGS_rte_memcpy_avx2.o += -mavx2
> +endif
> +
>  CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
> 
>  CFLAGS_eal.o := -D_GNU_SOURCE
> diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
> index a813c91..8a7a1e7 100644
> --- a/mk/rte.cpuflags.mk
> +++ b/mk/rte.cpuflags.mk
> @@ -134,6 +134,20 @@ endif
> 
>  MACHINE_CFLAGS += $(addprefix -DRTE_MACHINE_CPUFLAG_,$(CPUFLAGS))
> 
> +# Check if the compiler supports AVX512
> +CC_SUPPORT_AVX512F := $(shell $(CC) -mavx512f -dM -E - < /dev/null 2>&1 | grep -q AVX512 && echo 1)
> +ifeq ($(CC_SUPPORT_AVX512F),1)
> +ifeq ($(CONFIG_RTE_ENABLE_AVX512),y)
> +MACHINE_CFLAGS += -DCC_SUPPORT_AVX512F
> +endif
> +endif
> +
> +# Check if the compiler supports AVX2
> +CC_SUPPORT_AVX2 := $(shell $(CC) -mavx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
> +ifeq ($(CC_SUPPORT_AVX2),1)
> +MACHINE_CFLAGS += -DCC_SUPPORT_AVX2
> +endif
> +
>  # To strip whitespace
>  comma:= ,
>  empty:=
> --
> 2.7.4
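
For readers following the size-class branches in rte_memcpy_sse() above: the
"16 <= size <= 32" case copies the first 16 bytes and the last 16 bytes, and
lets the two moves overlap in the middle instead of looping over a remainder.
Below is a minimal portable sketch of that idea only. The names copy16() and
copy_16_to_32() are hypothetical and not part of the patch, and a fixed-size
memcpy() stands in for the _mm_loadu_si128/_mm_storeu_si128 pair. As in the
patch, src and dst are assumed not to overlap each other.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: move exactly 16 bytes (stand-in for one XMM load/store). */
static inline void copy16(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 16);
}

/*
 * Copy n bytes, 16 <= n <= 32, with two 16-byte moves.
 * The second move is anchored at the end of the buffer, so for n < 32
 * it rewrites some bytes the first move already copied. That is harmless
 * because both moves write the same source data.
 */
static void *copy_16_to_32(void *dst, const void *src, size_t n)
{
	assert(n >= 16 && n <= 32);
	copy16((uint8_t *)dst, (const uint8_t *)src);
	copy16((uint8_t *)dst + n - 16, (const uint8_t *)src + n - 16);
	return dst;
}
```

The same anchoring trick appears in the "Copy whatever left" tails of the
patch, where the final loads and stores are addressed as src - 16 + n and
dst - 16 + n.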

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
@ 2017-10-02 16:52     ` Ananyev, Konstantin
  2017-10-03  8:15       ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-02 16:52 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev



> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Monday, October 2, 2017 5:13 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun <xiaoyun.li@intel.com>
> Subject: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
> 
> This patch dynamically selects x86 EFD functions at run-time.
> This patch uses a function pointer and binds it to the appropriate
> function based on CPU flags at constructor time.
> 
> Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> ---
>  lib/librte_efd/rte_efd_x86.h | 41 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 38 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
> index 34f37d7..93b6743 100644
> --- a/lib/librte_efd/rte_efd_x86.h
> +++ b/lib/librte_efd/rte_efd_x86.h
> @@ -43,12 +43,29 @@
>  #define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
>  #endif
> 
> +typedef efd_value_t
> +(*efd_lookup_internal_avx2_t)(const efd_hashfunc_t *group_hash_idx,
> +		const efd_lookuptbl_t *group_lookup_table,
> +		const uint32_t hash_val_a, const uint32_t hash_val_b);
> +
> +static efd_lookup_internal_avx2_t efd_lookup_internal_avx2_ptr;
> +
>  static inline efd_value_t
>  efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
>  		const efd_lookuptbl_t *group_lookup_table,
>  		const uint32_t hash_val_a, const uint32_t hash_val_b)
>  {
> -#ifdef RTE_MACHINE_CPUFLAG_AVX2
> +	return (*efd_lookup_internal_avx2_ptr)(group_hash_idx,
> +					       group_lookup_table,
> +					       hash_val_a, hash_val_b);

I don't think you need all that.
All you need is to build a proper AVX2 function even if the current HW doesn't support it.
The existing runtime selection here seems ok already.
Konstantin

> +}
> +
> +#ifdef CC_SUPPORT_AVX2
> +static inline efd_value_t
> +efd_lookup_internal_avx2_AVX2(const efd_hashfunc_t *group_hash_idx,
> +		const efd_lookuptbl_t *group_lookup_table,
> +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> +{
>  	efd_value_t value = 0;
>  	uint32_t i = 0;
>  	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
> @@ -74,13 +91,31 @@ efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
>  	}
> 
>  	return value;
> -#else
> +}
> +#endif
> +
> +static inline efd_value_t
> +efd_lookup_internal_avx2_DEFAULT(const efd_hashfunc_t *group_hash_idx,
> +		const efd_lookuptbl_t *group_lookup_table,
> +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> +{
>  	RTE_SET_USED(group_hash_idx);
>  	RTE_SET_USED(group_lookup_table);
>  	RTE_SET_USED(hash_val_a);
>  	RTE_SET_USED(hash_val_b);
>  	/* Return dummy value, only to avoid compilation breakage */
>  	return 0;
> -#endif
> +}
> 
> +static void __attribute__((constructor))
> +rte_efd_x86_init(void)
> +{
> +#ifdef CC_SUPPORT_AVX2
> +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> +		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_AVX2;
> +	else
> +		efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> +#else
> +	efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> +#endif
>  }
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-02 16:39     ` Ananyev, Konstantin
@ 2017-10-02 23:10       ` Li, Xiaoyun
  2017-10-03 11:15         ` Ananyev, Konstantin
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-02 23:10 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

Hi

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, October 3, 2017 00:39
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org
> Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> 
> 
> 
> > -----Original Message-----
> > From: Li, Xiaoyun
> > Sent: Monday, October 2, 2017 5:13 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun <xiaoyun.li@intel.com>
> > Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> >
> > This patch dynamically selects memcpy functions at run-time based
> > on the CPU flags that the current machine supports. This patch uses function
> > pointers which are bound to the appropriate functions at constructor time.
> > In addition, the AVX512 instruction set is compiled only if users
> > enable it in the config and the compiler supports it.
> >
> > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > ---
> > v2
> > * Use gcc function multi-versioning to avoid compilation issues.
> > * Add macros for AVX512 and AVX2. Only if users enable AVX512 and the
> > compiler supports it, the AVX512 codes would be compiled. Only if the
> > compiler supports AVX2, the AVX2 codes would be compiled.
> >
> > v3
> > * Reduce function calls by keeping only rte_memcpy_xxx.
> > * Add conditions so that when the copy size is small, the inline code path
> > is used. Otherwise, the dynamic code path is used.
> > * To support attribute target, clang version must be greater than 3.7.
> > Otherwise, the SSE/AVX code path is chosen, the same as before.
> > * Move two macro functions to the top of the code since they are
> > used in both the inline SSE/AVX and the dynamic SSE/AVX code.
> >
> > v4
> > * Split rte_memcpy.h into several .c files and modify the makefiles to
> > compile the AVX2 and AVX512 files.
> 
> Could you explain to me why, instead of reusing the existing rte_memcpy() code
> to generate _sse/_avx2/_avx512f flavors, you keep pushing changes with 3
> separate implementations?
> Obviously that is much more expensive in terms of maintenance and doesn't
> look like a feasible solution to me.
> Is the existing rte_memcpy() implementation not good enough in terms of
> functionality and/or performance?
> If so, can you outline these problems and try to fix them first?
> Konstantin
> 

I only merged the many small helper functions into one function in each of those
3 separate implementations. The existing code is fully inline, including
rte_memcpy() itself, so the compiler expands every rte_memcpy() call into
basic code such as xmm0=xxx.

The existing code is OK when used that way. With run-time dispatch, however,
it would bring lots of function calls and cause a perf drop.
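
To illustrate the trade-off described above, here is a minimal sketch of
constructor-time dispatch combined with an inline path for small copies, as
the v3 changelog describes. It is an illustration under assumptions, not the
patch's code: copy_generic(), copy_wide(), my_memcpy() and the 128-byte
INLINE_COPY_MAX threshold are all hypothetical, and both back ends simply call
memcpy() as stand-ins for ISA-specific implementations. The CPU check uses the
GCC/clang builtin __builtin_cpu_supports() rather than DPDK's
rte_cpu_get_flag_enabled().

```c
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/* Stand-ins for the per-ISA implementations (hypothetical names). */
static void *copy_generic(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_wide(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Bound once at load time; defaults to the baseline version. */
static copy_fn_t copy_ptr = copy_generic;

static void __attribute__((constructor))
copy_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		copy_ptr = copy_wide;
#endif
}

/* Hypothetical threshold below which the indirect call is avoided. */
#define INLINE_COPY_MAX 128

static inline void *my_memcpy(void *dst, const void *src, size_t n)
{
	if (n <= INLINE_COPY_MAX) {
		/* Small copy: stay inline, no function-pointer call. */
		unsigned char *d = dst;
		const unsigned char *s = src;
		size_t i;

		for (i = 0; i < n; i++)
			d[i] = s[i];
		return dst;
	}
	/* Large copy: one indirect call, amortized over many bytes. */
	return copy_ptr(dst, src, n);
}
```

The point of the size check is that an indirect call cannot be inlined, so it
is paid only where the copy itself is long enough to hide it.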


Best Regards,
Xiaoyun Li

 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-02 16:52     ` Ananyev, Konstantin
@ 2017-10-03  8:15       ` Li, Xiaoyun
  2017-10-03 11:23         ` Ananyev, Konstantin
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-03  8:15 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

Hi

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, October 3, 2017 00:52
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org
> Subject: RE: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
> 
> 
> 
> > -----Original Message-----
> > From: Li, Xiaoyun
> > Sent: Monday, October 2, 2017 5:13 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> > Bruce <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun
> > <xiaoyun.li@intel.com>
> > Subject: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
> >
> > This patch dynamically selects x86 EFD functions at run-time.
> > This patch uses a function pointer and binds it to the appropriate function
> > based on CPU flags at constructor time.
> >
> > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > ---
> >  lib/librte_efd/rte_efd_x86.h | 41 ++++++++++++++++++++++++++++++++++++++---
> >  1 file changed, 38 insertions(+), 3 deletions(-)
> >
> > diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
> > index 34f37d7..93b6743 100644
> > --- a/lib/librte_efd/rte_efd_x86.h
> > +++ b/lib/librte_efd/rte_efd_x86.h
> > @@ -43,12 +43,29 @@
> >  #define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
> >  #endif
> >
> > +typedef efd_value_t
> > +(*efd_lookup_internal_avx2_t)(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b);
> > +
> > +static efd_lookup_internal_avx2_t efd_lookup_internal_avx2_ptr;
> > +
> >  static inline efd_value_t
> >  efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> >  		const efd_lookuptbl_t *group_lookup_table,
> >  		const uint32_t hash_val_a, const uint32_t hash_val_b)
> >  {
> > -#ifdef RTE_MACHINE_CPUFLAG_AVX2
> > +	return (*efd_lookup_internal_avx2_ptr)(group_hash_idx,
> > +					       group_lookup_table,
> > +					       hash_val_a, hash_val_b);
> 
> I don't think you need all that.
> All you need is to build a proper AVX2 function even if the current HW doesn't
> support it.
> The existing runtime selection here seems ok already.
> Konstantin
> 

Sorry, I don't quite understand. Do you mean the EFD code doesn't need to change here?
I'm not depending on the HW. CC_SUPPORT_AVX2 only means that the compiler supports AVX2, since the selection happens at run time.
The existing RTE_MACHINE_CPUFLAG_AVX2 means that both the compiler and the HW support AVX2.
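
One possible reading of the suggestion under discussion ("build proper avx2
function even if current HW doesn't support it") is to compile the AVX2
variant with a per-function target attribute and keep a run-time CPU check in
front of it. The sketch below shows that pattern only; it is not the EFD code.
sum_u32(), sum_u32_avx2() and sum_u32_scalar() are hypothetical stand-ins for
the lookup functions, and __builtin_cpu_supports() is the GCC/clang builtin,
used here in place of rte_cpu_get_flag_enabled(). It assumes a compiler that
accepts __attribute__((target("avx2"))) without -mavx2 on the command line.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>

/* Built with AVX2 enabled for this function only, whatever -march says. */
__attribute__((target("avx2")))
static uint32_t sum_u32_avx2(const uint32_t *v, size_t n)
{
	__m256i acc = _mm256_setzero_si256();
	uint32_t lanes[8];
	uint32_t sum = 0;
	size_t i = 0;
	int k;

	/* Accumulate 8 lanes at a time. */
	for (; i + 8 <= n; i += 8)
		acc = _mm256_add_epi32(acc,
			_mm256_loadu_si256((const __m256i *)(v + i)));

	/* Reduce the 8 lanes, then add the scalar tail. */
	_mm256_storeu_si256((__m256i *)lanes, acc);
	for (k = 0; k < 8; k++)
		sum += lanes[k];
	for (; i < n; i++)
		sum += v[i];
	return sum;
}
#endif

/* Baseline version, always available. */
static uint32_t sum_u32_scalar(const uint32_t *v, size_t n)
{
	uint32_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += v[i];
	return sum;
}

/* Run-time selection: the AVX2 body only executes on CPUs that have it. */
static uint32_t sum_u32(const uint32_t *v, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
	if (__builtin_cpu_supports("avx2"))
		return sum_u32_avx2(v, n);
#endif
	return sum_u32_scalar(v, n);
}
```

With this layout the build machine's own CPU flags do not matter: only the
compiler's ability to emit AVX2 and the running CPU's ability to execute it.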


Best Regards,
Xiaoyun Li


> > +}
> > +
> > +#ifdef CC_SUPPORT_AVX2
> > +static inline efd_value_t
> > +efd_lookup_internal_avx2_AVX2(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> > +{
> >  	efd_value_t value = 0;
> >  	uint32_t i = 0;
> >  	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
> > @@ -74,13 +91,31 @@ efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> >  	}
> >
> >  	return value;
> > -#else
> > +}
> > +#endif
> > +
> > +static inline efd_value_t
> > +efd_lookup_internal_avx2_DEFAULT(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> > +{
> >  	RTE_SET_USED(group_hash_idx);
> >  	RTE_SET_USED(group_lookup_table);
> >  	RTE_SET_USED(hash_val_a);
> >  	RTE_SET_USED(hash_val_b);
> >  	/* Return dummy value, only to avoid compilation breakage */
> >  	return 0;
> > -#endif
> > +}
> >
> > +static void __attribute__((constructor))
> > +rte_efd_x86_init(void)
> > +{
> > +#ifdef CC_SUPPORT_AVX2
> > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > +		efd_lookup_internal_avx2_ptr =
> efd_lookup_internal_avx2_AVX2;
> > +	else
> > +		efd_lookup_internal_avx2_ptr =
> efd_lookup_internal_avx2_DEFAULT;
> > +#else
> > +	efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> > +#endif
> >  }
> > --
> > 2.7.4


* Re: [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-02 23:10       ` Li, Xiaoyun
@ 2017-10-03 11:15         ` Ananyev, Konstantin
  2017-10-03 11:39           ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-03 11:15 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

Hi,

> 
> Hi
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Tuesday, October 3, 2017 00:39
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org
> > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> >
> >
> >
> > > -----Original Message-----
> > > From: Li, Xiaoyun
> > > Sent: Monday, October 2, 2017 5:13 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> > Bruce <bruce.richardson@intel.com>
> > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun <xiaoyun.li@intel.com>
> > > Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> > >
> > > This patch dynamically selects functions of memcpy at run-time based
> > > on CPU flags that current machine supports. This patch uses function
> > > pointers which are bind to the relative functions at constrctor time.
> > > In addition, AVX512 instructions set would be compiled only if users
> > > config it enabled and the compiler supports it.
> > >
> > > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > > ---
> > > v2
> > > * Use gcc function multi-versioning to avoid compilation issues.
> > > * Add macros for AVX512 and AVX2. Only if users enable AVX512 and the
> > > compiler supports it, the AVX512 codes would be compiled. Only if the
> > > compiler supports AVX2, the AVX2 codes would be compiled.
> > >
> > > v3
> > > * Reduce function calls via only keep rte_memcpy_xxx.
> > > * Add conditions that when copy size is small, use inline code path.
> > > Otherwise, use dynamic code path.
> > > * To support attribute target, clang version must be greater than 3.7.
> > > Otherwise, would choose SSE/AVX code path, the same as before.
> > > * Move two mocro functions to the top of the code since they would be
> > > used in inline SSE/AVX and dynamic SSE/AVX codes.
> > >
> > > v4
> > > * Modify rte_memcpy.h to several .c files and modify makefiles to compile
> > > AVX2 and AVX512 files.
> >
> > Could you explain to me why instead of reusing existing rte_memcpy() code
> > to generate _sse/_avx2/ax512f flavors you keep pushing changes with 3
> > separate implementations?
> > Obviously that is much more expensive in terms of maintenance and doesn't
> > look like
> > feasible solution to me.
> > Is existing rte_memcpy() implementation is not good enough in terms of
> > functionality and/or performance?
> > If so, can you outline these problems and try to fix them first.
> > Konstantin
> >
> 
> I just change many small functions to one function in those 3 separate functions.

Yes, so with what you suggest we'll have 4 implementations of rte_memcpy to support.
That's very expensive in terms of maintenance and, I believe, totally unnecessary.

> Because the existing codes are totally inline, including rte_memcpy() itself. So the compilation will
> change all rte_memcpy() calls into the basic codes like xmm0=xxx.
> 
> The existing codes in this way are OK. 

Good.

>But when run-time, it will bring lots of function calls
> and cause perf drop.

I believe it wouldn't if we do it properly.
All internal functions (mov16, mov32, etc.) will still be inlined by the compiler for each flavor (sse/avx2/etc.) -
have a look at the patch I sent.

Konstantin


* Re: [dpdk-dev] [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-03  8:15       ` Li, Xiaoyun
@ 2017-10-03 11:23         ` Ananyev, Konstantin
  2017-10-03 11:27           ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-03 11:23 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev



> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Tuesday, October 3, 2017 9:15 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>; dev@dpdk.org
> Subject: RE: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
> 
> Hi
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Tuesday, October 3, 2017 00:52
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org
> > Subject: RE: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
> >
> >
> >
> > > -----Original Message-----
> > > From: Li, Xiaoyun
> > > Sent: Monday, October 2, 2017 5:13 PM
> > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> > > Bruce <bruce.richardson@intel.com>
> > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun
> > > <xiaoyun.li@intel.com>
> > > Subject: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
> > >
> > > This patch dynamically selects x86 EFD functions at run-time.
> > > This patch uses function pointer and binds it to the relative function
> > > based on CPU flags at constructor time.
> > >
> > > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > > ---
> > >  lib/librte_efd/rte_efd_x86.h | 41
> > > ++++++++++++++++++++++++++++++++++++++---
> > >  1 file changed, 38 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/lib/librte_efd/rte_efd_x86.h
> > > b/lib/librte_efd/rte_efd_x86.h index 34f37d7..93b6743 100644
> > > --- a/lib/librte_efd/rte_efd_x86.h
> > > +++ b/lib/librte_efd/rte_efd_x86.h
> > > @@ -43,12 +43,29 @@
> > >  #define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)  #endif
> > >
> > > +typedef efd_value_t
> > > +(*efd_lookup_internal_avx2_t)(const efd_hashfunc_t *group_hash_idx,
> > > +		const efd_lookuptbl_t *group_lookup_table,
> > > +		const uint32_t hash_val_a, const uint32_t hash_val_b);
> > > +
> > > +static efd_lookup_internal_avx2_t efd_lookup_internal_avx2_ptr;
> > > +
> > >  static inline efd_value_t
> > >  efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> > >  		const efd_lookuptbl_t *group_lookup_table,
> > >  		const uint32_t hash_val_a, const uint32_t hash_val_b)  { -
> > #ifdef
> > > RTE_MACHINE_CPUFLAG_AVX2
> > > +	return (*efd_lookup_internal_avx2_ptr)(group_hash_idx,
> > > +					       group_lookup_table,
> > > +					       hash_val_a, hash_val_b);
> >
> > I don't think you need all that.
> > All you need - build proper avx2 function even if current HW doesn't support
> > it.
> > The existing runtime selection here seems ok already.
> > Konstantin
> >
> 
> Sorry, not quite understand here. So don't need to change codes of efd here?
> I didn't care about the HW. CC_SUPPORT_AVX2 only means the compiler supports AVX2 since would runtime selection.
> The existing codes RTE_MACHINE_CPUFLAG_AVX2 means both the compiler and HW supports AVX2.

What I am saying is that you don't need all these dances with an extra function pointer.
All you need is to move efd_lookup_internal_avx2() into a .c file and make sure it gets compiled with the -mavx2 flag.
Then in rte_efd_create() select AVX2 only when both the HW and the compiler support AVX2:

...
#ifdef CC_SUPPORT_AVX2
	if (RTE_EFD_VALUE_NUM_BITS > 3 && rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
		table->lookup_fn = EFD_LOOKUP_AVX2;
	else
#endif
...
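
To make the shape of that selection concrete, here is a minimal standalone sketch. It is not the EFD code: every demo_* and DEMO_* name is hypothetical, DEMO_CC_SUPPORT_AVX2 stands in for a build-system macro such as CC_SUPPORT_AVX2, and the GCC/clang builtin __builtin_cpu_supports() stands in for rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2).

```c
/* Hypothetical stand-ins for the EFD names discussed above. */
#define DEMO_VALUE_NUM_BITS 8

enum demo_lookup_fn { DEMO_LOOKUP_SCALAR = 0, DEMO_LOOKUP_AVX2 };

struct demo_table {
	enum demo_lookup_fn lookup_fn;
};

/*
 * Assumption: a real build would define this from a compiler check
 * (does the compiler accept -mavx2?). It is derived from predefined
 * macros here only so that the sketch builds on its own.
 */
#if (defined(__x86_64__) || defined(__i386__)) && \
	(defined(__GNUC__) || defined(__clang__))
#define DEMO_CC_SUPPORT_AVX2 1
#endif

/* Run-time HW check; returns non-zero only if the running CPU has AVX2. */
static int
demo_hw_has_avx2(void)
{
#ifdef DEMO_CC_SUPPORT_AVX2
	return __builtin_cpu_supports("avx2") != 0;
#else
	return 0;
#endif
}

/* Select AVX2 only when both the compiler and the HW support it. */
static void
demo_select_lookup(struct demo_table *table)
{
#ifdef DEMO_CC_SUPPORT_AVX2
	if (DEMO_VALUE_NUM_BITS > 3 && demo_hw_has_avx2())
		table->lookup_fn = DEMO_LOOKUP_AVX2;
	else
#endif
		table->lookup_fn = DEMO_LOOKUP_SCALAR;
}
```

The compile-time macro only decides whether the AVX2 branch exists in the binary; the choice between the two lookup kinds is made once, when the table is created, so no extra function pointer is needed on the lookup path.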

Konstantin

> 
> 
> Best Regards,
> Xiaoyun Li
> 
> 
> > > +}
> > > +
> > > +#ifdef CC_SUPPORT_AVX2
> > > +static inline efd_value_t
> > > +efd_lookup_internal_avx2_AVX2(const efd_hashfunc_t *group_hash_idx,
> > > +		const efd_lookuptbl_t *group_lookup_table,
> > > +		const uint32_t hash_val_a, const uint32_t hash_val_b) {
> > >  	efd_value_t value = 0;
> > >  	uint32_t i = 0;
> > >  	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a); @@ -
> > 74,13
> > > +91,31 @@ efd_lookup_internal_avx2(const efd_hashfunc_t
> > *group_hash_idx,
> > >  	}
> > >
> > >  	return value;
> > > -#else
> > > +}
> > > +#endif
> > > +
> > > +static inline efd_value_t
> > > +efd_lookup_internal_avx2_DEFAULT(const efd_hashfunc_t
> > *group_hash_idx,
> > > +		const efd_lookuptbl_t *group_lookup_table,
> > > +		const uint32_t hash_val_a, const uint32_t hash_val_b) {
> > >  	RTE_SET_USED(group_hash_idx);
> > >  	RTE_SET_USED(group_lookup_table);
> > >  	RTE_SET_USED(hash_val_a);
> > >  	RTE_SET_USED(hash_val_b);
> > >  	/* Return dummy value, only to avoid compilation breakage */
> > >  	return 0;
> > > -#endif
> > > +}
> > >
> > > +static void __attribute__((constructor))
> > > +rte_efd_x86_init(void)
> > > +{
> > > +#ifdef CC_SUPPORT_AVX2
> > > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > > +		efd_lookup_internal_avx2_ptr =
> > efd_lookup_internal_avx2_AVX2;
> > > +	else
> > > +		efd_lookup_internal_avx2_ptr =
> > efd_lookup_internal_avx2_DEFAULT;
> > > +#else
> > > +	efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> > > +#endif
> > >  }
> > > --
> > > 2.7.4


* Re: [dpdk-dev] [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-03 11:23         ` Ananyev, Konstantin
@ 2017-10-03 11:27           ` Li, Xiaoyun
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-03 11:27 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

OK.

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, October 3, 2017 19:23
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org
> Subject: RE: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions
> 
> 
> 
> > -----Original Message-----
> > From: Li, Xiaoyun
> > Sent: Tuesday, October 3, 2017 9:15 AM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> > Bruce <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org
> > Subject: RE: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD
> > functions
> >
> > Hi
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Tuesday, October 3, 2017 00:52
> > > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > > <bruce.richardson@intel.com>
> > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; dev@dpdk.org
> > > Subject: RE: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD
> > > functions
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Li, Xiaoyun
> > > > Sent: Monday, October 2, 2017 5:13 PM
> > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > Richardson, Bruce <bruce.richardson@intel.com>
> > > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > > <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun
> > > > <xiaoyun.li@intel.com>
> > > > Subject: [PATCH v4 3/3] efd: run-time dispatch over x86 EFD
> > > > functions
> > > >
> > > > This patch dynamically selects x86 EFD functions at run-time.
> > > > This patch uses function pointer and binds it to the relative
> > > > function based on CPU flags at constructor time.
> > > >
> > > > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > > > ---
> > > >  lib/librte_efd/rte_efd_x86.h | 41
> > > > ++++++++++++++++++++++++++++++++++++++---
> > > >  1 file changed, 38 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/lib/librte_efd/rte_efd_x86.h
> > > > b/lib/librte_efd/rte_efd_x86.h index 34f37d7..93b6743 100644
> > > > --- a/lib/librte_efd/rte_efd_x86.h
> > > > +++ b/lib/librte_efd/rte_efd_x86.h
> > > > @@ -43,12 +43,29 @@
> > > >  #define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)  #endif
> > > >
> > > > +typedef efd_value_t
> > > > +(*efd_lookup_internal_avx2_t)(const efd_hashfunc_t
> *group_hash_idx,
> > > > +		const efd_lookuptbl_t *group_lookup_table,
> > > > +		const uint32_t hash_val_a, const uint32_t hash_val_b);
> > > > +
> > > > +static efd_lookup_internal_avx2_t efd_lookup_internal_avx2_ptr;
> > > > +
> > > >  static inline efd_value_t
> > > >  efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> > > >  		const efd_lookuptbl_t *group_lookup_table,
> > > >  		const uint32_t hash_val_a, const uint32_t hash_val_b)  { -
> > > #ifdef
> > > > RTE_MACHINE_CPUFLAG_AVX2
> > > > +	return (*efd_lookup_internal_avx2_ptr)(group_hash_idx,
> > > > +					       group_lookup_table,
> > > > +					       hash_val_a, hash_val_b);
> > >
> > > I don't think you need all that.
> > > All you need - build proper avx2 function even if current HW doesn't
> > > support it.
> > > The existing runtime selection here seems ok already.
> > > Konstantin
> > >
> >
> > Sorry, not quite understand here. So don't need to change codes of efd
> here?
> > I didn't care about the HW. CC_SUPPORT_AVX2 only means the compiler
> supports AVX2 since would runtime selection.
> > The existing codes RTE_MACHINE_CPUFLAG_AVX2 means both the
> compiler and HW supports AVX2.
> 
> What I am saying - you don't need all these dances with extra function
> pointer.
> All you need - move efd_lookup_internal_avx2() into a .c file and make sure
> it get compiled with -mavx2 flag.
> Then at rte_efd_create() select AVX2 only when both HW and compiler
> supports AVX2:
> 
> ...
> #ifdef CC_SUPPORT_AVX2
>   if (RTE_EFD_VALUE_NUM_BITS > 3 &&
> rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
>                 table->lookup_fn = EFD_LOOKUP_AVX2;
>         else
> #endif
> ...
> 
> Konstantin
> 
> >
> >
> > Best Regards,
> > Xiaoyun Li
> >
> >
> > > > +}
> > > > +
> > > > +#ifdef CC_SUPPORT_AVX2
> > > > +static inline efd_value_t
> > > > +efd_lookup_internal_avx2_AVX2(const efd_hashfunc_t
> *group_hash_idx,
> > > > +		const efd_lookuptbl_t *group_lookup_table,
> > > > +		const uint32_t hash_val_a, const uint32_t hash_val_b) {
> > > >  	efd_value_t value = 0;
> > > >  	uint32_t i = 0;
> > > >  	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a); @@ -
> > > 74,13
> > > > +91,31 @@ efd_lookup_internal_avx2(const efd_hashfunc_t
> > > *group_hash_idx,
> > > >  	}
> > > >
> > > >  	return value;
> > > > -#else
> > > > +}
> > > > +#endif
> > > > +
> > > > +static inline efd_value_t
> > > > +efd_lookup_internal_avx2_DEFAULT(const efd_hashfunc_t
> > > *group_hash_idx,
> > > > +		const efd_lookuptbl_t *group_lookup_table,
> > > > +		const uint32_t hash_val_a, const uint32_t hash_val_b) {
> > > >  	RTE_SET_USED(group_hash_idx);
> > > >  	RTE_SET_USED(group_lookup_table);
> > > >  	RTE_SET_USED(hash_val_a);
> > > >  	RTE_SET_USED(hash_val_b);
> > > >  	/* Return dummy value, only to avoid compilation breakage */
> > > >  	return 0;
> > > > -#endif
> > > > +}
> > > >
> > > > +static void __attribute__((constructor))
> > > > +rte_efd_x86_init(void)
> > > > +{
> > > > +#ifdef CC_SUPPORT_AVX2
> > > > +	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> > > > +		efd_lookup_internal_avx2_ptr =
> > > efd_lookup_internal_avx2_AVX2;
> > > > +	else
> > > > +		efd_lookup_internal_avx2_ptr =
> > > efd_lookup_internal_avx2_DEFAULT;
> > > > +#else
> > > > +	efd_lookup_internal_avx2_ptr = efd_lookup_internal_avx2_DEFAULT;
> > > > +#endif
> > > >  }
> > > > --
> > > > 2.7.4


* Re: [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-03 11:15         ` Ananyev, Konstantin
@ 2017-10-03 11:39           ` Li, Xiaoyun
  2017-10-03 12:12             ` Ananyev, Konstantin
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-03 11:39 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

Hi
You mean just use rte_memcpy_internal in rte_memcpy_avx2 and rte_memcpy_avx512?
But if RTE_MACHINE_CPUFLAG_AVX2 only indicates whether the compiler supports AVX2, then rte_memcpy_internal would be compiled
with only the AVX2 code and could not choose another code path. What if the HW cannot support AVX2?
If RTE_MACHINE_CPUFLAG_AVX2 keeps its current meaning, that both the compiler and the HW support AVX2, then the function
is no different from what we have right now.
The macro is determined at compilation time, but the selection is meant to happen at run time.
Did I get something wrong?

Best Regards,
Xiaoyun Li




> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, October 3, 2017 19:16
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org
> Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> 
> Hi,
> 
> >
> > Hi
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Tuesday, October 3, 2017 00:39
> > > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > > <bruce.richardson@intel.com>
> > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; dev@dpdk.org
> > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Li, Xiaoyun
> > > > Sent: Monday, October 2, 2017 5:13 PM
> > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > Richardson,
> > > Bruce <bruce.richardson@intel.com>
> > > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun
> > > <xiaoyun.li@intel.com>
> > > > Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> > > >
> > > > This patch dynamically selects functions of memcpy at run-time
> > > > based on CPU flags that current machine supports. This patch uses
> > > > function pointers which are bind to the relative functions at constrctor
> time.
> > > > In addition, AVX512 instructions set would be compiled only if
> > > > users config it enabled and the compiler supports it.
> > > >
> > > > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > > > ---
> > > > v2
> > > > * Use gcc function multi-versioning to avoid compilation issues.
> > > > * Add macros for AVX512 and AVX2. Only if users enable AVX512 and
> > > > the compiler supports it, the AVX512 codes would be compiled. Only
> > > > if the compiler supports AVX2, the AVX2 codes would be compiled.
> > > >
> > > > v3
> > > > * Reduce function calls via only keep rte_memcpy_xxx.
> > > > * Add conditions that when copy size is small, use inline code path.
> > > > Otherwise, use dynamic code path.
> > > > * To support attribute target, clang version must be greater than 3.7.
> > > > Otherwise, would choose SSE/AVX code path, the same as before.
> > > > * Move two mocro functions to the top of the code since they would
> > > > be used in inline SSE/AVX and dynamic SSE/AVX codes.
> > > >
> > > > v4
> > > > * Modify rte_memcpy.h to several .c files and modify makefiles to
> > > > compile
> > > > AVX2 and AVX512 files.
> > >
> > > Could you explain to me why instead of reusing existing rte_memcpy()
> > > code to generate _sse/_avx2/ax512f flavors you keep pushing changes
> > > with 3 separate implementations?
> > > Obviously that is much more expensive in terms of maintenance and
> > > doesn't look like feasible solution to me.
> > > Is existing rte_memcpy() implementation is not good enough in terms
> > > of functionality and/or performance?
> > > If so, can you outline these problems and try to fix them first.
> > > Konstantin
> > >
> >
> > I just change many small functions to one function in those 3 separate
> functions.
> 
> Yes, so with what you suggest  we'll have 4 implementations  for rte_memcpy
> to support.
> That's very expensive terms of maintenance and I believe totally unnecessary.
> 
> > Because the existing codes are totally inline, including rte_memcpy()
> > itself. So the compilation will change all rte_memcpy() calls into the basic
> codes like xmm0=xxx.
> >
> > The existing codes in this way are OK.
> 
> Good.
> 
> >But when run-time, it will bring lots of function calls  and cause perf
> >drop.
> 
> I believe it wouldn't if we do it properly.
> All internal functions (mov16, mov32, etc.) will still be unlined by the
> compiler for each flavor (sse/avx2/etc.) - have a look at the patch I sent.
> 
> Konstantin


* Re: [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-03 11:39           ` Li, Xiaoyun
@ 2017-10-03 12:12             ` Ananyev, Konstantin
  2017-10-03 12:23               ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-03 12:12 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev



> 
> Hi
> You mean just use rte_memcpy_internal in rte_memcpy_avx2, rte_memcpy_avx512?

Yes, exactly, and for rte_memcpy_sse() too.
Basically, for rte_memcpy_avx512() we force the compiler to use the AVX512F path inside rte_memcpy_internal(),
for rte_memcpy_avx2() we use the AVX2 path inside rte_memcpy_internal(), etc.
To do that we set up:
CFLAGS_rte_memcpy_avx512f.o += -mavx512f
CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
inside the Makefile.

In the same way, for rte_memcpy_avx2() we force the compiler to use the AVX2 path inside rte_memcpy_internal(), etc.
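
As a rough single-file illustration of "one shared implementation, several flavors": the sketch below is an assumption-laden stand-in, not the patch itself. It uses the GCC/clang target attribute in place of the per-file CFLAGS shown above, a plain byte loop in place of the real rte_memcpy_internal(), and all demo_* / DEMO_* names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Single shared copy routine. It is forced inline so that each wrapper
 * below gets its own copy of the body, compiled for that wrapper's
 * instruction set.
 */
static inline __attribute__((always_inline)) void *
demo_memcpy_internal(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t i;

	for (i = 0; i < n; i++)
		d[i] = s[i];
	return dst;
}

/* The target attribute only makes sense on x86; elsewhere it expands to nothing. */
#if defined(__x86_64__) || defined(__i386__)
#define DEMO_TARGET(isa) __attribute__((target(isa)))
#else
#define DEMO_TARGET(isa)
#endif

/* Stamp out one non-inline flavor per instruction set from the same body. */
#define DEMO_MEMCPY_FLAVOR(name, isa) \
	DEMO_TARGET(isa) void * \
	demo_memcpy_##name(void *dst, const void *src, size_t n) \
	{ \
		return demo_memcpy_internal(dst, src, n); \
	}

DEMO_MEMCPY_FLAVOR(sse, "sse2")
DEMO_MEMCPY_FLAVOR(avx2, "avx2")
```

There is still only one body to maintain; the flavors differ only in the instruction set the compiler is allowed to use when it inlines that body. The AVX2 flavor must be called only after a run-time check that the CPU supports AVX2.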

> But if RTE_MACHINE_CPUFLAGS_AVX2 means only whether the compiler supports avx2, then internal would only compiled
> With avx2 codes, then cannot choose other code path. What if the HW cannot support avx2?

If the HW can't support AVX2, then rte_memcpy_init() simply wouldn't select rte_memcpy_avx2();
it would select rte_memcpy_sse() instead:

if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {...} -
that is a run-time check that the underlying HW does support AVX2.
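
A minimal sketch of that kind of init-time selection follows, under stated assumptions: the sketch_* names are hypothetical, both flavors simply call memcpy() to keep the example short, and the GCC/clang builtins __builtin_cpu_init() and __builtin_cpu_supports() stand in for rte_cpu_get_flag_enabled().

```c
#include <stddef.h>
#include <string.h>

typedef void *(*sketch_memcpy_t)(void *dst, const void *src, size_t n);

/*
 * Hypothetical flavors. In the scheme described above each would live in
 * its own .c file built with its own -m flags; here both just call memcpy().
 */
static void *
sketch_memcpy_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *
sketch_memcpy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Default to the baseline flavor until the constructor has run. */
static sketch_memcpy_t sketch_memcpy_ptr = sketch_memcpy_sse;

/* Run-time HW check; always 0 on non-x86 builds. */
static int
sketch_hw_has_avx2(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx2") != 0;
#else
	return 0;
#endif
}

/* Bind the pointer once, before main(), based on what the HW reports. */
static void __attribute__((constructor))
sketch_memcpy_init(void)
{
	if (sketch_hw_has_avx2())
		sketch_memcpy_ptr = sketch_memcpy_avx2;
	else
		sketch_memcpy_ptr = sketch_memcpy_sse;
}

/* Callers always go through the pointer chosen at init time. */
static inline void *
sketch_memcpy(void *dst, const void *src, size_t n)
{
	return (*sketch_memcpy_ptr)(dst, src, n);
}
```

The AVX2 flavor can always be present in the binary; it is reached only on machines where the run-time check succeeds.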

Konstantin

> If RTE_MACHINE_CPUFLAGS_AVX2 means as before, suggests whether both compiler and HW supports avx2. Then the function
> has no difference right now.
> The mocro is determined at compilation time. But selection is hoped to be at runtime.
> Did I consider something wrong?
> 
> Best Regards,
> Xiaoyun Li
> 
> 
> 
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Tuesday, October 3, 2017 19:16
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org
> > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> >
> > Hi,
> >
> > >
> > > Hi
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Tuesday, October 3, 2017 00:39
> > > > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > > > <bruce.richardson@intel.com>
> > > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > > <helin.zhang@intel.com>; dev@dpdk.org
> > > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Li, Xiaoyun
> > > > > Sent: Monday, October 2, 2017 5:13 PM
> > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > > Richardson,
> > > > Bruce <bruce.richardson@intel.com>
> > > > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > > <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun
> > > > <xiaoyun.li@intel.com>
> > > > > Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> > > > >
> > > > > This patch dynamically selects functions of memcpy at run-time
> > > > > based on CPU flags that current machine supports. This patch uses
> > > > > function pointers which are bind to the relative functions at constrctor
> > time.
> > > > > In addition, AVX512 instructions set would be compiled only if
> > > > > users config it enabled and the compiler supports it.
> > > > >
> > > > > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > ---
> > > > > v2
> > > > > * Use gcc function multi-versioning to avoid compilation issues.
> > > > > * Add macros for AVX512 and AVX2. Only if users enable AVX512 and
> > > > > the compiler supports it, the AVX512 codes would be compiled. Only
> > > > > if the compiler supports AVX2, the AVX2 codes would be compiled.
> > > > >
> > > > > v3
> > > > > * Reduce function calls via only keep rte_memcpy_xxx.
> > > > > * Add conditions that when copy size is small, use inline code path.
> > > > > Otherwise, use dynamic code path.
> > > > > * To support attribute target, clang version must be greater than 3.7.
> > > > > Otherwise, would choose SSE/AVX code path, the same as before.
> > > > > * Move two mocro functions to the top of the code since they would
> > > > > be used in inline SSE/AVX and dynamic SSE/AVX codes.
> > > > >
> > > > > v4
> > > > > * Modify rte_memcpy.h to several .c files and modify makefiles to
> > > > > compile
> > > > > AVX2 and AVX512 files.
> > > >
> > > > Could you explain to me why instead of reusing existing rte_memcpy()
> > > > code to generate _sse/_avx2/ax512f flavors you keep pushing changes
> > > > with 3 separate implementations?
> > > > Obviously that is much more expensive in terms of maintenance and
> > > > doesn't look like feasible solution to me.
> > > > Is existing rte_memcpy() implementation is not good enough in terms
> > > > of functionality and/or performance?
> > > > If so, can you outline these problems and try to fix them first.
> > > > Konstantin
> > > >
> > >
> > > I just change many small functions to one function in those 3 separate
> > functions.
> >
> > Yes, so with what you suggest  we'll have 4 implementations  for rte_memcpy
> > to support.
> > That's very expensive terms of maintenance and I believe totally unnecessary.
> >
> > > Because the existing codes are totally inline, including rte_memcpy()
> > > itself. So the compilation will change all rte_memcpy() calls into the basic
> > codes like xmm0=xxx.
> > >
> > > The existing codes in this way are OK.
> >
> > Good.
> >
> > >But when run-time, it will bring lots of function calls  and cause perf
> > >drop.
> >
> > I believe it wouldn't if we do it properly.
> > All internal functions (mov16, mov32, etc.) will still be unlined by the
> > compiler for each flavor (sse/avx2/etc.) - have a look at the patch I sent.
> >
> > Konstantin


* Re: [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-03 12:12             ` Ananyev, Konstantin
@ 2017-10-03 12:23               ` Li, Xiaoyun
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-03 12:23 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

OK. Got it. Thanks!

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, October 3, 2017 20:12
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org
> Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> 
> 
> 
> >
> > Hi
> > You mean just use rte_memcpy_internal in rte_memcpy_avx2,
> rte_memcpy_avx512?
> 
> Yes, exactly and for rte_memcpy_sse() too.
> Basically we for rte_memcpy_avx512() we force compiler to use AVX512F
> path inside rte_memcpy_iternal(), for rte_memcpy_avx2() we use AVX2 path
> inside rte_memcpy_internal(), etc.
> To do that we setup:
> CFLAGS_rte_memcpy_avx512f.o += -mavx512f
> CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
> inside the Makefile.
> 
> For rte_memcpy_avx2() we force compiler to use AVX2 path inside
> rte_memcpy_internal(), etc.
> 
> > But if RTE_MACHINE_CPUFLAGS_AVX2 means only whether the compiler
> > supports avx2, then internal would only compiled With avx2 codes, then
> cannot choose other code path. What if the HW cannot support avx2?
> 
> If the HW can't support AVX2 then rte_memcpy_init() just wouldn't select
> rte_memcpy_avx2(), it would select rte_memcpy_sse() instead:
> 
> if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {...} - that is a runtime
> check that underlying HW does support AVX2.
> 
> Konstantin
> 
> > If RTE_MACHINE_CPUFLAGS_AVX2 means as before, suggests whether
> both
> > compiler and HW supports avx2. Then the function has no difference right
> now.
> > The mocro is determined at compilation time. But selection is hoped to be
> at runtime.
> > Did I consider something wrong?
> >
> > Best Regards,
> > Xiaoyun Li
> >
> >
> >
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Tuesday, October 3, 2017 19:16
> > > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > > <bruce.richardson@intel.com>
> > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; dev@dpdk.org
> > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> > >
> > > Hi,
> > >
> > > >
> > > > Hi
> > > >
> > > > > -----Original Message-----
> > > > > From: Ananyev, Konstantin
> > > > > Sent: Tuesday, October 3, 2017 00:39
> > > > > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > > > > <bruce.richardson@intel.com>
> > > > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > > > <helin.zhang@intel.com>; dev@dpdk.org
> > > > > Subject: RE: [PATCH v4 1/3] eal/x86: run-time dispatch over
> > > > > memcpy
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Li, Xiaoyun
> > > > > > Sent: Monday, October 2, 2017 5:13 PM
> > > > > > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > > > Richardson,
> > > > > Bruce <bruce.richardson@intel.com>
> > > > > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > > > <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun
> > > > > <xiaoyun.li@intel.com>
> > > > > > Subject: [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy
> > > > > >
> > > > > > This patch dynamically selects functions of memcpy at run-time
> > > > > > based on CPU flags that current machine supports. This patch
> > > > > > uses function pointers which are bind to the relative
> > > > > > functions at constrctor
> > > time.
> > > > > > In addition, AVX512 instructions set would be compiled only if
> > > > > > users config it enabled and the compiler supports it.
> > > > > >
> > > > > > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > > > > > ---
> > > > > > v2
> > > > > > * Use gcc function multi-versioning to avoid compilation issues.
> > > > > > * Add macros for AVX512 and AVX2. Only if users enable AVX512
> > > > > > and the compiler supports it, the AVX512 codes would be
> > > > > > compiled. Only if the compiler supports AVX2, the AVX2 codes
> would be compiled.
> > > > > >
> > > > > > v3
> > > > > > * Reduce function calls via only keep rte_memcpy_xxx.
> > > > > > * Add conditions that when copy size is small, use inline code path.
> > > > > > Otherwise, use dynamic code path.
> > > > > > * To support attribute target, clang version must be greater than 3.7.
> > > > > > Otherwise, would choose SSE/AVX code path, the same as before.
> > > > > > > * Move two macro functions to the top of the code since they
> > > > > > would be used in inline SSE/AVX and dynamic SSE/AVX codes.
> > > > > >
> > > > > > v4
> > > > > > * Modify rte_memcpy.h to several .c files and modify makefiles
> > > > > > to compile
> > > > > > AVX2 and AVX512 files.
> > > > >
> > > > > > Could you explain to me why, instead of reusing the existing
> > > > > > rte_memcpy() code to generate _sse/_avx2/_avx512f flavors, you keep
> > > > > > pushing changes with 3 separate implementations?
> > > > > > Obviously that is much more expensive in terms of maintenance
> > > > > > and doesn't look like a feasible solution to me.
> > > > > > Is the existing rte_memcpy() implementation not good enough in
> > > > > > terms of functionality and/or performance?
> > > > > > If so, can you outline these problems and try to fix them first?
> > > > > Konstantin
> > > > >
> > > >
> > > > > I just change many small functions to one function in those 3
> > > > > separate functions.
> > >
> > > > Yes, so with what you suggest we'll have 4 implementations of
> > > > rte_memcpy to support.
> > > > That's very expensive in terms of maintenance and I believe totally
> > > > unnecessary.
> > >
> > > > > Because the existing code is fully inline, including
> > > > > rte_memcpy() itself, the compiler turns all
> > > > > rte_memcpy() calls into basic code like xmm0=xxx.
> > > >
> > > > The existing codes in this way are OK.
> > >
> > > Good.
> > >
> > > > > But at run-time, it will bring lots of function calls and cause
> > > > > a perf drop.
> > >
> > > I believe it wouldn't if we do it properly.
> > > > All internal functions (mov16, mov32, etc.) will still be inlined by
> > > > the compiler for each flavor (sse/avx2/etc.) - have a look at the patch I
> > > > sent.
> > >
> > > Konstantin

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [dpdk-dev] [PATCH v5 0/3] run-time Linking support
  2017-10-02 16:13 ` [dpdk-dev] [PATCH v4 0/3] run-time Linking support Xiaoyun Li
                     ` (2 preceding siblings ...)
  2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
@ 2017-10-03 14:59   ` Xiaoyun Li
  2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
                       ` (4 more replies)
  3 siblings, 5 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-03 14:59 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patchset dynamically selects functions at run-time based on the CPU flags
that the current machine supports. It modifies memcpy, the memcpy perf
test and x86 EFD to use function pointers that are bound at constructor time.
In a cloud environment, users can then compile once for a minimum target
such as 'haswell' (not 'native'), run on different platforms (haswell or
later), and still get ISA optimizations based on the running CPU.
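
As a rough illustration of the constructor-time binding described above (a
minimal sketch, not the code in this series): the names copy_fn, copy_ptr,
copy_init and the copy_* stand-ins are hypothetical, and the compiler builtin
__builtin_cpu_supports() is used here in place of DPDK's
rte_cpu_get_flag_enabled() so that the example builds on its own with gcc or
clang.

```c
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/*
 * Stand-ins for the per-ISA implementations. In a real build each one
 * would live in its own .c file compiled with the matching -m flag;
 * here they all forward to memcpy() so the sketch stays portable.
 */
static void *copy_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_avx512f(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Start from the baseline so a call made before the constructor is safe. */
static copy_fn copy_ptr = copy_sse;

/* Runs once at load time and binds the widest flavor the CPU supports. */
static void __attribute__((constructor))
copy_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f")) {
		copy_ptr = copy_avx512f;
		return;
	}
	if (__builtin_cpu_supports("avx2")) {
		copy_ptr = copy_avx2;
		return;
	}
#endif
	copy_ptr = copy_sse;
}
```

Callers then go through copy_ptr(dst, src, n). The check order (widest ISA
first) mirrors the intent of the series; everything else is an assumption made
for the sake of a self-contained example.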

Xiaoyun Li (3):
  eal/x86: run-time dispatch over memcpy
  app/test: run-time dispatch over memcpy perf test
  efd: run-time dispatch over x86 EFD functions

---
v2
* Use gcc function multi-versioning to avoid compilation issues.
* Add macros for AVX512 and AVX2. The AVX512 code is compiled only if users
enable AVX512 and the compiler supports it. The AVX2 code is compiled only if
the compiler supports AVX2.

v3
* Reduce function calls by keeping only rte_memcpy_xxx.
* Use the inline code path when the copy size is small; otherwise use the
dynamic code path.
* Attribute target requires a clang version greater than 3.7. With older
versions the SSE/AVX code path is chosen, the same as before.
* Move two macro functions to the top of the code since they are used in both
the inline SSE/AVX and the dynamic SSE/AVX code.

v4
* Split rte_memcpy.h into several .c files and modify makefiles to compile
the AVX2 and AVX512 files.

v5
* Delete redundant repeated code of rte_memcpy_xxx.
* Modify makefiles to enable reuse of existing rte_memcpy.
* Delete redundant code of rte_efd_x86.h in v4. Move it into a .c file and
compile it with -mavx2 in the makefile, since it is already chosen at run-time.
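
The small-copy/large-copy split mentioned in the v3 notes could look roughly
like the sketch below. This is an assumption-laden illustration, not the code
in the patch: hybrid_copy, copy_small, copy_large, copy_large_ptr and
large_calls are hypothetical names, COPY_THRESH only mirrors the value of
RTE_X86_MEMCPY_THRESH from patch 1/3, and whether the boundary is inclusive
is a guess.

```c
#include <stddef.h>
#include <string.h>

#define COPY_THRESH 128 /* same value as RTE_X86_MEMCPY_THRESH */

/* Counts trips through the indirect path, for demonstration only. */
static unsigned int large_calls;

/* Stand-in for the flavor selected at run-time (SSE/AVX2/AVX512). */
static void *copy_large(void *dst, const void *src, size_t n)
{
	large_calls++;
	return memcpy(dst, src, n);
}

/* In the real design this pointer would be bound in a constructor. */
static void *(*copy_large_ptr)(void *, const void *, size_t) = copy_large;

/* Small copies stay inline, so no indirect call is paid for them. */
static inline void *copy_small(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

static inline void *hybrid_copy(void *dst, const void *src, size_t n)
{
	if (n <= COPY_THRESH)
		return copy_small(dst, src, n);
	return copy_large_ptr(dst, src, n);
}
```

The point of the split is that the cost of an indirect call is only paid when
the copy is large enough for that cost to be amortized.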

 lib/librte_eal/bsdapp/eal/Makefile                 |  19 +
 .../common/include/arch/x86/rte_memcpy.c           |  59 ++
 .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
 .../common/include/arch/x86/rte_memcpy_avx2.c      |  44 +
 .../common/include/arch/x86/rte_memcpy_avx512f.c   |  44 +
 .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
 .../common/include/arch/x86/rte_memcpy_sse.c       |  40 +
 lib/librte_eal/linuxapp/eal/Makefile               |  19 +
 lib/librte_efd/Makefile                            |   6 +
 lib/librte_efd/rte_efd_x86.c                       |  87 ++
 lib/librte_efd/rte_efd_x86.h                       |  48 +-
 mk/rte.cpuflags.mk                                 |  14 +
 test/test/test_memcpy_perf.c                       |  40 +-
 13 files changed, 1285 insertions(+), 905 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
 create mode 100644 lib/librte_efd/rte_efd_x86.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [dpdk-dev] [PATCH v5 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-03 14:59   ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Xiaoyun Li
@ 2017-10-03 14:59     ` Xiaoyun Li
  2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
                       ` (3 subsequent siblings)
  4 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-03 14:59 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch dynamically selects memcpy functions at run-time based
on the CPU flags that the current machine supports. It uses function
pointers which are bound to the relevant functions at constructor time.
In addition, the AVX512 instruction set is compiled only if users
enable it in the config and the compiler supports it.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_eal/bsdapp/eal/Makefile                 |  19 +
 .../common/include/arch/x86/rte_memcpy.c           |  59 ++
 .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
 .../common/include/arch/x86/rte_memcpy_avx2.c      |  44 +
 .../common/include/arch/x86/rte_memcpy_avx512f.c   |  44 +
 .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
 .../common/include/arch/x86/rte_memcpy_sse.c       |  40 +
 lib/librte_eal/linuxapp/eal/Makefile               |  19 +
 mk/rte.cpuflags.mk                                 |  14 +
 9 files changed, 1163 insertions(+), 846 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c

diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile
index 005019e..1dcd2e3 100644
--- a/lib/librte_eal/bsdapp/eal/Makefile
+++ b/lib/librte_eal/bsdapp/eal/Makefile
@@ -36,6 +36,7 @@ LIB = librte_eal.a
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
 
 CFLAGS += -I$(SRCDIR)/include
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
@@ -93,6 +94,24 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
new file mode 100644
index 0000000..74ae702
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
@@ -0,0 +1,59 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+#include <rte_cpuflags.h>
+#include <rte_log.h>
+
+void *(*rte_memcpy_ptr)(void *dst, const void *src, size_t n) = NULL;
+
+static void __attribute__((constructor))
+rte_memcpy_init(void)
+{
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
+		rte_memcpy_ptr = rte_memcpy_avx512f;
+		RTE_LOG(DEBUG, EAL, "Using AVX512 memcpy\n");
+		return;
+	}
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
+		rte_memcpy_ptr = rte_memcpy_avx2;
+		RTE_LOG(DEBUG, EAL, "Using AVX2 memcpy\n");
+		return;
+	}
+#endif
+	rte_memcpy_ptr = rte_memcpy_sse;
+	RTE_LOG(DEBUG, EAL, "Using default SSE/AVX memcpy\n");
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 74c280c..460dcdb 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
  *   All rights reserved.
  *
  *   Redistribution and use in source and binary forms, with or without
@@ -34,867 +34,36 @@
 #ifndef _RTE_MEMCPY_X86_64_H_
 #define _RTE_MEMCPY_X86_64_H_
 
-/**
- * @file
- *
- * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
- */
-
-#include <stdio.h>
-#include <stdint.h>
-#include <string.h>
-#include <rte_vect.h>
-#include <rte_common.h>
+#include <rte_memcpy_internal.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-/**
- * Copy bytes from one location to another. The locations must not overlap.
- *
- * @note This is implemented as a macro, so it's address should not be taken
- * and care is needed as parameter expressions may be evaluated multiple times.
- *
- * @param dst
- *   Pointer to the destination of the data.
- * @param src
- *   Pointer to the source data.
- * @param n
- *   Number of bytes to copy.
- * @return
- *   Pointer to the destination data.
- */
-static __rte_always_inline void *
-rte_memcpy(void *dst, const void *src, size_t n);
-
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define RTE_X86_MEMCPY_THRESH 128
 
-#define ALIGNMENT_MASK 0x3F
+extern void *
+(*rte_memcpy_ptr)(void *dst, const void *src, size_t n);
 
 /**
- * AVX512 implementation below
+ * Different implementations of memcpy.
  */
+extern void*
+rte_memcpy_avx512f(void *dst, const void *src, size_t n);
 
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
+extern void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n);
 
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	__m512i zmm0;
-
-	zmm0 = _mm512_loadu_si512((const void *)src);
-	_mm512_storeu_si512((void *)dst, zmm0);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-	rte_mov64(dst + 2 * 64, src + 2 * 64);
-	rte_mov64(dst + 3 * 64, src + 3 * 64);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1;
-
-	while (n >= 128) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 128;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		src = src + 128;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		dst = dst + 128;
-	}
-}
-
-/**
- * Copy 512-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
-
-	while (n >= 512) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 512;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
-		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
-		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
-		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
-		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
-		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
-		src = src + 512;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
-		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
-		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
-		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
-		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
-		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
-		dst = dst + 512;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				  (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				  (const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK63:
-		if (n > 64) {
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-			return ret;
-		}
-		if (n > 0)
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes
-	 */
-	dstofss = ((uintptr_t)dst & 0x3F);
-	if (dstofss > 0) {
-		dstofss = 64 - dstofss;
-		n -= dstofss;
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 512-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 511;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy 128-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	if (n >= 128) {
-		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-		bits = n;
-		n = n & 127;
-		bits -= n;
-		src = (const uint8_t *)src + bits;
-		dst = (uint8_t *)dst + bits;
-	}
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK63;
-}
-
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-
-#define ALIGNMENT_MASK 0x1F
-
-/**
- * AVX2 implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
-
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
-	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m256i ymm0, ymm1, ymm2, ymm3;
-
-	while (n >= 128) {
-		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 128;
-		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
-		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
-		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
-		src = (const uint8_t *)src + 128;
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
-		dst = (uint8_t *)dst + 128;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 256 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 256) {
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK31:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-		if (n > 32) {
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 256 bytes
-	 */
-	dstofss = (uintptr_t)dst & 0x1F;
-	if (dstofss > 0) {
-		dstofss = 32 - dstofss;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 128-byte blocks
-	 */
-	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 127;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK31;
-}
-
-#else /* RTE_MACHINE_CPUFLAG */
-
-#define ALIGNMENT_MASK 0x0F
-
-/**
- * SSE & AVX implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
-	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
-	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
-	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
-	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
-	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
-	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
-	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
-}
-
-/**
- * Macro for copying unaligned block from one location to another with constant load offset,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be immediate value within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
-__extension__ ({                                                                                            \
-    int tmp;                                                                                                \
-    while (len >= 128 + 16 - offset) {                                                                      \
-        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
-        len -= 128;                                                                                         \
-        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
-        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
-        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
-        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
-        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
-        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
-        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
-        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
-        src = (const uint8_t *)src + 128;                                                                   \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
-        dst = (uint8_t *)dst + 128;                                                                         \
-    }                                                                                                       \
-    tmp = len;                                                                                              \
-    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
-    tmp -= len;                                                                                             \
-    src = (const uint8_t *)src + tmp;                                                                       \
-    dst = (uint8_t *)dst + tmp;                                                                             \
-    if (len >= 32 + 16 - offset) {                                                                          \
-        while (len >= 32 + 16 - offset) {                                                                   \
-            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
-            len -= 32;                                                                                      \
-            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
-            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
-            src = (const uint8_t *)src + 32;                                                                \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
-            dst = (uint8_t *)dst + 32;                                                                      \
-        }                                                                                                   \
-        tmp = len;                                                                                          \
-        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
-        tmp -= len;                                                                                         \
-        src = (const uint8_t *)src + tmp;                                                                   \
-        dst = (uint8_t *)dst + tmp;                                                                         \
-    }                                                                                                       \
-})
-
-/**
- * Macro for copying unaligned block from one location to another,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Use switch here because the aligning instruction requires immediate value for shift count.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
-__extension__ ({                                                      \
-    switch (offset) {                                                 \
-    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
-    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
-    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
-    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
-    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
-    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
-    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
-    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
-    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
-    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
-    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
-    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
-    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
-    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
-    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
-    default:;                                                         \
-    }                                                                 \
-})
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t srcofs;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 128) {
-		goto COPY_BLOCK_128_BACK15;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-COPY_BLOCK_255_BACK15:
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK15:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-COPY_BLOCK_64_BACK15:
-		if (n >= 32) {
-			n -= 32;
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 32;
-			dst = (uint8_t *)dst + 32;
-		}
-		if (n > 16) {
-			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes,
-	 * and make sure the first 15 bytes are copied, because
-	 * unaligned copy functions require up to 15 bytes
-	 * backwards access.
-	 */
-	dstofss = (uintptr_t)dst & 0x0F;
-	if (dstofss > 0) {
-		dstofss = 16 - dstofss + 16;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-	srcofs = ((uintptr_t)src & 0x0F);
-
-	/**
-	 * For aligned copy
-	 */
-	if (srcofs == 0) {
-		/**
-		 * Copy 256-byte blocks
-		 */
-		for (; n >= 256; n -= 256) {
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			dst = (uint8_t *)dst + 256;
-			src = (const uint8_t *)src + 256;
-		}
-
-		/**
-		 * Copy whatever left
-		 */
-		goto COPY_BLOCK_255_BACK15;
-	}
-
-	/**
-	 * For copy with unaligned load
-	 */
-	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_64_BACK15;
-}
-
-#endif /* RTE_MACHINE_CPUFLAG */
-
-static inline void *
-rte_memcpy_aligned(void *dst, const void *src, size_t n)
-{
-	void *ret = dst;
-
-	/* Copy size <= 16 bytes */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dst = *(const uint8_t *)src;
-			src = (const uint8_t *)src + 1;
-			dst = (uint8_t *)dst + 1;
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dst = *(const uint16_t *)src;
-			src = (const uint16_t *)src + 1;
-			dst = (uint16_t *)dst + 1;
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dst = *(const uint32_t *)src;
-			src = (const uint32_t *)src + 1;
-			dst = (uint32_t *)dst + 1;
-		}
-		if (n & 0x08)
-			*(uint64_t *)dst = *(const uint64_t *)src;
-
-		return ret;
-	}
-
-	/* Copy 16 <= size <= 32 bytes */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-
-		return ret;
-	}
-
-	/* Copy 32 < size <= 64 bytes */
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-
-		return ret;
-	}
-
-	/* Copy 64 bytes blocks */
-	for (; n >= 64; n -= 64) {
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;
-	}
-
-	/* Copy whatever left */
-	rte_mov64((uint8_t *)dst - 64 + n,
-			(const uint8_t *)src - 64 + n);
-
-	return ret;
-}
+extern void *
+rte_memcpy_sse(void *dst, const void *src, size_t n);
 
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
-	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-		return rte_memcpy_aligned(dst, src, n);
+	if (n <= RTE_X86_MEMCPY_THRESH)
+		return rte_memcpy_internal(dst, src, n);
 	else
-		return rte_memcpy_generic(dst, src, n);
+		return (*rte_memcpy_ptr)(dst, src, n);
 }
 
 #ifdef __cplusplus
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
new file mode 100644
index 0000000..3ad229c
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX2
+#error RTE_MACHINE_CPUFLAG_AVX2 not defined
+#endif
+
+void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
new file mode 100644
index 0000000..be8d964
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX512F
+#error RTE_MACHINE_CPUFLAG_AVX512F not defined
+#endif
+
+void *
+rte_memcpy_avx512f(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
new file mode 100644
index 0000000..d17fb5b
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
@@ -0,0 +1,909 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCPY_INTERNAL_X86_64_H_
+#define _RTE_MEMCPY_INTERNAL_X86_64_H_
+
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <rte_vect.h>
+#include <rte_common.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+#define ALIGNMENT_MASK 0x3F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_loadu_si512((const void *)src);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+	rte_mov64(dst + 2 * 64, src + 2 * 64);
+	rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1;
+
+	while (n >= 128) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 128;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		src = src + 128;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		dst = dst + 128;
+	}
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+	while (n >= 512) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 512;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+		src = src + 512;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+		dst = dst + 512;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				  (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				  (const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK63:
+		if (n > 64) {
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+			return ret;
+		}
+		if (n > 0)
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
+	 */
+	dstofss = ((uintptr_t)dst & 0x3F);
+	if (dstofss > 0) {
+		dstofss = 64 - dstofss;
+		n -= dstofss;
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 512-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 511;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 128-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 128) {
+		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+	}
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK63;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+#define ALIGNMENT_MASK 0x1F
+
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1, ymm2, ymm3;
+
+	while (n >= 128) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 0 * 32));
+		n -= 128;
+		ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 1 * 32));
+		ymm2 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 2 * 32));
+		ymm3 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 3 * 32));
+		src = (const uint8_t *)src + 128;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
+		dst = (uint8_t *)dst + 128;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 256 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 256) {
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK31:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+		if (n > 32) {
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 256 bytes
+	 */
+	dstofss = (uintptr_t)dst & 0x1F;
+	if (dstofss > 0) {
+		dstofss = 32 - dstofss;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 128-byte blocks
+	 */
+	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 127;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK31;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+#define ALIGNMENT_MASK 0x0F
+
+/**
+ * SSE & AVX implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
+	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
+	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
+	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
+	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
+	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
+	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
+	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
+}
+
+/**
+ * Macro for copying unaligned block from one location to another with constant load offset,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be an immediate value within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
+__extension__ ({                                                                                            \
+    int tmp;                                                                                                \
+    while (len >= 128 + 16 - offset) {                                                                      \
+        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
+        len -= 128;                                                                                         \
+        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
+        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
+        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
+        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
+        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
+        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
+        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
+        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
+        src = (const uint8_t *)src + 128;                                                                   \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
+        dst = (uint8_t *)dst + 128;                                                                         \
+    }                                                                                                       \
+    tmp = len;                                                                                              \
+    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
+    tmp -= len;                                                                                             \
+    src = (const uint8_t *)src + tmp;                                                                       \
+    dst = (uint8_t *)dst + tmp;                                                                             \
+    if (len >= 32 + 16 - offset) {                                                                          \
+        while (len >= 32 + 16 - offset) {                                                                   \
+            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
+            len -= 32;                                                                                      \
+            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
+            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
+            src = (const uint8_t *)src + 32;                                                                \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
+            dst = (uint8_t *)dst + 32;                                                                      \
+        }                                                                                                   \
+        tmp = len;                                                                                          \
+        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
+        tmp -= len;                                                                                         \
+        src = (const uint8_t *)src + tmp;                                                                   \
+        dst = (uint8_t *)dst + tmp;                                                                         \
+    }                                                                                                       \
+})
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
+__extension__ ({                                                      \
+    switch (offset) {                                                 \
+    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
+    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
+    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
+    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
+    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
+    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
+    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
+    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
+    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
+    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
+    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
+    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
+    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
+    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
+    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
+    default:;                                                         \
+    }                                                                 \
+})
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t srcofs;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 128)
+		goto COPY_BLOCK_128_BACK15;
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov128((uint8_t *)dst + 128,
+					(const uint8_t *)src + 128);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+COPY_BLOCK_255_BACK15:
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK15:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK15:
+		if (n >= 32) {
+			n -= 32;
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 32;
+			dst = (uint8_t *)dst + 32;
+		}
+		if (n > 16) {
+			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes,
+	 * and make sure the first 15 bytes are copied, because
+	 * unaligned copy functions require up to 15 bytes
+	 * backwards access.
+	 */
+	dstofss = (uintptr_t)dst & 0x0F;
+	if (dstofss > 0) {
+		dstofss = 16 - dstofss + 16;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+	srcofs = ((uintptr_t)src & 0x0F);
+
+	/**
+	 * For aligned copy
+	 */
+	if (srcofs == 0) {
+		/**
+		 * Copy 256-byte blocks
+		 */
+		for (; n >= 256; n -= 256) {
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			dst = (uint8_t *)dst + 256;
+			src = (const uint8_t *)src + 256;
+		}
+
+		/**
+		 * Copy whatever left
+		 */
+		goto COPY_BLOCK_255_BACK15;
+	}
+
+	/**
+	 * For copy with unaligned load
+	 */
+	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+	/**
+	 * Copy whatever left
+	 */
+	goto COPY_BLOCK_64_BACK15;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+static inline void *
+rte_memcpy_aligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+
+	/* Copy size < 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return ret;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return ret;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return ret;
+	}
+
+	/* Copy 64 bytes blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+
+	return ret;
+}
+
+static inline void *
+rte_memcpy_internal(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+		return rte_memcpy_aligned(dst, src, n);
+	else
+		return rte_memcpy_generic(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCPY_INTERNAL_X86_64_H_ */
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
new file mode 100644
index 0000000..55d6b41
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
@@ -0,0 +1,40 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+void *
+rte_memcpy_sse(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 90bca4d..c8bdac0 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -40,6 +40,7 @@ VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
 LIBABIVER := 5
 
 VPATH += $(RTE_SDK)/lib/librte_eal/common
+VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
 
 CFLAGS += -I$(SRCDIR)/include
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
@@ -105,6 +106,24 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index a813c91..8a7a1e7 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -134,6 +134,20 @@ endif
 
 MACHINE_CFLAGS += $(addprefix -DRTE_MACHINE_CPUFLAG_,$(CPUFLAGS))
 
+# Check if the compiler supports AVX512
+CC_SUPPORT_AVX512F := $(shell $(CC) -mavx512f -dM -E - < /dev/null 2>&1 | grep -q AVX512 && echo 1)
+ifeq ($(CC_SUPPORT_AVX512F),1)
+ifeq ($(CONFIG_RTE_ENABLE_AVX512),y)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX512F
+endif
+endif
+
+# Check if the compiler supports AVX2
+CC_SUPPORT_AVX2 := $(shell $(CC) -mavx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
+ifeq ($(CC_SUPPORT_AVX2),1)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX2
+endif
+
 # To strip whitespace
 comma:= ,
 empty:=
-- 
2.7.4
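
For readers following the small-size paths in rte_memcpy_generic() and
rte_memcpy_aligned() above: the 16 <= n <= 32 case copies the first 16 bytes
and then the last 16 bytes, letting the two moves overlap instead of branching
on the exact length. A minimal portable sketch of that idea follows. The names
mov16() and copy16_to_32() are hypothetical, and memcpy() stands in for the
16-byte vector move done by rte_mov16(); this is an illustration under those
assumptions, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rte_mov16(): copy exactly 16 bytes. */
static inline void mov16(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 16);
}

/*
 * Copy n bytes, 16 <= n <= 32, with two fixed-size moves.
 * The second move ends exactly at dst + n and may overlap the first;
 * the overlapped bytes are written twice with the same data.
 */
static inline void *copy16_to_32(void *dst, const void *src, size_t n)
{
	assert(n >= 16 && n <= 32);
	mov16((uint8_t *)dst, (const uint8_t *)src);
	mov16((uint8_t *)dst + (n - 16), (const uint8_t *)src + (n - 16));
	return dst;
}
```

For n == 16 the second move rewrites the same 16 bytes; for n == 32 the two
moves do not overlap at all. The larger size classes in the patch apply the
same tail trick after their fixed-size block copies.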


* [dpdk-dev] [PATCH v5 2/3] app/test: run-time dispatch over memcpy perf test
  2017-10-03 14:59   ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Xiaoyun Li
  2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-03 14:59     ` Xiaoyun Li
  2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
                       ` (2 subsequent siblings)
  4 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-03 14:59 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch moves the selection of the alignment unit from build time
to run time, based on the CPU flags that the running machine supports.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 test/test/test_memcpy_perf.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/test/test/test_memcpy_perf.c b/test/test/test_memcpy_perf.c
index ff3aaaa..33def3b 100644
--- a/test/test/test_memcpy_perf.c
+++ b/test/test/test_memcpy_perf.c
@@ -79,13 +79,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
-#define ALIGNMENT_UNIT          64
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-#define ALIGNMENT_UNIT          32
-#else /* RTE_MACHINE_CPUFLAG */
-#define ALIGNMENT_UNIT          16
-#endif /* RTE_MACHINE_CPUFLAG */
+static uint8_t alignment_unit = 16;
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
@@ -100,20 +94,39 @@ static int
 init_buffers(void)
 {
 	unsigned i;
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F))
+		alignment_unit = 64;
+	else
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		alignment_unit = 32;
+	else
+#endif
+		alignment_unit = 16;
 
-	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_read = rte_malloc("memcpy",
+				    LARGE_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (large_buf_read == NULL)
 		goto error_large_buf_read;
 
-	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_write = rte_malloc("memcpy",
+				     LARGE_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (large_buf_write == NULL)
 		goto error_large_buf_write;
 
-	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_read = rte_malloc("memcpy",
+				    SMALL_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (small_buf_read == NULL)
 		goto error_small_buf_read;
 
-	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_write = rte_malloc("memcpy",
+				     SMALL_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (small_buf_write == NULL)
 		goto error_small_buf_write;
 
@@ -153,7 +166,7 @@ static inline size_t
 get_rand_offset(size_t uoffset)
 {
 	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
-			~(ALIGNMENT_UNIT - 1)) + uoffset;
+			~(alignment_unit - 1)) + uoffset;
 }
 
 /* Fill in source and destination addresses. */
@@ -321,7 +334,8 @@ perf_test(void)
 		   "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
 		   "------- -------------- -------------- -------------- --------------");
 
-	printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT);
+	printf("\n========================= %2dB aligned ============================",
+		alignment_unit);
 	/* Do aligned tests where size is a variable */
 	perf_test_variable_aligned();
 	printf("\n------- -------------- -------------- -------------- --------------");
-- 
2.7.4
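
The run-time choice made in init_buffers() above, and the masking done in
get_rand_offset(), can be sketched standalone as follows. This is only a
sketch under stated assumptions: pick_alignment_unit() and align_down() are
hypothetical names, and the GCC/clang builtin __builtin_cpu_supports() is
swapped in for rte_cpu_get_flag_enabled() so the example builds without DPDK.
On non-x86 targets it simply falls back to 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Pick 64, 32 or 16 depending on what the running CPU reports. */
static uint8_t pick_alignment_unit(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f"))
		return 64;
	if (__builtin_cpu_supports("avx2"))
		return 32;
#endif
	return 16;
}

/* Round offset down to a multiple of unit; unit must be a power of 2. */
static size_t align_down(size_t offset, uint8_t unit)
{
	assert(unit != 0 && (unit & (unit - 1)) == 0);
	return offset & ~((size_t)unit - 1);
}
```

Because the unit is now a variable rather than a macro, the mask is computed
at run time; widening to size_t before the subtraction keeps the complement
at full width.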


* [dpdk-dev] [PATCH v5 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-03 14:59   ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Xiaoyun Li
  2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
  2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
@ 2017-10-03 14:59     ` Xiaoyun Li
  2017-10-04 17:56     ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Ananyev, Konstantin
  2017-10-04 22:58     ` [dpdk-dev] [PATCH v6 " Xiaoyun Li
  4 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-03 14:59 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch compiles the x86 EFD file only if the compiler supports
AVX2, since the AVX2 lookup function is already chosen at run-time.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_efd/Makefile      |  6 +++
 lib/librte_efd/rte_efd_x86.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
 lib/librte_efd/rte_efd_x86.h | 48 +-----------------------
 3 files changed, 95 insertions(+), 46 deletions(-)
 create mode 100644 lib/librte_efd/rte_efd_x86.c

diff --git a/lib/librte_efd/Makefile b/lib/librte_efd/Makefile
index b9277bc..35bb2bd 100644
--- a/lib/librte_efd/Makefile
+++ b/lib/librte_efd/Makefile
@@ -44,6 +44,12 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_EFD) := rte_efd.c
 
+# if the compiler supports AVX2, add efd x86 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_efd_x86.c
+CFLAGS_rte_efd_x86.o += -mavx2
+endif
+
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_EFD)-include := rte_efd.h
 
diff --git a/lib/librte_efd/rte_efd_x86.c b/lib/librte_efd/rte_efd_x86.c
new file mode 100644
index 0000000..d2d1ac5
--- /dev/null
+++ b/lib/librte_efd/rte_efd_x86.c
@@ -0,0 +1,87 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2016-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/* rte_efd_x86.c
+ * This file holds all x86 specific EFD functions
+ */
+#include <rte_efd.h>
+#include <rte_efd_x86.h>
+
+#if (RTE_EFD_VALUE_NUM_BITS == 8 || RTE_EFD_VALUE_NUM_BITS == 16 || \
+	RTE_EFD_VALUE_NUM_BITS == 24 || RTE_EFD_VALUE_NUM_BITS == 32)
+#define EFD_LOAD_SI128(val) _mm_load_si128(val)
+#else
+#define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
+#endif
+
+efd_value_t
+efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b)
+{
+#ifdef CC_SUPPORT_AVX2
+	efd_value_t value = 0;
+	uint32_t i = 0;
+	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
+	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
+
+	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
+		__m256i vhash_idx =
+				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
+				(__m128i const *) &group_hash_idx[i]));
+		__m256i vlookup_table = _mm256_cvtepu16_epi32(
+				EFD_LOAD_SI128((__m128i const *)
+				&group_lookup_table[i]));
+		__m256i vhash = _mm256_add_epi32(vhash_val_a,
+				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
+		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
+				EFD_LOOKUPTBL_SHIFT);
+		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
+				vbucket_idx);
+
+		value |= (_mm256_movemask_ps(
+			(__m256) _mm256_slli_epi32(vresult, 31))
+			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
+	}
+
+	return value;
+#else
+	RTE_SET_USED(group_hash_idx);
+	RTE_SET_USED(group_lookup_table);
+	RTE_SET_USED(hash_val_a);
+	RTE_SET_USED(hash_val_b);
+	/* Return dummy value, only to avoid compilation breakage */
+	return 0;
+#endif
+
+}
diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
index 34f37d7..7a082aa 100644
--- a/lib/librte_efd/rte_efd_x86.h
+++ b/lib/librte_efd/rte_efd_x86.h
@@ -36,51 +36,7 @@
  */
 #include <immintrin.h>
 
-#if (RTE_EFD_VALUE_NUM_BITS == 8 || RTE_EFD_VALUE_NUM_BITS == 16 || \
-	RTE_EFD_VALUE_NUM_BITS == 24 || RTE_EFD_VALUE_NUM_BITS == 32)
-#define EFD_LOAD_SI128(val) _mm_load_si128(val)
-#else
-#define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
-#endif
-
-static inline efd_value_t
+extern efd_value_t
 efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
 		const efd_lookuptbl_t *group_lookup_table,
-		const uint32_t hash_val_a, const uint32_t hash_val_b)
-{
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
-	efd_value_t value = 0;
-	uint32_t i = 0;
-	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
-	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
-
-	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
-		__m256i vhash_idx =
-				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
-				(__m128i const *) &group_hash_idx[i]));
-		__m256i vlookup_table = _mm256_cvtepu16_epi32(
-				EFD_LOAD_SI128((__m128i const *)
-				&group_lookup_table[i]));
-		__m256i vhash = _mm256_add_epi32(vhash_val_a,
-				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
-		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
-				EFD_LOOKUPTBL_SHIFT);
-		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
-				vbucket_idx);
-
-		value |= (_mm256_movemask_ps(
-			(__m256) _mm256_slli_epi32(vresult, 31))
-			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
-	}
-
-	return value;
-#else
-	RTE_SET_USED(group_hash_idx);
-	RTE_SET_USED(group_lookup_table);
-	RTE_SET_USED(hash_val_a);
-	RTE_SET_USED(hash_val_b);
-	/* Return dummy value, only to avoid compilation breakage */
-	return 0;
-#endif
-
-}
+		const uint32_t hash_val_a, const uint32_t hash_val_b);
-- 
2.7.4


* Re: [dpdk-dev] [PATCH v5 0/3] run-time Linking support
  2017-10-03 14:59   ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Xiaoyun Li
                       ` (2 preceding siblings ...)
  2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
@ 2017-10-04 17:56     ` Ananyev, Konstantin
  2017-10-04 22:33       ` Li, Xiaoyun
  2017-10-04 22:58     ` [dpdk-dev] [PATCH v6 " Xiaoyun Li
  4 siblings, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-04 17:56 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

Hi Xiaoyun,

> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Tuesday, October 3, 2017 4:00 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun <xiaoyun.li@intel.com>
> Subject: [PATCH v5 0/3] run-time Linking support
> 
> This patchset dynamically selects functions at run-time based on CPU flags
> that current machine supports. This patchset modifies memcpy, memcpy perf
> test and x86 EFD, using function pointers and bind them at constructor time.
> Then in the cloud environment, users can compile once for the minimum target
> such as 'haswell'(not 'native') and run on different platforms (equal or above
> haswell) and can get ISA optimization based on running CPU.
> 
> Xiaoyun Li (3):
>   eal/x86: run-time dispatch over memcpy
>   app/test: run-time dispatch over memcpy perf test
>   efd: run-time dispatch over x86 EFD functions
> 
> ---
> v2
> * Use gcc function multi-versioning to avoid compilation issues.
> * Add macros for AVX512 and AVX2. Only if users enable AVX512 and the compiler
> supports it, the AVX512 codes would be compiled. Only if the compiler supports
> AVX2, the AVX2 codes would be compiled.
> 
> v3
> * Reduce function calls by keeping only rte_memcpy_xxx.
> * Add conditions that when copy size is small, use inline code path.
> Otherwise, use dynamic code path.
> * To support attribute target, clang version must be greater than 3.7.
> Otherwise, would choose SSE/AVX code path, the same as before.
> * Move two macro functions to the top of the code since they would be used in
> inline SSE/AVX and dynamic SSE/AVX codes.
> 
> v4
> * Modify rte_memcpy.h to several .c files and modify makefiles to compile
> AVX2 and AVX512 files.
> 
> v5
> * Delete redundant repeated codes of rte_memcpy_xxx.
> * Modify makefiles to enable reuse of existing rte_memcpy.
> * Delete redundant codes of rte_efd_x86.h in v4. Move it into .c file and enable
> compilation -mavx2 for it in makefile since it is already chosen at run-time.
> 

Generally looks good, just two things to fix below.
Konstantin

1. [dpdk-dev,v5,1/3] eal/x86: run-time dispatch over memcpy

Shared target build fails:
http://dpdk.org/ml/archives/test-report/2017-October/031032.html

I think you need to include rte_memcpy_ptr into the:
lib/librte_eal/linuxapp/eal/rte_eal_version.map
lib/librte_eal/bsdapp/eal/rte_eal_version.map
to fix it.

2. [dpdk-dev,v5,3/3] efd: run-time dispatch over x86 EFD functions

/lib/librte_efd/rte_efd_x86.c
....
+efd_value_t
+efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b)
+{
+#ifdef CC_SUPPORT_AVX2
+	efd_value_t value = 0;
+	uint32_t i = 0;
+	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
+	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
+
+	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
+		__m256i vhash_idx =
+				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
+				(__m128i const *) &group_hash_idx[i]));
+		__m256i vlookup_table = _mm256_cvtepu16_epi32(
+				EFD_LOAD_SI128((__m128i const *)
+				&group_lookup_table[i]));
+		__m256i vhash = _mm256_add_epi32(vhash_val_a,
+				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
+		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
+				EFD_LOOKUPTBL_SHIFT);
+		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
+				vbucket_idx);
+
+		value |= (_mm256_movemask_ps(
+			(__m256) _mm256_slli_epi32(vresult, 31))
+			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
+	}
+
+	return value;
+#else

We always build that file with the AVX2 option, so I think we can safely remove
the #ifdef CC_SUPPORT_AVX2 and the code below.

+	RTE_SET_USED(group_hash_idx);
+	RTE_SET_USED(group_lookup_table);
+	RTE_SET_USED(hash_val_a);
+	RTE_SET_USED(hash_val_b);
+	/* Return dummy value, only to avoid compilation breakage */
+	return 0;
+#endif
+
+}
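
For readers following the AVX2 loop quoted above, here is a minimal scalar sketch of what it appears to compute, one result bit per iteration. It is an illustration, not code from the patch: the name lookup_scalar, the 16-bit element types (inferred from the _mm256_cvtepu16_epi32 loads) and the values chosen for VALUE_NUM_BITS and LOOKUPTBL_SHIFT are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed stand-ins for RTE_EFD_VALUE_NUM_BITS and EFD_LOOKUPTBL_SHIFT. */
#define VALUE_NUM_BITS 8
#define LOOKUPTBL_SHIFT 28 /* keeps the top 4 bits of a 32-bit hash: 0..15 */

/* Hypothetical scalar counterpart of the quoted AVX2 lookup. */
static uint8_t
lookup_scalar(const uint16_t *hash_idx, const uint16_t *lookup_table,
		uint32_t hash_val_a, uint32_t hash_val_b)
{
	uint8_t value = 0;
	uint32_t i;

	assert(hash_idx != NULL && lookup_table != NULL);

	for (i = 0; i < VALUE_NUM_BITS; i++) {
		/* Same arithmetic as the add/mullo pair, modulo 2^32. */
		uint32_t hash = hash_val_a + (uint32_t)hash_idx[i] * hash_val_b;
		/* Top bits of the hash select a bit in the 16-bit table entry. */
		uint32_t bucket = hash >> LOOKUPTBL_SHIFT;
		uint32_t bit = ((uint32_t)lookup_table[i] >> bucket) & 1u;

		/* Entry i contributes bit i of the result. */
		value |= (uint8_t)(bit << i);
	}
	return value;
}
```

The vector version handles eight entries per iteration: the shift left by 31 moves each selected bit into the lane's sign position so that _mm256_movemask_ps can gather them, and the trailing mask drops lanes beyond the value width.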


>  lib/librte_eal/bsdapp/eal/Makefile                 |  19 +
>  .../common/include/arch/x86/rte_memcpy.c           |  59 ++
>  .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
>  .../common/include/arch/x86/rte_memcpy_avx2.c      |  44 +
>  .../common/include/arch/x86/rte_memcpy_avx512f.c   |  44 +
>  .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
>  .../common/include/arch/x86/rte_memcpy_sse.c       |  40 +
>  lib/librte_eal/linuxapp/eal/Makefile               |  19 +
>  lib/librte_efd/Makefile                            |   6 +
>  lib/librte_efd/rte_efd_x86.c                       |  87 ++
>  lib/librte_efd/rte_efd_x86.h                       |  48 +-
>  mk/rte.cpuflags.mk                                 |  14 +
>  test/test/test_memcpy_perf.c                       |  40 +-
>  13 files changed, 1285 insertions(+), 905 deletions(-)
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
>  create mode 100644 lib/librte_efd/rte_efd_x86.c
> 
> --
> 2.7.4

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v5 0/3] run-time Linking support
  2017-10-04 17:56     ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Ananyev, Konstantin
@ 2017-10-04 22:33       ` Li, Xiaoyun
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-04 22:33 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

OK. Will send it later. Many thanks!

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, October 5, 2017 01:56
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org
> Subject: RE: [PATCH v5 0/3] run-time Linking support
> 
> Hi Xiaoyun,
> 
> > -----Original Message-----
> > From: Li, Xiaoyun
> > Sent: Tuesday, October 3, 2017 4:00 PM
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> > Bruce <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun
> > <xiaoyun.li@intel.com>
> > Subject: [PATCH v5 0/3] run-time Linking support
> >
> > This patchset dynamically selects functions at run-time based on the CPU
> > flags that the current machine supports. It modifies memcpy, the memcpy
> > perf test and x86 EFD to use function pointers that are bound at
> > constructor time.
> > In a cloud environment, users can then compile once for the minimum
> > target, such as 'haswell' (not 'native'), run on different platforms
> > (haswell or above), and still get ISA optimization based on the running
> > CPU.
> >
> > Xiaoyun Li (3):
> >   eal/x86: run-time dispatch over memcpy
> >   app/test: run-time dispatch over memcpy perf test
> >   efd: run-time dispatch over x86 EFD functions
> >
> > ---
> > v2
> > * Use gcc function multi-versioning to avoid compilation issues.
> > * Add macros for AVX512 and AVX2. The AVX512 code is compiled only if
> > users enable AVX512 and the compiler supports it. The AVX2 code is
> > compiled only if the compiler supports AVX2.
> >
> > v3
> > * Reduce function calls by keeping only rte_memcpy_xxx.
> > * Add a condition: when the copy size is small, use the inline code path.
> > Otherwise, use the dynamic code path.
> > * To support attribute target, the clang version must be greater than 3.7.
> > Otherwise, the SSE/AVX code path is chosen, the same as before.
> > * Move two macro functions to the top of the code since they are used in
> > both the inline SSE/AVX and dynamic SSE/AVX code.
> >
> > v4
> > * Split rte_memcpy.h into several .c files and modify the makefiles to
> > compile the AVX2 and AVX512 files.
> >
> > v5
> > * Delete redundant, repeated rte_memcpy_xxx code.
> > * Modify the makefiles to enable reuse of the existing rte_memcpy.
> > * Delete the redundant rte_efd_x86.h code from v4. Move it into a .c file
> > and compile that file with -mavx2 in the makefile, since the function is
> > already chosen at run-time.
> >
> 
> Generally looks good, just two things to fix below.
> Konstantin
> 
> 1. [dpdk-dev,v5,1/3] eal/x86: run-time dispatch over memcpy
> 
> Shared target build fails:
> http://dpdk.org/ml/archives/test-report/2017-October/031032.html
> 
> I think you need to add rte_memcpy_ptr to:
> lib/librte_eal/linuxapp/eal/rte_eal_version.map
> lib/librte_eal/bsdapp/eal/rte_eal_version.map
> to fix it.
> 
> 2. [dpdk-dev,v5,3/3] efd: run-time dispatch over x86 EFD functions
> 
> /lib/librte_efd/rte_efd_x86.c
> ....
> +efd_value_t
> +efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> +		const efd_lookuptbl_t *group_lookup_table,
> +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> +{
> +#ifdef CC_SUPPORT_AVX2
> +	efd_value_t value = 0;
> +	uint32_t i = 0;
> +	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
> +	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
> +
> +	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
> +		__m256i vhash_idx =
> +				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
> +				(__m128i const *) &group_hash_idx[i]));
> +		__m256i vlookup_table = _mm256_cvtepu16_epi32(
> +				EFD_LOAD_SI128((__m128i const *)
> +				&group_lookup_table[i]));
> +		__m256i vhash = _mm256_add_epi32(vhash_val_a,
> +				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
> +		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
> +				EFD_LOOKUPTBL_SHIFT);
> +		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
> +				vbucket_idx);
> +
> +		value |= (_mm256_movemask_ps(
> +			(__m256) _mm256_slli_epi32(vresult, 31))
> +			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
> +	}
> +
> +	return value;
> +#else
> 
> We always build that file with the AVX2 option, so I think we can safely remove
> the #ifdef CC_SUPPORT_AVX2 and the code below.
> 
> +	RTE_SET_USED(group_hash_idx);
> +	RTE_SET_USED(group_lookup_table);
> +	RTE_SET_USED(hash_val_a);
> +	RTE_SET_USED(hash_val_b);
> +	/* Return dummy value, only to avoid compilation breakage */
> +	return 0;
> +#endif
> +
> +}
> 
> 
> >  lib/librte_eal/bsdapp/eal/Makefile                 |  19 +
> >  .../common/include/arch/x86/rte_memcpy.c           |  59 ++
> >  .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
> >  .../common/include/arch/x86/rte_memcpy_avx2.c      |  44 +
> >  .../common/include/arch/x86/rte_memcpy_avx512f.c   |  44 +
> >  .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
> >  .../common/include/arch/x86/rte_memcpy_sse.c       |  40 +
> >  lib/librte_eal/linuxapp/eal/Makefile               |  19 +
> >  lib/librte_efd/Makefile                            |   6 +
> >  lib/librte_efd/rte_efd_x86.c                       |  87 ++
> >  lib/librte_efd/rte_efd_x86.h                       |  48 +-
> >  mk/rte.cpuflags.mk                                 |  14 +
> >  test/test/test_memcpy_perf.c                       |  40 +-
> >  13 files changed, 1285 insertions(+), 905 deletions(-)
> >  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
> >  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
> >  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
> >  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
> >  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
> >  create mode 100644 lib/librte_efd/rte_efd_x86.c
> >
> > --
> > 2.7.4

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [dpdk-dev] [PATCH v6 0/3] run-time Linking support
  2017-10-03 14:59   ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Xiaoyun Li
                       ` (3 preceding siblings ...)
  2017-10-04 17:56     ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Ananyev, Konstantin
@ 2017-10-04 22:58     ` Xiaoyun Li
  2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
                         ` (3 more replies)
  4 siblings, 4 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-04 22:58 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patchset dynamically selects functions at run-time based on the CPU flags
that the current machine supports. It modifies memcpy, the memcpy perf test
and x86 EFD to use function pointers that are bound at constructor time.
In a cloud environment, users can then compile once for the minimum target,
such as 'haswell' (not 'native'), run on different platforms (haswell or
above), and still get ISA optimization based on the running CPU.
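
As a rough illustration of the mechanism described above, the following is a minimal sketch of binding a function pointer at constructor time. It is not the code from this series: the names copy_baseline, copy_avx2, copy_ptr and cpu_has_avx2 are hypothetical, both implementations simply call memcpy, and the CPU check uses the GCC/clang builtin __builtin_cpu_supports rather than DPDK's rte_cpu_get_flag_enabled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical per-ISA implementations; a real build would compile each
 * one in its own file with the matching -m flag. */
static void *
copy_baseline(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *
copy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Bound once before main() runs, then used for every call. */
static copy_fn copy_ptr;

static int
cpu_has_avx2(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx2");
#else
	return 0; /* assumption: non-x86 targets fall back to the baseline */
#endif
}

static void __attribute__((constructor))
copy_init(void)
{
	copy_ptr = cpu_has_avx2() ? copy_avx2 : copy_baseline;
}

static void *
copy_dispatch(void *dst, const void *src, size_t n)
{
	assert(copy_ptr != NULL);
	return copy_ptr(dst, src, n);
}
```

The binary itself only needs the minimum target; the wider implementation is reached only after the run-time check succeeds.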

Xiaoyun Li (3):
  eal/x86: run-time dispatch over memcpy
  app/test: run-time dispatch over memcpy perf test
  efd: run-time dispatch over x86 EFD functions

---
v2
* Use gcc function multi-versioning to avoid compilation issues.
* Add macros for AVX512 and AVX2. The AVX512 code is compiled only if users
enable AVX512 and the compiler supports it. The AVX2 code is compiled only if
the compiler supports AVX2.

v3
* Reduce function calls by keeping only rte_memcpy_xxx.
* Add a condition: when the copy size is small, use the inline code path.
Otherwise, use the dynamic code path.
* To support attribute target, the clang version must be greater than 3.7.
Otherwise, the SSE/AVX code path is chosen, the same as before.
* Move two macro functions to the top of the code since they are used in both
the inline SSE/AVX and dynamic SSE/AVX code.

v4
* Split rte_memcpy.h into several .c files and modify the makefiles to compile
the AVX2 and AVX512 files.

v5
* Delete redundant, repeated rte_memcpy_xxx code.
* Modify the makefiles to enable reuse of the existing rte_memcpy.
* Delete the redundant rte_efd_x86.h code from v4. Move it into a .c file and
compile that file with -mavx2 in the makefile, since the function is already
chosen at run-time.

v6
* Fix shared target build failure.
* Remove the redundant efd x86 avx2 code, which is safe since the file is
compiled with -mavx2.
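
The small-copy/large-copy split introduced in v3 can be sketched roughly as follows. This is an illustration under assumptions, not the patch's code: COPY_THRESH mirrors the value of RTE_X86_MEMCPY_THRESH (128), the names copy, copy_small_inline and copy_large are hypothetical, the inclusive boundary is a guess, and dyn_calls exists only to make the chosen path observable.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COPY_THRESH 128 /* same value as RTE_X86_MEMCPY_THRESH */

/* Counts trips through the function pointer, for illustration only. */
static unsigned long dyn_calls;

/* Hypothetical stand-in for the run-time selected implementation. */
static void *
copy_large(void *dst, const void *src, size_t n)
{
	dyn_calls++;
	return memcpy(dst, src, n);
}

static void *(*copy_ptr)(void *dst, const void *src, size_t n) = copy_large;

/* Small copies stay inline and avoid the indirect call. */
static inline void *
copy_small_inline(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;
	size_t i;

	assert(n <= COPY_THRESH);
	for (i = 0; i < n; i++)
		d[i] = s[i];
	return dst;
}

static inline void *
copy(void *dst, const void *src, size_t n)
{
	if (n <= COPY_THRESH) /* assumption: boundary is inclusive */
		return copy_small_inline(dst, src, n);
	return copy_ptr(dst, src, n);
}
```

The trade-off is that small copies keep the benefit of inlining, while larger ones pay for one indirect call in exchange for the best instruction set available on the running CPU.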

 lib/librte_eal/bsdapp/eal/Makefile                 |  19 +
 lib/librte_eal/bsdapp/eal/rte_eal_version.map      |   1 +
 .../common/include/arch/x86/rte_memcpy.c           |  59 ++
 .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
 .../common/include/arch/x86/rte_memcpy_avx2.c      |  44 +
 .../common/include/arch/x86/rte_memcpy_avx512f.c   |  44 +
 .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
 .../common/include/arch/x86/rte_memcpy_sse.c       |  40 +
 lib/librte_eal/linuxapp/eal/Makefile               |  19 +
 lib/librte_eal/linuxapp/eal/rte_eal_version.map    |   1 +
 lib/librte_efd/Makefile                            |   6 +
 lib/librte_efd/rte_efd_x86.c                       |  87 ++
 lib/librte_efd/rte_efd_x86.h                       |  48 +-
 mk/rte.cpuflags.mk                                 |  14 +
 test/test/test_memcpy_perf.c                       |  40 +-
 15 files changed, 1287 insertions(+), 905 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
 create mode 100644 lib/librte_efd/rte_efd_x86.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-04 22:58     ` [dpdk-dev] [PATCH v6 " Xiaoyun Li
@ 2017-10-04 22:58       ` Xiaoyun Li
  2017-10-05  9:37         ` Ananyev, Konstantin
  2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
                         ` (2 subsequent siblings)
  3 siblings, 1 reply; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-04 22:58 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch dynamically selects memcpy functions at run-time based
on the CPU flags that the current machine supports. It uses function
pointers which are bound to the corresponding functions at constructor time.
In addition, the AVX512 instruction set is compiled only if users
enable it in the config and the compiler supports it.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_eal/bsdapp/eal/Makefile                 |  19 +
 lib/librte_eal/bsdapp/eal/rte_eal_version.map      |   1 +
 .../common/include/arch/x86/rte_memcpy.c           |  59 ++
 .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
 .../common/include/arch/x86/rte_memcpy_avx2.c      |  44 +
 .../common/include/arch/x86/rte_memcpy_avx512f.c   |  44 +
 .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
 .../common/include/arch/x86/rte_memcpy_sse.c       |  40 +
 lib/librte_eal/linuxapp/eal/Makefile               |  19 +
 lib/librte_eal/linuxapp/eal/rte_eal_version.map    |   1 +
 mk/rte.cpuflags.mk                                 |  14 +
 11 files changed, 1165 insertions(+), 846 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c

diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile
index 005019e..1dcd2e3 100644
--- a/lib/librte_eal/bsdapp/eal/Makefile
+++ b/lib/librte_eal/bsdapp/eal/Makefile
@@ -36,6 +36,7 @@ LIB = librte_eal.a
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
 
 CFLAGS += -I$(SRCDIR)/include
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
@@ -93,6 +94,24 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/bsdapp/eal/rte_eal_version.map b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
index 47a09ea..e46c3e5 100644
--- a/lib/librte_eal/bsdapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
@@ -236,5 +236,6 @@ EXPERIMENTAL {
 	rte_service_runstate_set;
 	rte_service_set_stats_enable;
 	rte_service_start_with_defaults;
+	rte_memcpy_ptr;
 
 } DPDK_17.08;
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
new file mode 100644
index 0000000..74ae702
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
@@ -0,0 +1,59 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+#include <rte_cpuflags.h>
+#include <rte_log.h>
+
+void *(*rte_memcpy_ptr)(void *dst, const void *src, size_t n) = NULL;
+
+static void __attribute__((constructor))
+rte_memcpy_init(void)
+{
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
+		rte_memcpy_ptr = rte_memcpy_avx512f;
+		RTE_LOG(DEBUG, EAL, "AVX512 memcpy is using!\n");
+		return;
+	}
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
+		rte_memcpy_ptr = rte_memcpy_avx2;
+		RTE_LOG(DEBUG, EAL, "AVX2 memcpy is using!\n");
+		return;
+	}
+#endif
+	rte_memcpy_ptr = rte_memcpy_sse;
+	RTE_LOG(DEBUG, EAL, "Default SSE/AVX memcpy is using!\n");
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 74c280c..460dcdb 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
  *   All rights reserved.
  *
  *   Redistribution and use in source and binary forms, with or without
@@ -34,867 +34,36 @@
 #ifndef _RTE_MEMCPY_X86_64_H_
 #define _RTE_MEMCPY_X86_64_H_
 
-/**
- * @file
- *
- * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
- */
-
-#include <stdio.h>
-#include <stdint.h>
-#include <string.h>
-#include <rte_vect.h>
-#include <rte_common.h>
+#include <rte_memcpy_internal.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-/**
- * Copy bytes from one location to another. The locations must not overlap.
- *
- * @note This is implemented as a macro, so it's address should not be taken
- * and care is needed as parameter expressions may be evaluated multiple times.
- *
- * @param dst
- *   Pointer to the destination of the data.
- * @param src
- *   Pointer to the source data.
- * @param n
- *   Number of bytes to copy.
- * @return
- *   Pointer to the destination data.
- */
-static __rte_always_inline void *
-rte_memcpy(void *dst, const void *src, size_t n);
-
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define RTE_X86_MEMCPY_THRESH 128
 
-#define ALIGNMENT_MASK 0x3F
+extern void *
+(*rte_memcpy_ptr)(void *dst, const void *src, size_t n);
 
 /**
- * AVX512 implementation below
+ * Different implementations of memcpy.
  */
+extern void*
+rte_memcpy_avx512f(void *dst, const void *src, size_t n);
 
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
+extern void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n);
 
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	__m512i zmm0;
-
-	zmm0 = _mm512_loadu_si512((const void *)src);
-	_mm512_storeu_si512((void *)dst, zmm0);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-	rte_mov64(dst + 2 * 64, src + 2 * 64);
-	rte_mov64(dst + 3 * 64, src + 3 * 64);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1;
-
-	while (n >= 128) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 128;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		src = src + 128;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		dst = dst + 128;
-	}
-}
-
-/**
- * Copy 512-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
-
-	while (n >= 512) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 512;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
-		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
-		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
-		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
-		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
-		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
-		src = src + 512;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
-		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
-		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
-		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
-		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
-		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
-		dst = dst + 512;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				  (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				  (const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK63:
-		if (n > 64) {
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-			return ret;
-		}
-		if (n > 0)
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes
-	 */
-	dstofss = ((uintptr_t)dst & 0x3F);
-	if (dstofss > 0) {
-		dstofss = 64 - dstofss;
-		n -= dstofss;
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 512-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 511;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy 128-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	if (n >= 128) {
-		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-		bits = n;
-		n = n & 127;
-		bits -= n;
-		src = (const uint8_t *)src + bits;
-		dst = (uint8_t *)dst + bits;
-	}
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK63;
-}
-
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-
-#define ALIGNMENT_MASK 0x1F
-
-/**
- * AVX2 implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
-
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
-	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m256i ymm0, ymm1, ymm2, ymm3;
-
-	while (n >= 128) {
-		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 128;
-		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
-		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
-		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
-		src = (const uint8_t *)src + 128;
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
-		dst = (uint8_t *)dst + 128;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 256 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 256) {
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK31:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-		if (n > 32) {
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 256 bytes
-	 */
-	dstofss = (uintptr_t)dst & 0x1F;
-	if (dstofss > 0) {
-		dstofss = 32 - dstofss;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 128-byte blocks
-	 */
-	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 127;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK31;
-}
-
-#else /* RTE_MACHINE_CPUFLAG */
-
-#define ALIGNMENT_MASK 0x0F
-
-/**
- * SSE & AVX implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
-	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
-	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
-	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
-	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
-	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
-	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
-	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
-}
-
-/**
- * Macro for copying unaligned block from one location to another with constant load offset,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be immediate value within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
-__extension__ ({                                                                                            \
-    int tmp;                                                                                                \
-    while (len >= 128 + 16 - offset) {                                                                      \
-        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
-        len -= 128;                                                                                         \
-        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
-        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
-        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
-        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
-        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
-        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
-        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
-        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
-        src = (const uint8_t *)src + 128;                                                                   \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
-        dst = (uint8_t *)dst + 128;                                                                         \
-    }                                                                                                       \
-    tmp = len;                                                                                              \
-    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
-    tmp -= len;                                                                                             \
-    src = (const uint8_t *)src + tmp;                                                                       \
-    dst = (uint8_t *)dst + tmp;                                                                             \
-    if (len >= 32 + 16 - offset) {                                                                          \
-        while (len >= 32 + 16 - offset) {                                                                   \
-            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
-            len -= 32;                                                                                      \
-            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
-            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
-            src = (const uint8_t *)src + 32;                                                                \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
-            dst = (uint8_t *)dst + 32;                                                                      \
-        }                                                                                                   \
-        tmp = len;                                                                                          \
-        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
-        tmp -= len;                                                                                         \
-        src = (const uint8_t *)src + tmp;                                                                   \
-        dst = (uint8_t *)dst + tmp;                                                                         \
-    }                                                                                                       \
-})
-
-/**
- * Macro for copying unaligned block from one location to another,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Use switch here because the aligning instruction requires immediate value for shift count.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
-__extension__ ({                                                      \
-    switch (offset) {                                                 \
-    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
-    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
-    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
-    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
-    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
-    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
-    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
-    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
-    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
-    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
-    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
-    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
-    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
-    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
-    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
-    default:;                                                         \
-    }                                                                 \
-})
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t srcofs;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 128) {
-		goto COPY_BLOCK_128_BACK15;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-COPY_BLOCK_255_BACK15:
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK15:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-COPY_BLOCK_64_BACK15:
-		if (n >= 32) {
-			n -= 32;
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 32;
-			dst = (uint8_t *)dst + 32;
-		}
-		if (n > 16) {
-			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes,
-	 * and make sure the first 15 bytes are copied, because
-	 * unaligned copy functions require up to 15 bytes
-	 * backwards access.
-	 */
-	dstofss = (uintptr_t)dst & 0x0F;
-	if (dstofss > 0) {
-		dstofss = 16 - dstofss + 16;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-	srcofs = ((uintptr_t)src & 0x0F);
-
-	/**
-	 * For aligned copy
-	 */
-	if (srcofs == 0) {
-		/**
-		 * Copy 256-byte blocks
-		 */
-		for (; n >= 256; n -= 256) {
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			dst = (uint8_t *)dst + 256;
-			src = (const uint8_t *)src + 256;
-		}
-
-		/**
-		 * Copy whatever left
-		 */
-		goto COPY_BLOCK_255_BACK15;
-	}
-
-	/**
-	 * For copy with unaligned load
-	 */
-	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_64_BACK15;
-}
-
-#endif /* RTE_MACHINE_CPUFLAG */
-
-static inline void *
-rte_memcpy_aligned(void *dst, const void *src, size_t n)
-{
-	void *ret = dst;
-
-	/* Copy size <= 16 bytes */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dst = *(const uint8_t *)src;
-			src = (const uint8_t *)src + 1;
-			dst = (uint8_t *)dst + 1;
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dst = *(const uint16_t *)src;
-			src = (const uint16_t *)src + 1;
-			dst = (uint16_t *)dst + 1;
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dst = *(const uint32_t *)src;
-			src = (const uint32_t *)src + 1;
-			dst = (uint32_t *)dst + 1;
-		}
-		if (n & 0x08)
-			*(uint64_t *)dst = *(const uint64_t *)src;
-
-		return ret;
-	}
-
-	/* Copy 16 <= size <= 32 bytes */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-
-		return ret;
-	}
-
-	/* Copy 32 < size <= 64 bytes */
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-
-		return ret;
-	}
-
-	/* Copy 64 bytes blocks */
-	for (; n >= 64; n -= 64) {
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;
-	}
-
-	/* Copy whatever left */
-	rte_mov64((uint8_t *)dst - 64 + n,
-			(const uint8_t *)src - 64 + n);
-
-	return ret;
-}
+extern void *
+rte_memcpy_sse(void *dst, const void *src, size_t n);
 
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
-	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-		return rte_memcpy_aligned(dst, src, n);
+	if (n <= RTE_X86_MEMCPY_THRESH)
+		return rte_memcpy_internal(dst, src, n);
 	else
-		return rte_memcpy_generic(dst, src, n);
+		return (*rte_memcpy_ptr)(dst, src, n);
 }
 
 #ifdef __cplusplus
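
Note on the rte_memcpy() hunk above: copies up to RTE_X86_MEMCPY_THRESH stay on the inline path, and larger ones go through the rte_memcpy_ptr function pointer, which the cover letter says is bound at constructor time. A minimal sketch of that pattern follows. Every demo_* name and the DEMO_MEMCPY_THRESH value are hypothetical. The GCC/clang builtin __builtin_cpu_supports() is used here only as a stand-in for whatever CPU-flag check the patch performs, and the per-ISA functions simply forward to libc memcpy().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold; the value of RTE_X86_MEMCPY_THRESH is not shown in this excerpt. */
#define DEMO_MEMCPY_THRESH 128

typedef void *(*demo_memcpy_fn)(void *dst, const void *src, size_t n);

/* Stand-ins for the per-ISA builds; here they just forward to libc memcpy(). */
static void *
demo_memcpy_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *
demo_memcpy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Pick an implementation from a capability flag. */
static demo_memcpy_fn
demo_memcpy_select(int has_avx2)
{
	return has_avx2 ? demo_memcpy_avx2 : demo_memcpy_sse;
}

/* Defaults to the baseline so the pointer is never NULL before init runs. */
static demo_memcpy_fn demo_memcpy_ptr = demo_memcpy_sse;

/* Runs before main(): bind the pointer once, based on the running CPU. */
static void __attribute__((constructor))
demo_memcpy_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	demo_memcpy_ptr = demo_memcpy_select(__builtin_cpu_supports("avx2") != 0);
#else
	demo_memcpy_ptr = demo_memcpy_select(0);
#endif
}

/* Small copies stay inline; large copies pay one indirect call. */
static inline void *
demo_memcpy(void *dst, const void *src, size_t n)
{
	if (n <= DEMO_MEMCPY_THRESH)
		return memcpy(dst, src, n);
	assert(demo_memcpy_ptr != NULL);
	return (*demo_memcpy_ptr)(dst, src, n);
}
```

The size check is presumably there because an indirect call costs more, relative to the copy itself, when the copy is small. That matches the v3 changelog item about using the inline code path for small sizes.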
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
new file mode 100644
index 0000000..3ad229c
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX2
+#error RTE_MACHINE_CPUFLAG_AVX2 not defined
+#endif
+
+void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
new file mode 100644
index 0000000..be8d964
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX512F
+#error RTE_MACHINE_CPUFLAG_AVX512F not defined
+#endif
+
+void *
+rte_memcpy_avx512f(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
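
Note on the rte_memcpy_internal.h hunk that follows: the short-size branches (for example "n <= 32") copy one fixed-width block from the start of the buffer and a second one that ends exactly at byte n. The two blocks may overlap, so no per-byte tail loop is needed. A minimal sketch of the idea, assuming a hypothetical demo_mov16() built on libc memcpy() rather than the SSE intrinsics the patch uses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rte_mov16(): copy exactly 16 bytes. */
static inline void
demo_mov16(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 16);
}

/*
 * Copy n bytes, 16 <= n <= 32, with two 16-byte moves.
 * The second move ends at byte n and overlaps the first when n < 32.
 * The offset is written as (n - 16) so that no intermediate pointer
 * falls before the start of the buffer.
 */
static inline void *
demo_copy_16_to_32(void *dst, const void *src, size_t n)
{
	assert(n >= 16 && n <= 32);
	demo_mov16((uint8_t *)dst, (const uint8_t *)src);
	demo_mov16((uint8_t *)dst + (n - 16), (const uint8_t *)src + (n - 16));
	return dst;
}
```

Because source and destination must not overlap each other, rewriting the overlapping middle bytes with the same values is harmless.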
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
new file mode 100644
index 0000000..d17fb5b
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
@@ -0,0 +1,909 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCPY_INTERNAL_X86_64_H_
+#define _RTE_MEMCPY_INTERNAL_X86_64_H_
+
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <rte_vect.h>
+#include <rte_common.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+#define ALIGNMENT_MASK 0x3F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_loadu_si512((const void *)src);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+	rte_mov64(dst + 2 * 64, src + 2 * 64);
+	rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1;
+
+	while (n >= 128) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 128;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		src = src + 128;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		dst = dst + 128;
+	}
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+	while (n >= 512) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 512;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+		src = src + 512;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+		dst = dst + 512;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				  (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				  (const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK63:
+		if (n > 64) {
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+			return ret;
+		}
+		if (n > 0)
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
+	 */
+	dstofss = ((uintptr_t)dst & 0x3F);
+	if (dstofss > 0) {
+		dstofss = 64 - dstofss;
+		n -= dstofss;
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 512-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 511;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 128-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 128) {
+		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+	}
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK63;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+#define ALIGNMENT_MASK 0x1F
+
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1, ymm2, ymm3;
+
+	while (n >= 128) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 0 * 32));
+		n -= 128;
+		ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 1 * 32));
+		ymm2 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 2 * 32));
+		ymm3 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 3 * 32));
+		src = (const uint8_t *)src + 128;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
+		dst = (uint8_t *)dst + 128;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 256 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 256) {
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK31:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+		if (n > 32) {
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 256 bytes
+	 */
+	dstofss = (uintptr_t)dst & 0x1F;
+	if (dstofss > 0) {
+		dstofss = 32 - dstofss;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 128-byte blocks
+	 */
+	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 127;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK31;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+#define ALIGNMENT_MASK 0x0F
+
+/**
+ * SSE & AVX implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
+	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
+	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
+	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
+	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
+	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
+	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
+	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
+}
+
+/**
+ * Macro for copying unaligned block from one location to another with constant load offset,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be an immediate value within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
+__extension__ ({                                                                                            \
+    int tmp;                                                                                                \
+    while (len >= 128 + 16 - offset) {                                                                      \
+        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
+        len -= 128;                                                                                         \
+        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
+        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
+        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
+        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
+        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
+        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
+        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
+        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
+        src = (const uint8_t *)src + 128;                                                                   \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
+        dst = (uint8_t *)dst + 128;                                                                         \
+    }                                                                                                       \
+    tmp = len;                                                                                              \
+    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
+    tmp -= len;                                                                                             \
+    src = (const uint8_t *)src + tmp;                                                                       \
+    dst = (uint8_t *)dst + tmp;                                                                             \
+    if (len >= 32 + 16 - offset) {                                                                          \
+        while (len >= 32 + 16 - offset) {                                                                   \
+            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
+            len -= 32;                                                                                      \
+            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
+            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
+            src = (const uint8_t *)src + 32;                                                                \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
+            dst = (uint8_t *)dst + 32;                                                                      \
+        }                                                                                                   \
+        tmp = len;                                                                                          \
+        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
+        tmp -= len;                                                                                         \
+        src = (const uint8_t *)src + tmp;                                                                   \
+        dst = (uint8_t *)dst + tmp;                                                                         \
+    }                                                                                                       \
+})
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
+__extension__ ({                                                      \
+    switch (offset) {                                                 \
+    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
+    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
+    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
+    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
+    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
+    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
+    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
+    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
+    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
+    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
+    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
+    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
+    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
+    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
+    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
+    default:;                                                         \
+    }                                                                 \
+})
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t srcofs;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 128)
+		goto COPY_BLOCK_128_BACK15;
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov128((uint8_t *)dst + 128,
+					(const uint8_t *)src + 128);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+COPY_BLOCK_255_BACK15:
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK15:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK15:
+		if (n >= 32) {
+			n -= 32;
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 32;
+			dst = (uint8_t *)dst + 32;
+		}
+		if (n > 16) {
+			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes,
+	 * and make sure the first 15 bytes are copied, because
+	 * unaligned copy functions require up to 15 bytes
+	 * backwards access.
+	 */
+	dstofss = (uintptr_t)dst & 0x0F;
+	if (dstofss > 0) {
+		dstofss = 16 - dstofss + 16;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+	srcofs = ((uintptr_t)src & 0x0F);
+
+	/**
+	 * For aligned copy
+	 */
+	if (srcofs == 0) {
+		/**
+		 * Copy 256-byte blocks
+		 */
+		for (; n >= 256; n -= 256) {
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			dst = (uint8_t *)dst + 256;
+			src = (const uint8_t *)src + 256;
+		}
+
+		/**
+		 * Copy whatever is left
+		 */
+		goto COPY_BLOCK_255_BACK15;
+	}
+
+	/**
+	 * For copy with unaligned load
+	 */
+	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_64_BACK15;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+static inline void *
+rte_memcpy_aligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+
+	/* Copy size < 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return ret;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return ret;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return ret;
+	}
+
+	/* Copy 64-byte blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever is left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+
+	return ret;
+}
+
+static inline void *
+rte_memcpy_internal(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+		return rte_memcpy_aligned(dst, src, n);
+	else
+		return rte_memcpy_generic(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCPY_INTERNAL_X86_64_H_ */
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
new file mode 100644
index 0000000..55d6b41
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
@@ -0,0 +1,40 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+void *
+rte_memcpy_sse(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 90bca4d..c8bdac0 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -40,6 +40,7 @@ VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
 LIBABIVER := 5
 
 VPATH += $(RTE_SDK)/lib/librte_eal/common
+VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
 
 CFLAGS += -I$(SRCDIR)/include
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
@@ -105,6 +106,24 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
index 8c08b8d..15a2fe9 100644
--- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
@@ -241,5 +241,6 @@ EXPERIMENTAL {
 	rte_service_runstate_set;
 	rte_service_set_stats_enable;
 	rte_service_start_with_defaults;
+	rte_memcpy_ptr;
 
 } DPDK_17.08;
diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index a813c91..8a7a1e7 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -134,6 +134,20 @@ endif
 
 MACHINE_CFLAGS += $(addprefix -DRTE_MACHINE_CPUFLAG_,$(CPUFLAGS))
 
+# Check if the compiler supports AVX512
+CC_SUPPORT_AVX512F := $(shell $(CC) -mavx512f -dM -E - < /dev/null 2>&1 | grep -q AVX512 && echo 1)
+ifeq ($(CC_SUPPORT_AVX512F),1)
+ifeq ($(CONFIG_RTE_ENABLE_AVX512),y)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX512F
+endif
+endif
+
+# Check if the compiler supports AVX2
+CC_SUPPORT_AVX2 := $(shell $(CC) -mavx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
+ifeq ($(CC_SUPPORT_AVX2),1)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX2
+endif
+
 # To strip whitespace
 comma:= ,
 empty:=
-- 
2.7.4
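
The dispatch scheme this patch builds (per-ISA copies of the same routine
compiled in separate files, with a function pointer bound once at constructor
time) can be sketched in isolation as below. This is a minimal sketch, not the
patch's code: every name in it (copy_fn_t, copy_sse, copy_avx2, copy_ptr,
copy_select, copy_dispatch) is hypothetical, the two variants are plain
memcpy stand-ins, and the CPU check uses the GCC/clang builtin
__builtin_cpu_supports() so that the example builds without DPDK.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; the symbol the patch exports is rte_memcpy_ptr. */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/*
 * Stand-ins for per-ISA builds of the same routine. In the patch each
 * variant lives in its own .c file and gets its own -m flag from the Makefile.
 */
static void *copy_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Default to the baseline so a call made before the constructor runs is safe. */
static copy_fn_t copy_ptr = copy_sse;

/* Runs once before main(); picks the best variant the running CPU supports. */
__attribute__((constructor))
static void copy_select(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		copy_ptr = copy_avx2;
#endif
}

/* Callers pay one indirect call; the choice itself is made only once. */
static inline void *copy_dispatch(void *dst, const void *src, size_t n)
{
	return copy_ptr(dst, src, n);
}
```

Keeping each variant in its own translation unit is what lets the rest of the
library stay at the minimum target ISA while only that one file is compiled
with the wider instruction set.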


* [dpdk-dev] [PATCH v6 2/3] app/test: run-time dispatch over memcpy perf test
  2017-10-04 22:58     ` [dpdk-dev] [PATCH v6 " Xiaoyun Li
  2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-04 22:58       ` Xiaoyun Li
  2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
  2017-10-05 12:33       ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Xiaoyun Li
  3 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-04 22:58 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch moves the selection of the alignment unit from build time
to run time, based on the CPU flags that the running machine supports.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 test/test/test_memcpy_perf.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/test/test/test_memcpy_perf.c b/test/test/test_memcpy_perf.c
index ff3aaaa..33def3b 100644
--- a/test/test/test_memcpy_perf.c
+++ b/test/test/test_memcpy_perf.c
@@ -79,13 +79,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
-#define ALIGNMENT_UNIT          64
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-#define ALIGNMENT_UNIT          32
-#else /* RTE_MACHINE_CPUFLAG */
-#define ALIGNMENT_UNIT          16
-#endif /* RTE_MACHINE_CPUFLAG */
+static uint8_t alignment_unit = 16;
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
@@ -100,20 +94,39 @@ static int
 init_buffers(void)
 {
 	unsigned i;
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F))
+		alignment_unit = 64;
+	else
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		alignment_unit = 32;
+	else
+#endif
+		alignment_unit = 16;
 
-	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_read = rte_malloc("memcpy",
+				    LARGE_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (large_buf_read == NULL)
 		goto error_large_buf_read;
 
-	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_write = rte_malloc("memcpy",
+				     LARGE_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (large_buf_write == NULL)
 		goto error_large_buf_write;
 
-	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_read = rte_malloc("memcpy",
+				    SMALL_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (small_buf_read == NULL)
 		goto error_small_buf_read;
 
-	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_write = rte_malloc("memcpy",
+				     SMALL_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (small_buf_write == NULL)
 		goto error_small_buf_write;
 
@@ -153,7 +166,7 @@ static inline size_t
 get_rand_offset(size_t uoffset)
 {
 	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
-			~(ALIGNMENT_UNIT - 1)) + uoffset;
+			~(alignment_unit - 1)) + uoffset;
 }
 
 /* Fill in source and destination addresses. */
@@ -321,7 +334,8 @@ perf_test(void)
 		   "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
 		   "------- -------------- -------------- -------------- --------------");
 
-	printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT);
+	printf("\n========================= %2dB aligned ============================",
+		alignment_unit);
 	/* Do aligned tests where size is a variable */
 	perf_test_variable_aligned();
 	printf("\n------- -------------- -------------- -------------- --------------");
-- 
2.7.4
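
The two pieces of arithmetic this patch turns into run-time values, choosing
the alignment unit and rounding a random offset down to it, can be sketched
as below. This is an illustration only: pick_alignment() and align_down() are
hypothetical helpers, and the two int arguments stand in for the results of
rte_cpu_get_flag_enabled(). The mask trick assumes the unit is a power of two.

```c
#include <stddef.h>

/* Widest vector the CPU reports decides the unit: 64 (AVX512F), 32 (AVX2), else 16. */
static size_t pick_alignment(int has_avx512f, int has_avx2)
{
	if (has_avx512f)
		return 64;
	if (has_avx2)
		return 32;
	return 16;
}

/* Round value down to a multiple of unit; unit must be a power of two. */
static size_t align_down(size_t value, size_t unit)
{
	return value & ~(unit - 1);
}
```

With the unit held in a variable instead of a macro, the same test binary
reports 16B, 32B or 64B alignment depending on the machine it runs on.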


* [dpdk-dev] [PATCH v6 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-04 22:58     ` [dpdk-dev] [PATCH v6 " Xiaoyun Li
  2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
  2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
@ 2017-10-04 22:58       ` Xiaoyun Li
  2017-10-05  9:40         ` Ananyev, Konstantin
  2017-10-05 12:33       ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Xiaoyun Li
  3 siblings, 1 reply; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-04 22:58 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch compiles the x86 EFD file only if the compiler supports
AVX2, since the AVX2 lookup is already selected at run time.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_efd/Makefile      |  6 +++
 lib/librte_efd/rte_efd_x86.c | 87 ++++++++++++++++++++++++++++++++++++++++++++
 lib/librte_efd/rte_efd_x86.h | 48 +-----------------------
 3 files changed, 95 insertions(+), 46 deletions(-)
 create mode 100644 lib/librte_efd/rte_efd_x86.c
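
For readers following the AVX2 lookup that this patch moves into
rte_efd_x86.c, the per-lane arithmetic can be written as a scalar loop. This
is a sketch derived from the intrinsics in the diff, not DPDK's own scalar
path: efd_lookup_scalar_sketch() is a hypothetical name, the element types
are taken as 16-bit because the vector code widens them with
_mm256_cvtepu16_epi32, and num_bits and tbl_shift are passed as parameters
(assumed to be at most 32 and less than 32 respectively) instead of using
RTE_EFD_VALUE_NUM_BITS and EFD_LOOKUPTBL_SHIFT.

```c
#include <stddef.h>
#include <stdint.h>

/* One iteration per value bit; the AVX2 version handles eight of these at once. */
static uint32_t
efd_lookup_scalar_sketch(const uint16_t *hash_idx, const uint16_t *lookup_table,
		unsigned int num_bits, unsigned int tbl_shift,
		uint32_t hash_val_a, uint32_t hash_val_b)
{
	uint32_t value = 0;
	unsigned int i;

	for (i = 0; i < num_bits; i++) {
		/* Same as _mm256_add_epi32(a, _mm256_mullo_epi32(idx, b)): wraps mod 2^32. */
		uint32_t h = hash_val_a + (uint32_t)hash_idx[i] * hash_val_b;
		/* Same as _mm256_srli_epi32: the top bits select a bucket. */
		uint32_t bucket = h >> tbl_shift;
		/* Same as _mm256_srlv_epi32, which yields 0 for shift counts >= 32. */
		uint32_t shifted = (bucket < 32) ?
				((uint32_t)lookup_table[i] >> bucket) : 0;

		/* The vector code moves bit 0 to the sign bit and collects it with movemask. */
		value |= (shifted & 1u) << i;
	}

	return value;
}
```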

diff --git a/lib/librte_efd/Makefile b/lib/librte_efd/Makefile
index b9277bc..35bb2bd 100644
--- a/lib/librte_efd/Makefile
+++ b/lib/librte_efd/Makefile
@@ -44,6 +44,12 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_EFD) := rte_efd.c
 
+# if the compiler supports AVX2, add efd x86 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_efd_x86.c
+CFLAGS_rte_efd_x86.o += -mavx2
+endif
+
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_EFD)-include := rte_efd.h
 
diff --git a/lib/librte_efd/rte_efd_x86.c b/lib/librte_efd/rte_efd_x86.c
new file mode 100644
index 0000000..d2d1ac5
--- /dev/null
+++ b/lib/librte_efd/rte_efd_x86.c
@@ -0,0 +1,87 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2016-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/* rte_efd_x86.c
+ * This file holds all x86 specific EFD functions
+ */
+#include <rte_efd.h>
+#include <rte_efd_x86.h>
+
+#if (RTE_EFD_VALUE_NUM_BITS == 8 || RTE_EFD_VALUE_NUM_BITS == 16 || \
+	RTE_EFD_VALUE_NUM_BITS == 24 || RTE_EFD_VALUE_NUM_BITS == 32)
+#define EFD_LOAD_SI128(val) _mm_load_si128(val)
+#else
+#define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
+#endif
+
+efd_value_t
+efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b)
+{
+#ifdef CC_SUPPORT_AVX2
+	efd_value_t value = 0;
+	uint32_t i = 0;
+	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
+	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
+
+	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
+		__m256i vhash_idx =
+				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
+				(__m128i const *) &group_hash_idx[i]));
+		__m256i vlookup_table = _mm256_cvtepu16_epi32(
+				EFD_LOAD_SI128((__m128i const *)
+				&group_lookup_table[i]));
+		__m256i vhash = _mm256_add_epi32(vhash_val_a,
+				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
+		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
+				EFD_LOOKUPTBL_SHIFT);
+		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
+				vbucket_idx);
+
+		value |= (_mm256_movemask_ps(
+			(__m256) _mm256_slli_epi32(vresult, 31))
+			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
+	}
+
+	return value;
+#else
+	RTE_SET_USED(group_hash_idx);
+	RTE_SET_USED(group_lookup_table);
+	RTE_SET_USED(hash_val_a);
+	RTE_SET_USED(hash_val_b);
+	/* Return dummy value, only to avoid compilation breakage */
+	return 0;
+#endif
+
+}
diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
index 34f37d7..7a082aa 100644
--- a/lib/librte_efd/rte_efd_x86.h
+++ b/lib/librte_efd/rte_efd_x86.h
@@ -36,51 +36,7 @@
  */
 #include <immintrin.h>
 
-#if (RTE_EFD_VALUE_NUM_BITS == 8 || RTE_EFD_VALUE_NUM_BITS == 16 || \
-	RTE_EFD_VALUE_NUM_BITS == 24 || RTE_EFD_VALUE_NUM_BITS == 32)
-#define EFD_LOAD_SI128(val) _mm_load_si128(val)
-#else
-#define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
-#endif
-
-static inline efd_value_t
+extern efd_value_t
 efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
 		const efd_lookuptbl_t *group_lookup_table,
-		const uint32_t hash_val_a, const uint32_t hash_val_b)
-{
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
-	efd_value_t value = 0;
-	uint32_t i = 0;
-	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
-	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
-
-	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
-		__m256i vhash_idx =
-				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
-				(__m128i const *) &group_hash_idx[i]));
-		__m256i vlookup_table = _mm256_cvtepu16_epi32(
-				EFD_LOAD_SI128((__m128i const *)
-				&group_lookup_table[i]));
-		__m256i vhash = _mm256_add_epi32(vhash_val_a,
-				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
-		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
-				EFD_LOOKUPTBL_SHIFT);
-		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
-				vbucket_idx);
-
-		value |= (_mm256_movemask_ps(
-			(__m256) _mm256_slli_epi32(vresult, 31))
-			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
-	}
-
-	return value;
-#else
-	RTE_SET_USED(group_hash_idx);
-	RTE_SET_USED(group_lookup_table);
-	RTE_SET_USED(hash_val_a);
-	RTE_SET_USED(hash_val_b);
-	/* Return dummy value, only to avoid compilation breakage */
-	return 0;
-#endif
-
-}
+		const uint32_t hash_val_a, const uint32_t hash_val_b);
-- 
2.7.4

* Re: [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-05  9:37         ` Ananyev, Konstantin
  2017-10-05  9:38           ` Ananyev, Konstantin
  2017-10-05 11:19           ` Li, Xiaoyun
  0 siblings, 2 replies; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-05  9:37 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce
  Cc: Lu, Wenzhuo, Zhang, Helin, dev,
	Thomas Monjalon (thomas.monjalon@6wind.com)

> diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> index 8c08b8d..15a2fe9 100644
> --- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> +++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> @@ -241,5 +241,6 @@ EXPERIMENTAL {
>  	rte_service_runstate_set;
>  	rte_service_set_stats_enable;
>  	rte_service_start_with_defaults;
> +	rte_memcpy_ptr;
> 
>  } DPDK_17.08;

I am not an expert in DPDK versioning system,
But shouldn't we create a 17.11 section here?
Also I think an alphabetical order should be preserved here.
Konstantin

* Re: [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-05  9:37         ` Ananyev, Konstantin
@ 2017-10-05  9:38           ` Ananyev, Konstantin
  2017-10-05 11:19           ` Li, Xiaoyun
  1 sibling, 0 replies; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-05  9:38 UTC (permalink / raw)
  To: Ananyev, Konstantin, Li, Xiaoyun, Richardson, Bruce
  Cc: Lu, Wenzhuo, Zhang, Helin, dev, thomas



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Ananyev, Konstantin
> Sent: Thursday, October 5, 2017 10:37 AM
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>; dev@dpdk.org; Thomas Monjalon
> (thomas.monjalon@6wind.com) <thomas.monjalon@6wind.com>
> Subject: Re: [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
> 
> > diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > index 8c08b8d..15a2fe9 100644
> > --- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > +++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > @@ -241,5 +241,6 @@ EXPERIMENTAL {
> >  	rte_service_runstate_set;
> >  	rte_service_set_stats_enable;
> >  	rte_service_start_with_defaults;
> > +	rte_memcpy_ptr;
> >
> >  } DPDK_17.08;
> 
> I am not an expert in DPDK versioning system,
> But shouldn't we create a 17.11 section here?
> Also I think an alphabetical order should be preserved here.
> Konstantin

* Re: [dpdk-dev] [PATCH v6 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
@ 2017-10-05  9:40         ` Ananyev, Konstantin
  2017-10-05 10:23           ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-05  9:40 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev



> +efd_value_t
> +efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> +		const efd_lookuptbl_t *group_lookup_table,
> +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> +{
> +#ifdef CC_SUPPORT_AVX2
> +	efd_value_t value = 0;
> +	uint32_t i = 0;
> +	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
> +	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
> +
> +	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
> +		__m256i vhash_idx =
> +				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
> +				(__m128i const *) &group_hash_idx[i]));
> +		__m256i vlookup_table = _mm256_cvtepu16_epi32(
> +				EFD_LOAD_SI128((__m128i const *)
> +				&group_lookup_table[i]));
> +		__m256i vhash = _mm256_add_epi32(vhash_val_a,
> +				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
> +		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
> +				EFD_LOOKUPTBL_SHIFT);
> +		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
> +				vbucket_idx);
> +
> +		value |= (_mm256_movemask_ps(
> +			(__m256) _mm256_slli_epi32(vresult, 31))
> +			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
> +	}
> +
> +	return value;
> +#else
> +	RTE_SET_USED(group_hash_idx);
> +	RTE_SET_USED(group_lookup_table);
> +	RTE_SET_USED(hash_val_a);
> +	RTE_SET_USED(hash_val_b);
> +	/* Return dummy value, only to avoid compilation breakage */
> +	return 0;
> +#endif
> +
> +}

#ifdef CC_SUPPORT_AVX2 is still there.
Will wait for v7 I guess.
Konstantin

* Re: [dpdk-dev] [PATCH v6 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-05  9:40         ` Ananyev, Konstantin
@ 2017-10-05 10:23           ` Li, Xiaoyun
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-05 10:23 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev

Yes. Sorry about that.

> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, October 5, 2017 17:41
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org
> Subject: RE: [PATCH v6 3/3] efd: run-time dispatch over x86 EFD functions
> 
> 
> 
> > +efd_value_t
> > +efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
> > +		const efd_lookuptbl_t *group_lookup_table,
> > +		const uint32_t hash_val_a, const uint32_t hash_val_b)
> > +{
> > +#ifdef CC_SUPPORT_AVX2
> > +	efd_value_t value = 0;
> > +	uint32_t i = 0;
> > +	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
> > +	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
> > +
> > +	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
> > +		__m256i vhash_idx =
> > +				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
> > +				(__m128i const *) &group_hash_idx[i]));
> > +		__m256i vlookup_table = _mm256_cvtepu16_epi32(
> > +				EFD_LOAD_SI128((__m128i const *)
> > +				&group_lookup_table[i]));
> > +		__m256i vhash = _mm256_add_epi32(vhash_val_a,
> > +				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
> > +		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
> > +				EFD_LOOKUPTBL_SHIFT);
> > +		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
> > +				vbucket_idx);
> > +
> > +		value |= (_mm256_movemask_ps(
> > +			(__m256) _mm256_slli_epi32(vresult, 31))
> > +			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
> > +	}
> > +
> > +	return value;
> > +#else
> > +	RTE_SET_USED(group_hash_idx);
> > +	RTE_SET_USED(group_lookup_table);
> > +	RTE_SET_USED(hash_val_a);
> > +	RTE_SET_USED(hash_val_b);
> > +	/* Return dummy value, only to avoid compilation breakage */
> > +	return 0;
> > +#endif
> > +
> > +}
> 
> #ifdef CC_SUPPORT_AVX2 is still there.
> Will wait for v7 I guess.
> Konstantin

* Re: [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-05  9:37         ` Ananyev, Konstantin
  2017-10-05  9:38           ` Ananyev, Konstantin
@ 2017-10-05 11:19           ` Li, Xiaoyun
  2017-10-05 11:26             ` Richardson, Bruce
  2017-10-05 11:26             ` Li, Xiaoyun
  1 sibling, 2 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-05 11:19 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce
  Cc: Lu, Wenzhuo, Zhang, Helin, dev,
	Thomas Monjalon (thomas.monjalon@6wind.com)



> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, October 5, 2017 17:37
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org; Thomas Monjalon
> (thomas.monjalon@6wind.com) <thomas.monjalon@6wind.com>
> Subject: RE: [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
> 
> > diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > index 8c08b8d..15a2fe9 100644
> > --- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > +++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > @@ -241,5 +241,6 @@ EXPERIMENTAL {
> >  	rte_service_runstate_set;
> >  	rte_service_set_stats_enable;
> >  	rte_service_start_with_defaults;
> > +	rte_memcpy_ptr;
> >
> >  } DPDK_17.08;
> 
> I am not an expert in DPDK versioning system, But shouldn't we create a
> 17.11 section here?
Should we create a 17.11 section? I am not sure who to ask for.

> Also I think an alphabetical order should be preserved here.
OK.

> Konstantin

* Re: [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-05 11:19           ` Li, Xiaoyun
@ 2017-10-05 11:26             ` Richardson, Bruce
  2017-10-05 11:26             ` Li, Xiaoyun
  1 sibling, 0 replies; 88+ messages in thread
From: Richardson, Bruce @ 2017-10-05 11:26 UTC (permalink / raw)
  To: Li, Xiaoyun, Ananyev, Konstantin
  Cc: Lu, Wenzhuo, Zhang, Helin, dev,
	Thomas Monjalon (thomas.monjalon@6wind.com)



> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Thursday, October 5, 2017 12:19 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org; Thomas Monjalon
> (thomas.monjalon@6wind.com) <thomas.monjalon@6wind.com>
> Subject: RE: [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
> 
> 
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, October 5, 2017 17:37
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org; Thomas Monjalon
> > (thomas.monjalon@6wind.com) <thomas.monjalon@6wind.com>
> > Subject: RE: [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
> >
> > > diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > index 8c08b8d..15a2fe9 100644
> > > --- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > +++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > @@ -241,5 +241,6 @@ EXPERIMENTAL {
> > >  	rte_service_runstate_set;
> > >  	rte_service_set_stats_enable;
> > >  	rte_service_start_with_defaults;
> > > +	rte_memcpy_ptr;
> > >
> > >  } DPDK_17.08;
> >
> > I am not an expert in DPDK versioning system, But shouldn't we create a
> > 17.11 section here?
> Should we create a 17.11 section? I am not sure who to ask for.
> 
Any new functions that are public and are added in the 17.11 release need to
be added to the map file in a new 17.11 section. They are not part of the
ABI for the 17.08 release as they were not present there.

* Re: [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-05 11:19           ` Li, Xiaoyun
  2017-10-05 11:26             ` Richardson, Bruce
@ 2017-10-05 11:26             ` Li, Xiaoyun
  2017-10-05 12:12               ` Ananyev, Konstantin
  1 sibling, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-05 11:26 UTC (permalink / raw)
  To: Li, Xiaoyun, Ananyev, Konstantin, Richardson, Bruce
  Cc: Lu, Wenzhuo, Zhang, Helin, dev,
	Thomas Monjalon (thomas.monjalon@6wind.com)

Another thing: if we add a 17.11 section, should it end with 17.08 or EXPERIMENTAL?

Best Regards,
Xiaoyun Li



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Li, Xiaoyun
> Sent: Thursday, October 5, 2017 19:19
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; dev@dpdk.org; Thomas Monjalon
> (thomas.monjalon@6wind.com) <thomas.monjalon@6wind.com>
> Subject: Re: [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, October 5, 2017 17:37
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org; Thomas Monjalon
> > (thomas.monjalon@6wind.com) <thomas.monjalon@6wind.com>
> > Subject: RE: [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
> >
> > > diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > index 8c08b8d..15a2fe9 100644
> > > --- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > +++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > @@ -241,5 +241,6 @@ EXPERIMENTAL {
> > >  	rte_service_runstate_set;
> > >  	rte_service_set_stats_enable;
> > >  	rte_service_start_with_defaults;
> > > +	rte_memcpy_ptr;
> > >
> > >  } DPDK_17.08;
> >
> > I am not an expert in DPDK versioning system, But shouldn't we create a
> > 17.11 section here?
> Should we create a 17.11 section? I am not sure who to ask for.
> 
> > Also I think an alphabetical order should be preserved here.
> OK.
> 
> > Konstantin

* Re: [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-05 11:26             ` Li, Xiaoyun
@ 2017-10-05 12:12               ` Ananyev, Konstantin
  0 siblings, 0 replies; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-05 12:12 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce
  Cc: Lu, Wenzhuo, Zhang, Helin, dev,
	Thomas Monjalon (thomas.monjalon@6wind.com)



> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Thursday, October 5, 2017 12:27 PM
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>; dev@dpdk.org; Thomas Monjalon
> (thomas.monjalon@6wind.com) <thomas.monjalon@6wind.com>
> Subject: RE: [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
> 
> Another thing: if we add a 17.11 section, should it end with 17.08 or EXPERIMENTAL?

I don't see a reason to go experimental here: you are not adding a new API,
just reworking an existing one (rte_memcpy()), and rte_memcpy_ptr is exposed
only for performance reasons.
So my vote would be for 17.11.
Konstantin 

> 
> Best Regards,
> Xiaoyun Li
> 
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Li, Xiaoyun
> > Sent: Thursday, October 5, 2017 19:19
> > To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> > Bruce <bruce.richardson@intel.com>
> > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; dev@dpdk.org; Thomas Monjalon
> > (thomas.monjalon@6wind.com) <thomas.monjalon@6wind.com>
> > Subject: Re: [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over
> > memcpy
> >
> >
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Thursday, October 5, 2017 17:37
> > > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > > <bruce.richardson@intel.com>
> > > Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; dev@dpdk.org; Thomas Monjalon
> > > (thomas.monjalon@6wind.com) <thomas.monjalon@6wind.com>
> > > Subject: RE: [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy
> > >
> > > > diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > > b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > > index 8c08b8d..15a2fe9 100644
> > > > --- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > > +++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
> > > > @@ -241,5 +241,6 @@ EXPERIMENTAL {
> > > >  	rte_service_runstate_set;
> > > >  	rte_service_set_stats_enable;
> > > >  	rte_service_start_with_defaults;
> > > > +	rte_memcpy_ptr;
> > > >
> > > >  } DPDK_17.08;
> > >
> > > I am not an expert in DPDK versioning system, But shouldn't we create a
> > > 17.11 section here?
> > Should we create a 17.11 section? I am not sure who to ask for.
> >
> > > Also I think an alphabetical order should be preserved here.
> > OK.
> >
> > > Konstantin

* [dpdk-dev] [PATCH v7 0/3] run-time Linking support
  2017-10-04 22:58     ` [dpdk-dev] [PATCH v6 " Xiaoyun Li
                         ` (2 preceding siblings ...)
  2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
@ 2017-10-05 12:33       ` Xiaoyun Li
  2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
                           ` (5 more replies)
  3 siblings, 6 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-05 12:33 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patchset dynamically selects functions at run-time based on the CPU flags
that the current machine supports. It modifies memcpy, the memcpy perf test
and x86 EFD to use function pointers that are bound at constructor time.
In a cloud environment, users can then compile once for a minimum target
such as 'haswell' (not 'native'), run on different platforms (haswell or
newer), and still get ISA optimizations based on the running CPU.
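
For readers new to the technique, the constructor-time binding described above
can be sketched roughly as follows. This is only an illustration, not the code
in the patches: the names copy_sse, copy_avx2, copy_avx512f, copy_ptr and
dispatch_copy are hypothetical stand-ins, the three variants simply call
memcpy(), and the CPU check uses the GCC/clang builtin
__builtin_cpu_supports() rather than DPDK's rte_cpu_get_flag_enabled().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the per-ISA copy routines. In a real build each
 * variant would live in its own .c file compiled with the matching -m flag. */
static void *copy_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *copy_avx512f(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Function pointer bound once, before main() runs. */
static void *(*copy_ptr)(void *dst, const void *src, size_t n);
static const char *copy_impl_name;

static void __attribute__((constructor))
copy_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	/* Required before __builtin_cpu_supports() when running in a constructor. */
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f")) {
		copy_ptr = copy_avx512f;
		copy_impl_name = "avx512f";
		return;
	}
	if (__builtin_cpu_supports("avx2")) {
		copy_ptr = copy_avx2;
		copy_impl_name = "avx2";
		return;
	}
#endif
	/* Baseline fallback when no wider ISA is detected. */
	copy_ptr = copy_sse;
	copy_impl_name = "sse";
}

/* Public entry point: one indirect call through the bound pointer. */
void *dispatch_copy(void *dst, const void *src, size_t n)
{
	assert(copy_ptr != NULL);
	return copy_ptr(dst, src, n);
}

/* Reports which variant the constructor selected. */
const char *dispatch_copy_impl(void)
{
	return copy_impl_name;
}
```

The point of the design is that the binary itself is built for the minimum
target, while the pointer selects the widest variant the running CPU reports.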

Xiaoyun Li (3):
  eal/x86: run-time dispatch over memcpy
  app/test: run-time dispatch over memcpy perf test
  efd: run-time dispatch over x86 EFD functions

---
v2
* Use gcc function multi-versioning to avoid compilation issues.
* Add macros for AVX512 and AVX2. Only if users enable AVX512 and the compiler
supports it, the AVX512 codes would be compiled. Only if the compiler supports
AVX2, the AVX2 codes would be compiled.

v3
* Reduce function calls by keeping only rte_memcpy_xxx.
* Add a condition: when the copy size is small, use the inline code path.
Otherwise, use the dynamic code path.
* To support attribute target, the clang version must be greater than 3.7.
Otherwise, the SSE/AVX code path is chosen, the same as before.
* Move two macro functions to the top of the code since they are used in
both the inline SSE/AVX and dynamic SSE/AVX code.

v4
* Split rte_memcpy.h into several .c files and modify the makefiles to compile
the AVX2 and AVX512 files.

v5
* Delete the redundant, repeated code in rte_memcpy_xxx.
* Modify the makefiles to enable reuse of the existing rte_memcpy.
* Delete the redundant code in rte_efd_x86.h from v4. Move it into a .c file and
compile that file with -mavx2 in the makefile, since it is already chosen at run-time.

v6
* Fix shared target build failure.
* Safely remove redundant efd x86 avx2 codes since the file is compiled
with -mavx2.

v7
* Modify the added version map code in v6 to be more reasonable.
* Safely remove redundant efd x86 avx2 codes since the file is compiled
with -mavx2.
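
The small-copy rule from the v3 notes (inline path for small sizes, dynamic
path otherwise) can be sketched as below. This is a minimal illustration under
assumptions, not the patch's code: COPY_THRESH mirrors the value 128 that the
patch gives RTE_X86_MEMCPY_THRESH, while copy_inline, copy_dynamic, copy_ptr,
copy and the inclusive "<=" boundary are hypothetical choices made for the
example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define COPY_THRESH 128 /* same value as RTE_X86_MEMCPY_THRESH in the patch */

/* Counts how often the indirect (dynamic) path is taken. */
static unsigned long dynamic_calls;

/* Hypothetical stand-in for the run-time selected implementation. */
static void *copy_dynamic(void *dst, const void *src, size_t n)
{
	dynamic_calls++;
	return memcpy(dst, src, n);
}

/* Normally bound at constructor time; bound statically here for brevity. */
static void *(*copy_ptr)(void *, const void *, size_t) = copy_dynamic;

/* Small copies stay inline and avoid the indirect call overhead. */
static inline void *copy_inline(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Wrapper: choose the path from the copy size alone. */
static inline void *copy(void *dst, const void *src, size_t n)
{
	if (n <= COPY_THRESH)
		return copy_inline(dst, src, n);
	return copy_ptr(dst, src, n);
}

/* Usage example: a 64-byte copy stays inline, a 256-byte copy is dispatched. */
static void copy_self_check(void)
{
	unsigned char src[256], dst[256];

	dynamic_calls = 0;
	memset(src, 0xAB, sizeof(src));
	memset(dst, 0, sizeof(dst));

	copy(dst, src, 64);
	assert(dynamic_calls == 0);
	assert(memcmp(dst, src, 64) == 0);

	copy(dst, src, sizeof(src));
	assert(dynamic_calls == 1);
	assert(memcmp(dst, src, sizeof(src)) == 0);
}
```

The trade-off is that small copies keep the cost profile of the old inline
code, and only larger copies pay for the indirect call in exchange for the
wider ISA.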

 lib/librte_eal/bsdapp/eal/Makefile                 |  19 +
 lib/librte_eal/bsdapp/eal/rte_eal_version.map      |   7 +
 .../common/include/arch/x86/rte_memcpy.c           |  59 ++
 .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
 .../common/include/arch/x86/rte_memcpy_avx2.c      |  44 +
 .../common/include/arch/x86/rte_memcpy_avx512f.c   |  44 +
 .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
 .../common/include/arch/x86/rte_memcpy_sse.c       |  40 +
 lib/librte_eal/linuxapp/eal/Makefile               |  19 +
 lib/librte_eal/linuxapp/eal/rte_eal_version.map    |   7 +
 lib/librte_efd/Makefile                            |   6 +
 lib/librte_efd/rte_efd_x86.c                       |  77 ++
 lib/librte_efd/rte_efd_x86.h                       |  48 +-
 mk/rte.cpuflags.mk                                 |  14 +
 test/test/test_memcpy_perf.c                       |  40 +-
 15 files changed, 1289 insertions(+), 905 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
 create mode 100644 lib/librte_efd/rte_efd_x86.c

-- 
2.7.4

* [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-05 12:33       ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Xiaoyun Li
@ 2017-10-05 12:33         ` Xiaoyun Li
  2017-10-09 17:47           ` Thomas Monjalon
  2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
                           ` (4 subsequent siblings)
  5 siblings, 1 reply; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-05 12:33 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch dynamically selects the memcpy implementation at run-time based
on the CPU flags that the current machine supports. It uses a function
pointer that is bound to the matching function at constructor time.
In addition, the AVX512 code is compiled only if the user enables it in
the config and the compiler supports it.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_eal/bsdapp/eal/Makefile                 |  19 +
 lib/librte_eal/bsdapp/eal/rte_eal_version.map      |   7 +
 .../common/include/arch/x86/rte_memcpy.c           |  59 ++
 .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
 .../common/include/arch/x86/rte_memcpy_avx2.c      |  44 +
 .../common/include/arch/x86/rte_memcpy_avx512f.c   |  44 +
 .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
 .../common/include/arch/x86/rte_memcpy_sse.c       |  40 +
 lib/librte_eal/linuxapp/eal/Makefile               |  19 +
 lib/librte_eal/linuxapp/eal/rte_eal_version.map    |   7 +
 mk/rte.cpuflags.mk                                 |  14 +
 11 files changed, 1177 insertions(+), 846 deletions(-)
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c

diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile
index 005019e..1dcd2e3 100644
--- a/lib/librte_eal/bsdapp/eal/Makefile
+++ b/lib/librte_eal/bsdapp/eal/Makefile
@@ -36,6 +36,7 @@ LIB = librte_eal.a
 ARCH_DIR ?= $(RTE_ARCH)
 VPATH += $(RTE_SDK)/lib/librte_eal/common
 VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
+VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
 
 CFLAGS += -I$(SRCDIR)/include
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
@@ -93,6 +94,24 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/bsdapp/eal/rte_eal_version.map b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
index 47a09ea..764a39b 100644
--- a/lib/librte_eal/bsdapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
@@ -238,3 +238,10 @@ EXPERIMENTAL {
 	rte_service_start_with_defaults;
 
 } DPDK_17.08;
+
+DPDK_17.11 {
+	global:
+
+	rte_memcpy_ptr;
+
+} DPDK_17.08;
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
new file mode 100644
index 0000000..74ae702
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.c
@@ -0,0 +1,59 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+#include <rte_cpuflags.h>
+#include <rte_log.h>
+
+void *(*rte_memcpy_ptr)(void *dst, const void *src, size_t n) = NULL;
+
+static void __attribute__((constructor))
+rte_memcpy_init(void)
+{
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
+		rte_memcpy_ptr = rte_memcpy_avx512f;
+		RTE_LOG(DEBUG, EAL, "AVX512 memcpy is using!\n");
+		return;
+	}
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
+		rte_memcpy_ptr = rte_memcpy_avx2;
+		RTE_LOG(DEBUG, EAL, "AVX2 memcpy is using!\n");
+		return;
+	}
+#endif
+	rte_memcpy_ptr = rte_memcpy_sse;
+	RTE_LOG(DEBUG, EAL, "Default SSE/AVX memcpy is using!\n");
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 74c280c..460dcdb 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
  *   All rights reserved.
  *
  *   Redistribution and use in source and binary forms, with or without
@@ -34,867 +34,36 @@
 #ifndef _RTE_MEMCPY_X86_64_H_
 #define _RTE_MEMCPY_X86_64_H_
 
-/**
- * @file
- *
- * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
- */
-
-#include <stdio.h>
-#include <stdint.h>
-#include <string.h>
-#include <rte_vect.h>
-#include <rte_common.h>
+#include <rte_memcpy_internal.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-/**
- * Copy bytes from one location to another. The locations must not overlap.
- *
- * @note This is implemented as a macro, so it's address should not be taken
- * and care is needed as parameter expressions may be evaluated multiple times.
- *
- * @param dst
- *   Pointer to the destination of the data.
- * @param src
- *   Pointer to the source data.
- * @param n
- *   Number of bytes to copy.
- * @return
- *   Pointer to the destination data.
- */
-static __rte_always_inline void *
-rte_memcpy(void *dst, const void *src, size_t n);
-
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define RTE_X86_MEMCPY_THRESH 128
 
-#define ALIGNMENT_MASK 0x3F
+extern void *
+(*rte_memcpy_ptr)(void *dst, const void *src, size_t n);
 
 /**
- * AVX512 implementation below
+ * Different implementations of memcpy.
  */
+extern void*
+rte_memcpy_avx512f(void *dst, const void *src, size_t n);
 
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
+extern void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n);
 
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	__m512i zmm0;
-
-	zmm0 = _mm512_loadu_si512((const void *)src);
-	_mm512_storeu_si512((void *)dst, zmm0);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-	rte_mov64(dst + 2 * 64, src + 2 * 64);
-	rte_mov64(dst + 3 * 64, src + 3 * 64);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1;
-
-	while (n >= 128) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 128;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		src = src + 128;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		dst = dst + 128;
-	}
-}
-
-/**
- * Copy 512-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
-
-	while (n >= 512) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 512;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
-		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
-		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
-		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
-		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
-		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
-		src = src + 512;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
-		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
-		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
-		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
-		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
-		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
-		dst = dst + 512;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				  (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				  (const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK63:
-		if (n > 64) {
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-			return ret;
-		}
-		if (n > 0)
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes
-	 */
-	dstofss = ((uintptr_t)dst & 0x3F);
-	if (dstofss > 0) {
-		dstofss = 64 - dstofss;
-		n -= dstofss;
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 512-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 511;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy 128-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	if (n >= 128) {
-		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-		bits = n;
-		n = n & 127;
-		bits -= n;
-		src = (const uint8_t *)src + bits;
-		dst = (uint8_t *)dst + bits;
-	}
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK63;
-}
-
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-
-#define ALIGNMENT_MASK 0x1F
-
-/**
- * AVX2 implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
-
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
-	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m256i ymm0, ymm1, ymm2, ymm3;
-
-	while (n >= 128) {
-		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 128;
-		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
-		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
-		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
-		src = (const uint8_t *)src + 128;
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
-		dst = (uint8_t *)dst + 128;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 256 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 256) {
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK31:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-		if (n > 32) {
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 256 bytes
-	 */
-	dstofss = (uintptr_t)dst & 0x1F;
-	if (dstofss > 0) {
-		dstofss = 32 - dstofss;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 128-byte blocks
-	 */
-	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 127;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK31;
-}
-
-#else /* RTE_MACHINE_CPUFLAG */
-
-#define ALIGNMENT_MASK 0x0F
-
-/**
- * SSE & AVX implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
-	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
-	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
-	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
-	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
-	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
-	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
-	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
-}
-
-/**
- * Macro for copying unaligned block from one location to another with constant load offset,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be immediate value within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
-__extension__ ({                                                                                            \
-    int tmp;                                                                                                \
-    while (len >= 128 + 16 - offset) {                                                                      \
-        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
-        len -= 128;                                                                                         \
-        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
-        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
-        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
-        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
-        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
-        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
-        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
-        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
-        src = (const uint8_t *)src + 128;                                                                   \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
-        dst = (uint8_t *)dst + 128;                                                                         \
-    }                                                                                                       \
-    tmp = len;                                                                                              \
-    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
-    tmp -= len;                                                                                             \
-    src = (const uint8_t *)src + tmp;                                                                       \
-    dst = (uint8_t *)dst + tmp;                                                                             \
-    if (len >= 32 + 16 - offset) {                                                                          \
-        while (len >= 32 + 16 - offset) {                                                                   \
-            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
-            len -= 32;                                                                                      \
-            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
-            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
-            src = (const uint8_t *)src + 32;                                                                \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
-            dst = (uint8_t *)dst + 32;                                                                      \
-        }                                                                                                   \
-        tmp = len;                                                                                          \
-        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
-        tmp -= len;                                                                                         \
-        src = (const uint8_t *)src + tmp;                                                                   \
-        dst = (uint8_t *)dst + tmp;                                                                         \
-    }                                                                                                       \
-})
-
-/**
- * Macro for copying unaligned block from one location to another,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Use switch here because the aligning instruction requires immediate value for shift count.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
-__extension__ ({                                                      \
-    switch (offset) {                                                 \
-    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
-    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
-    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
-    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
-    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
-    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
-    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
-    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
-    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
-    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
-    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
-    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
-    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
-    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
-    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
-    default:;                                                         \
-    }                                                                 \
-})
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t srcofs;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 128) {
-		goto COPY_BLOCK_128_BACK15;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-COPY_BLOCK_255_BACK15:
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK15:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-COPY_BLOCK_64_BACK15:
-		if (n >= 32) {
-			n -= 32;
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 32;
-			dst = (uint8_t *)dst + 32;
-		}
-		if (n > 16) {
-			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes,
-	 * and make sure the first 15 bytes are copied, because
-	 * unaligned copy functions require up to 15 bytes
-	 * backwards access.
-	 */
-	dstofss = (uintptr_t)dst & 0x0F;
-	if (dstofss > 0) {
-		dstofss = 16 - dstofss + 16;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-	srcofs = ((uintptr_t)src & 0x0F);
-
-	/**
-	 * For aligned copy
-	 */
-	if (srcofs == 0) {
-		/**
-		 * Copy 256-byte blocks
-		 */
-		for (; n >= 256; n -= 256) {
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			dst = (uint8_t *)dst + 256;
-			src = (const uint8_t *)src + 256;
-		}
-
-		/**
-		 * Copy whatever left
-		 */
-		goto COPY_BLOCK_255_BACK15;
-	}
-
-	/**
-	 * For copy with unaligned load
-	 */
-	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_64_BACK15;
-}
-
-#endif /* RTE_MACHINE_CPUFLAG */
-
-static inline void *
-rte_memcpy_aligned(void *dst, const void *src, size_t n)
-{
-	void *ret = dst;
-
-	/* Copy size <= 16 bytes */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dst = *(const uint8_t *)src;
-			src = (const uint8_t *)src + 1;
-			dst = (uint8_t *)dst + 1;
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dst = *(const uint16_t *)src;
-			src = (const uint16_t *)src + 1;
-			dst = (uint16_t *)dst + 1;
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dst = *(const uint32_t *)src;
-			src = (const uint32_t *)src + 1;
-			dst = (uint32_t *)dst + 1;
-		}
-		if (n & 0x08)
-			*(uint64_t *)dst = *(const uint64_t *)src;
-
-		return ret;
-	}
-
-	/* Copy 16 <= size <= 32 bytes */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-
-		return ret;
-	}
-
-	/* Copy 32 < size <= 64 bytes */
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-
-		return ret;
-	}
-
-	/* Copy 64 bytes blocks */
-	for (; n >= 64; n -= 64) {
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;
-	}
-
-	/* Copy whatever left */
-	rte_mov64((uint8_t *)dst - 64 + n,
-			(const uint8_t *)src - 64 + n);
-
-	return ret;
-}
+extern void *
+rte_memcpy_sse(void *dst, const void *src, size_t n);
 
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
-	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-		return rte_memcpy_aligned(dst, src, n);
+	if (n <= RTE_X86_MEMCPY_THRESH)
+		return rte_memcpy_internal(dst, src, n);
 	else
-		return rte_memcpy_generic(dst, src, n);
+		return (*rte_memcpy_ptr)(dst, src, n);
 }
 
 #ifdef __cplusplus
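The size-based split in the new rte_memcpy() above (inline path up to RTE_X86_MEMCPY_THRESH, function pointer beyond it) can be tried out with a small standalone sketch. Everything here is an assumption made for illustration: COPY_THRESH mirrors the 128-byte threshold from the patch, while copy_small, copy_large, copy_ptr, copy_dispatch and the large_calls counter are hypothetical and only exist to make the chosen path observable.

```c
#include <stddef.h>
#include <string.h>

#define COPY_THRESH 128 /* mirrors RTE_X86_MEMCPY_THRESH from the patch */

/* Counts how often the out-of-line path is taken (illustration only). */
static unsigned long large_calls;

/* Hypothetical out-of-line routine reached through the function pointer. */
static void *copy_large(void *dst, const void *src, size_t n)
{
	large_calls++;
	return memcpy(dst, src, n);
}

static void *(*copy_ptr)(void *, const void *, size_t) = copy_large;

/* Hypothetical inline path for small sizes: a plain byte loop. */
static inline void *copy_small(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Small copies stay inline; larger ones pay one indirect call. */
static inline void *copy_dispatch(void *dst, const void *src, size_t n)
{
	if (n <= COPY_THRESH)
		return copy_small(dst, src, n);
	return (*copy_ptr)(dst, src, n);
}
```

With this layout a 128-byte copy never touches the pointer, and a 129-byte copy makes exactly one indirect call, which is the trade-off the v3 changelog describes for small versus large sizes.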
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
new file mode 100644
index 0000000..3ad229c
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX2
+#error RTE_MACHINE_CPUFLAG_AVX2 not defined
+#endif
+
+void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
new file mode 100644
index 0000000..be8d964
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX512F
+#error RTE_MACHINE_CPUFLAG_AVX512F not defined
+#endif
+
+void *
+rte_memcpy_avx512f(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
new file mode 100644
index 0000000..d17fb5b
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
@@ -0,0 +1,909 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCPY_INTERNAL_X86_64_H_
+#define _RTE_MEMCPY_INTERNAL_X86_64_H_
+
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <rte_vect.h>
+#include <rte_common.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+#define ALIGNMENT_MASK 0x3F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_loadu_si512((const void *)src);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+	rte_mov64(dst + 2 * 64, src + 2 * 64);
+	rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1;
+
+	while (n >= 128) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 128;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		src = src + 128;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		dst = dst + 128;
+	}
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+	while (n >= 512) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 512;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+		src = src + 512;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+		dst = dst + 512;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				  (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				  (const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK63:
+		if (n > 64) {
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+			return ret;
+		}
+		if (n > 0)
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
+	 */
+	dstofss = ((uintptr_t)dst & 0x3F);
+	if (dstofss > 0) {
+		dstofss = 64 - dstofss;
+		n -= dstofss;
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 512-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 511;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 128-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 128) {
+		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+	}
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK63;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+#define ALIGNMENT_MASK 0x1F
+
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1, ymm2, ymm3;
+
+	while (n >= 128) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 0 * 32));
+		n -= 128;
+		ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 1 * 32));
+		ymm2 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 2 * 32));
+		ymm3 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 3 * 32));
+		src = (const uint8_t *)src + 128;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
+		dst = (uint8_t *)dst + 128;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 256 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 256) {
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK31:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+		if (n > 32) {
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 256 bytes
+	 */
+	dstofss = (uintptr_t)dst & 0x1F;
+	if (dstofss > 0) {
+		dstofss = 32 - dstofss;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 128-byte blocks
+	 */
+	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 127;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK31;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+#define ALIGNMENT_MASK 0x0F
+
+/**
+ * SSE & AVX implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
+	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
+	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
+	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
+	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
+	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
+	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
+	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
+}
+
+/**
+ * Macro for copying unaligned block from one location to another with constant load offset,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be immediate value within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
+__extension__ ({                                                                                            \
+    size_t tmp;                                                                                             \
+    while (len >= 128 + 16 - offset) {                                                                      \
+        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
+        len -= 128;                                                                                         \
+        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
+        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
+        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
+        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
+        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
+        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
+        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
+        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
+        src = (const uint8_t *)src + 128;                                                                   \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
+        dst = (uint8_t *)dst + 128;                                                                         \
+    }                                                                                                       \
+    tmp = len;                                                                                              \
+    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
+    tmp -= len;                                                                                             \
+    src = (const uint8_t *)src + tmp;                                                                       \
+    dst = (uint8_t *)dst + tmp;                                                                             \
+    if (len >= 32 + 16 - offset) {                                                                          \
+        while (len >= 32 + 16 - offset) {                                                                   \
+            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
+            len -= 32;                                                                                      \
+            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
+            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
+            src = (const uint8_t *)src + 32;                                                                \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
+            dst = (uint8_t *)dst + 32;                                                                      \
+        }                                                                                                   \
+        tmp = len;                                                                                          \
+        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
+        tmp -= len;                                                                                         \
+        src = (const uint8_t *)src + tmp;                                                                   \
+        dst = (uint8_t *)dst + tmp;                                                                         \
+    }                                                                                                       \
+})
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
+__extension__ ({                                                      \
+    switch (offset) {                                                 \
+    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
+    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
+    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
+    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
+    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
+    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
+    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
+    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
+    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
+    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
+    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
+    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
+    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
+    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
+    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
+    default:;                                                         \
+    }                                                                 \
+})
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t srcofs;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 128)
+		goto COPY_BLOCK_128_BACK15;
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov128((uint8_t *)dst + 128,
+					(const uint8_t *)src + 128);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+COPY_BLOCK_255_BACK15:
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK15:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK15:
+		if (n >= 32) {
+			n -= 32;
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 32;
+			dst = (uint8_t *)dst + 32;
+		}
+		if (n > 16) {
+			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes,
+	 * and make sure the first 15 bytes are copied, because
+	 * unaligned copy functions require up to 15 bytes
+	 * backwards access.
+	 */
+	dstofss = (uintptr_t)dst & 0x0F;
+	if (dstofss > 0) {
+		dstofss = 16 - dstofss + 16;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+	srcofs = ((uintptr_t)src & 0x0F);
+
+	/**
+	 * For aligned copy
+	 */
+	if (srcofs == 0) {
+		/**
+		 * Copy 256-byte blocks
+		 */
+		for (; n >= 256; n -= 256) {
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			dst = (uint8_t *)dst + 256;
+			src = (const uint8_t *)src + 256;
+		}
+
+		/**
+		 * Copy whatever is left
+		 */
+		goto COPY_BLOCK_255_BACK15;
+	}
+
+	/**
+	 * For copy with unaligned load
+	 */
+	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_64_BACK15;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+static inline void *
+rte_memcpy_aligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+
+	/* Copy size < 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return ret;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return ret;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return ret;
+	}
+
+	/* Copy 64-byte blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever is left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+
+	return ret;
+}
+
+static inline void *
+rte_memcpy_internal(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+		return rte_memcpy_aligned(dst, src, n);
+	else
+		return rte_memcpy_generic(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCPY_INTERNAL_X86_64_H_ */
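
Note on the small-size paths in rte_memcpy_internal.h above: every variant
handles n < 16 by testing the bits of n (1, 2, 4, then 8 bytes), and handles
16 <= n <= 32 with two 16-byte moves where the second one starts at n - 16 and
may overlap the first. The sketch below restates both ideas in portable C so
they can be checked in isolation. The sketch_* names are hypothetical and not
part of the patch, and fixed-size memcpy() calls stand in for the typed stores
and the rte_mov16() intrinsics used by the real code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper: copy n < 16 bytes by walking the set bits of n.
 * Each set bit contributes one chunk of 1, 2, 4 or 8 bytes.
 */
static void *
sketch_copy_lt16(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	assert(n < 16);
	if (n & 0x01) {
		memcpy(d, s, 1);
		d += 1;
		s += 1;
	}
	if (n & 0x02) {
		memcpy(d, s, 2);
		d += 2;
		s += 2;
	}
	if (n & 0x04) {
		memcpy(d, s, 4);
		d += 4;
		s += 4;
	}
	if (n & 0x08)
		memcpy(d, s, 8);
	return dst;
}

/*
 * Hypothetical helper: copy 16 <= n <= 32 bytes with two 16-byte moves.
 * The second move covers the last 16 bytes, so for n < 32 it rewrites
 * some bytes the first move already copied. That is harmless because
 * both moves store the same source data.
 */
static void *
sketch_copy_16_to_32(void *dst, const void *src, size_t n)
{
	assert(n >= 16 && n <= 32);
	memcpy(dst, src, 16);
	memcpy((uint8_t *)dst + n - 16, (const uint8_t *)src + n - 16, 16);
	return dst;
}
```

The overlapping tail avoids a byte-granular remainder loop: any length in
[16, 32] costs exactly two loads and two stores, at the price of writing a few
bytes twice.
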
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
new file mode 100644
index 0000000..55d6b41
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
@@ -0,0 +1,40 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+void *
+rte_memcpy_sse(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 90bca4d..c8bdac0 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -40,6 +40,7 @@ VPATH += $(RTE_SDK)/lib/librte_eal/common/arch/$(ARCH_DIR)
 LIBABIVER := 5
 
 VPATH += $(RTE_SDK)/lib/librte_eal/common
+VPATH += $(RTE_SDK)/lib/librte_eal/common/include/arch/$(ARCH_DIR)
 
 CFLAGS += -I$(SRCDIR)/include
 CFLAGS += -I$(RTE_SDK)/lib/librte_eal/common
@@ -105,6 +106,24 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
index 8c08b8d..66bbdbb 100644
--- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
@@ -243,3 +243,10 @@ EXPERIMENTAL {
 	rte_service_start_with_defaults;
 
 } DPDK_17.08;
+
+DPDK_17.11 {
+	global:
+
+	rte_memcpy_ptr;
+
+} DPDK_17.08;
diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index a813c91..8a7a1e7 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -134,6 +134,20 @@ endif
 
 MACHINE_CFLAGS += $(addprefix -DRTE_MACHINE_CPUFLAG_,$(CPUFLAGS))
 
+# Check if the compiler supports AVX512
+CC_SUPPORT_AVX512F := $(shell $(CC) -mavx512f -dM -E - < /dev/null 2>&1 | grep -q AVX512 && echo 1)
+ifeq ($(CC_SUPPORT_AVX512F),1)
+ifeq ($(CONFIG_RTE_ENABLE_AVX512),y)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX512F
+endif
+endif
+
+# Check if the compiler supports AVX2
+CC_SUPPORT_AVX2 := $(shell $(CC) -mavx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
+ifeq ($(CC_SUPPORT_AVX2),1)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX2
+endif
+
 # To strip whitespace
 comma:= ,
 empty:=
-- 
2.7.4
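
For readers following along: the version map above exports rte_memcpy_ptr, but rte_memcpy.c itself is not shown in this part of the thread. The general pattern the cover letter describes (a function pointer bound at constructor time) can be sketched roughly as below. All names here (demo_copy_ptr, copy_baseline, copy_avx2, demo_copy_init) are hypothetical and not DPDK symbols, the per-ISA variants are plain memcpy() stand-ins, and the CPU probe uses the GCC/clang x86 builtins rather than rte_cpu_get_flag_enabled().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of these are DPDK symbols. */
typedef void *(*copy_fn_t)(void *dst, const void *src, size_t n);

/* Stand-ins for per-ISA builds of the same routine. */
static void *
copy_baseline(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *
copy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Callers go through this pointer; it starts at the safest variant. */
copy_fn_t demo_copy_ptr = copy_baseline;

/* Runs before main(): probe the CPU once and rebind the pointer. */
static void __attribute__((constructor))
demo_copy_init(void)
{
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		demo_copy_ptr = copy_avx2;
#endif
}
```

Initialising the pointer to the baseline variant means a call made before the constructor runs (or on a non-x86 build) still lands on working code; whether the actual patch does the same is not visible here.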

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [dpdk-dev] [PATCH v7 2/3] app/test: run-time dispatch over memcpy perf test
  2017-10-05 12:33       ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Xiaoyun Li
  2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-05 12:33         ` Xiaoyun Li
  2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
                           ` (3 subsequent siblings)
  5 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-05 12:33 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch moves the selection of the alignment unit from build time
to run time, based on the CPU flags that the running machine supports.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 test/test/test_memcpy_perf.c | 40 +++++++++++++++++++++++++++-------------
 1 file changed, 27 insertions(+), 13 deletions(-)

diff --git a/test/test/test_memcpy_perf.c b/test/test/test_memcpy_perf.c
index ff3aaaa..33def3b 100644
--- a/test/test/test_memcpy_perf.c
+++ b/test/test/test_memcpy_perf.c
@@ -79,13 +79,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
-#define ALIGNMENT_UNIT          64
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-#define ALIGNMENT_UNIT          32
-#else /* RTE_MACHINE_CPUFLAG */
-#define ALIGNMENT_UNIT          16
-#endif /* RTE_MACHINE_CPUFLAG */
+static uint8_t alignment_unit = 16;
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
@@ -100,20 +94,39 @@ static int
 init_buffers(void)
 {
 	unsigned i;
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F))
+		alignment_unit = 64;
+	else
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
+		alignment_unit = 32;
+	else
+#endif
+		alignment_unit = 16;
 
-	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_read = rte_malloc("memcpy",
+				    LARGE_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (large_buf_read == NULL)
 		goto error_large_buf_read;
 
-	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_write = rte_malloc("memcpy",
+				     LARGE_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (large_buf_write == NULL)
 		goto error_large_buf_write;
 
-	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_read = rte_malloc("memcpy",
+				    SMALL_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (small_buf_read == NULL)
 		goto error_small_buf_read;
 
-	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_write = rte_malloc("memcpy",
+				     SMALL_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (small_buf_write == NULL)
 		goto error_small_buf_write;
 
@@ -153,7 +166,7 @@ static inline size_t
 get_rand_offset(size_t uoffset)
 {
 	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
-			~(ALIGNMENT_UNIT - 1)) + uoffset;
+			~(alignment_unit - 1)) + uoffset;
 }
 
 /* Fill in source and destination addresses. */
@@ -321,7 +334,8 @@ perf_test(void)
 		   "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
 		   "------- -------------- -------------- -------------- --------------");
 
-	printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT);
+	printf("\n========================= %2dB aligned ============================",
+		alignment_unit);
 	/* Do aligned tests where size is a variable */
 	perf_test_variable_aligned();
 	printf("\n------- -------------- -------------- -------------- --------------");
-- 
2.7.4
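
As a reading aid, the run-time choice made in init_buffers() and the masking done in get_rand_offset() can be isolated into a small sketch. The enum, the probe callbacks and the function names below are assumptions made for illustration only; the patch itself queries rte_cpu_get_flag_enabled() with RTE_CPUFLAG_* values and keeps the result in a file-scope variable.

```c
#include <stddef.h>

/* Hypothetical feature identifiers; the patch uses RTE_CPUFLAG_* instead. */
enum demo_feature { DEMO_AVX2, DEMO_AVX512F };

/* Pick the widest alignment the reported features allow. */
static unsigned int
select_alignment_unit(int (*has_feature)(enum demo_feature))
{
	if (has_feature(DEMO_AVX512F))
		return 64;
	if (has_feature(DEMO_AVX2))
		return 32;
	return 16;
}

/* Round an offset down to a multiple of a power-of-two alignment. */
static size_t
align_down(size_t offset, unsigned int alignment_unit)
{
	return offset & ~((size_t)alignment_unit - 1);
}

/* Example probes standing in for a real CPU query. */
static int probe_avx2_only(enum demo_feature f) { return f == DEMO_AVX2; }
static int probe_none(enum demo_feature f) { (void)f; return 0; }
```

Passing the probe as a callback is only a convenience for trying the selection logic without the matching hardware; it is not how the test application is structured.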

^ permalink raw reply	[flat|nested] 88+ messages in thread

* [dpdk-dev] [PATCH v7 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-05 12:33       ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Xiaoyun Li
  2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
  2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
@ 2017-10-05 12:33         ` Xiaoyun Li
  2017-10-05 13:24         ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Ananyev, Konstantin
                           ` (2 subsequent siblings)
  5 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-05 12:33 UTC (permalink / raw)
  To: konstantin.ananyev, bruce.richardson
  Cc: wenzhuo.lu, helin.zhang, dev, Xiaoyun Li

This patch compiles the x86 EFD file only if the compiler supports
AVX2, since the AVX2 lookup function is already chosen at run-time.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_efd/Makefile      |  6 ++++
 lib/librte_efd/rte_efd_x86.c | 77 ++++++++++++++++++++++++++++++++++++++++++++
 lib/librte_efd/rte_efd_x86.h | 48 ++-------------------------
 3 files changed, 85 insertions(+), 46 deletions(-)
 create mode 100644 lib/librte_efd/rte_efd_x86.c

diff --git a/lib/librte_efd/Makefile b/lib/librte_efd/Makefile
index b9277bc..35bb2bd 100644
--- a/lib/librte_efd/Makefile
+++ b/lib/librte_efd/Makefile
@@ -44,6 +44,12 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_EFD) := rte_efd.c
 
+# if the compiler supports AVX2, add efd x86 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_efd_x86.c
+CFLAGS_rte_efd_x86.o += -mavx2
+endif
+
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_EFD)-include := rte_efd.h
 
diff --git a/lib/librte_efd/rte_efd_x86.c b/lib/librte_efd/rte_efd_x86.c
new file mode 100644
index 0000000..49677db
--- /dev/null
+++ b/lib/librte_efd/rte_efd_x86.c
@@ -0,0 +1,77 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2016-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/* rte_efd_x86.c
+ * This file holds all x86 specific EFD functions
+ */
+#include <rte_efd.h>
+#include <rte_efd_x86.h>
+
+#if (RTE_EFD_VALUE_NUM_BITS == 8 || RTE_EFD_VALUE_NUM_BITS == 16 || \
+	RTE_EFD_VALUE_NUM_BITS == 24 || RTE_EFD_VALUE_NUM_BITS == 32)
+#define EFD_LOAD_SI128(val) _mm_load_si128(val)
+#else
+#define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
+#endif
+
+efd_value_t
+efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b)
+{
+	efd_value_t value = 0;
+	uint32_t i = 0;
+	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
+	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
+
+	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
+		__m256i vhash_idx =
+				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
+				(__m128i const *) &group_hash_idx[i]));
+		__m256i vlookup_table = _mm256_cvtepu16_epi32(
+				EFD_LOAD_SI128((__m128i const *)
+				&group_lookup_table[i]));
+		__m256i vhash = _mm256_add_epi32(vhash_val_a,
+				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
+		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
+				EFD_LOOKUPTBL_SHIFT);
+		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
+				vbucket_idx);
+
+		value |= (_mm256_movemask_ps(
+			(__m256) _mm256_slli_epi32(vresult, 31))
+			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
+	}
+
+	return value;
+}
diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
index 34f37d7..7a082aa 100644
--- a/lib/librte_efd/rte_efd_x86.h
+++ b/lib/librte_efd/rte_efd_x86.h
@@ -36,51 +36,7 @@
  */
 #include <immintrin.h>
 
-#if (RTE_EFD_VALUE_NUM_BITS == 8 || RTE_EFD_VALUE_NUM_BITS == 16 || \
-	RTE_EFD_VALUE_NUM_BITS == 24 || RTE_EFD_VALUE_NUM_BITS == 32)
-#define EFD_LOAD_SI128(val) _mm_load_si128(val)
-#else
-#define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
-#endif
-
-static inline efd_value_t
+extern efd_value_t
 efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
 		const efd_lookuptbl_t *group_lookup_table,
-		const uint32_t hash_val_a, const uint32_t hash_val_b)
-{
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
-	efd_value_t value = 0;
-	uint32_t i = 0;
-	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
-	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
-
-	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
-		__m256i vhash_idx =
-				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
-				(__m128i const *) &group_hash_idx[i]));
-		__m256i vlookup_table = _mm256_cvtepu16_epi32(
-				EFD_LOAD_SI128((__m128i const *)
-				&group_lookup_table[i]));
-		__m256i vhash = _mm256_add_epi32(vhash_val_a,
-				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
-		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
-				EFD_LOOKUPTBL_SHIFT);
-		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
-				vbucket_idx);
-
-		value |= (_mm256_movemask_ps(
-			(__m256) _mm256_slli_epi32(vresult, 31))
-			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
-	}
-
-	return value;
-#else
-	RTE_SET_USED(group_hash_idx);
-	RTE_SET_USED(group_lookup_table);
-	RTE_SET_USED(hash_val_a);
-	RTE_SET_USED(hash_val_b);
-	/* Return dummy value, only to avoid compilation breakage */
-	return 0;
-#endif
-
-}
+		const uint32_t hash_val_a, const uint32_t hash_val_b);
-- 
2.7.4

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v7 0/3] run-time Linking support
  2017-10-05 12:33       ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Xiaoyun Li
                           ` (2 preceding siblings ...)
  2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
@ 2017-10-05 13:24         ` Ananyev, Konstantin
  2017-10-09 17:40         ` Thomas Monjalon
  2017-10-13  9:01         ` [dpdk-dev] [PATCH v8 " Xiaoyun Li
  5 siblings, 0 replies; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-05 13:24 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce; +Cc: Lu, Wenzhuo, Zhang, Helin, dev



> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Thursday, October 5, 2017 1:33 PM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>; dev@dpdk.org; Li, Xiaoyun <xiaoyun.li@intel.com>
> Subject: [PATCH v7 0/3] run-time Linking support
> 
> This patchset dynamically selects functions at run-time based on CPU flags
> that the current machine supports. This patchset modifies memcpy, memcpy perf
> test and x86 EFD, using function pointers and binding them at constructor time.
> Then in the cloud environment, users can compile once for the minimum target
> such as 'haswell'(not 'native') and run on different platforms (equal or above
> haswell) and can get ISA optimization based on running CPU.
> 
> Xiaoyun Li (3):
>   eal/x86: run-time dispatch over memcpy
>   app/test: run-time dispatch over memcpy perf test
>   efd: run-time dispatch over x86 EFD functions
> 
> ---
> v2
> * Use gcc function multi-versioning to avoid compilation issues.
> * Add macros for AVX512 and AVX2. Only if users enable AVX512 and the compiler
> supports it, the AVX512 codes would be compiled. Only if the compiler supports
> AVX2, the AVX2 codes would be compiled.
> 
> v3
> * Reduce function calls via only keep rte_memcpy_xxx.
> * Add conditions that when copy size is small, use inline code path.
> Otherwise, use dynamic code path.
> * To support attribute target, clang version must be greater than 3.7.
> Otherwise, would choose SSE/AVX code path, the same as before.
> * Move two macro functions to the top of the code since they would be used in
> inline SSE/AVX and dynamic SSE/AVX codes.
> 
> v4
> * Modify rte_memcpy.h to several .c files and modify makefiles to compile
> AVX2 and AVX512 files.
> 
> v5
> * Delete redundant repeated codes of rte_memcpy_xxx.
> * Modify makefiles to enable reuse of existing rte_memcpy.
> * Delete redundant codes of rte_efd_x86.h in v4. Move it into .c file and enable
> compilation -mavx2 for it in makefile since it is already chosen at run-time.
> 
> v6
> * Fix shared target build failure.
> * Safely remove redundant efd x86 avx2 codes since the file is compiled
> with -mavx2.
> 
> v7
> * Modify the added version map code in v6 to be more reasonable.
> * Safely remove redundant efd x86 avx2 codes since the file is compiled
> with -mavx2.
> 
>  lib/librte_eal/bsdapp/eal/Makefile                 |  19 +
>  lib/librte_eal/bsdapp/eal/rte_eal_version.map      |   7 +
>  .../common/include/arch/x86/rte_memcpy.c           |  59 ++
>  .../common/include/arch/x86/rte_memcpy.h           | 861 +------------------
>  .../common/include/arch/x86/rte_memcpy_avx2.c      |  44 +
>  .../common/include/arch/x86/rte_memcpy_avx512f.c   |  44 +
>  .../common/include/arch/x86/rte_memcpy_internal.h  | 909 +++++++++++++++++++++
>  .../common/include/arch/x86/rte_memcpy_sse.c       |  40 +
>  lib/librte_eal/linuxapp/eal/Makefile               |  19 +
>  lib/librte_eal/linuxapp/eal/rte_eal_version.map    |   7 +
>  lib/librte_efd/Makefile                            |   6 +
>  lib/librte_efd/rte_efd_x86.c                       |  77 ++
>  lib/librte_efd/rte_efd_x86.h                       |  48 +-
>  mk/rte.cpuflags.mk                                 |  14 +
>  test/test/test_memcpy_perf.c                       |  40 +-
>  15 files changed, 1289 insertions(+), 905 deletions(-)
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
>  create mode 100644 lib/librte_efd/rte_efd_x86.c
> 
> --

Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>

> 2.7.4

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v7 0/3] run-time Linking support
  2017-10-05 12:33       ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Xiaoyun Li
                           ` (3 preceding siblings ...)
  2017-10-05 13:24         ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Ananyev, Konstantin
@ 2017-10-09 17:40         ` Thomas Monjalon
  2017-10-13  0:58           ` Li, Xiaoyun
  2017-10-13  9:01         ` [dpdk-dev] [PATCH v8 " Xiaoyun Li
  5 siblings, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-09 17:40 UTC (permalink / raw)
  To: Xiaoyun Li
  Cc: dev, konstantin.ananyev, bruce.richardson, wenzhuo.lu, helin.zhang

Hi,

05/10/2017 14:33, Xiaoyun Li:
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
>  create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c

Why are you adding some .c files in the include directory?
I think it should be located in lib/librte_eal/common/arch/x86/

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-09 17:47           ` Thomas Monjalon
  2017-10-13  1:06             ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-09 17:47 UTC (permalink / raw)
  To: Xiaoyun Li
  Cc: dev, konstantin.ananyev, bruce.richardson, wenzhuo.lu, helin.zhang

05/10/2017 14:33, Xiaoyun Li:
> +/**
> + * Macro for copying unaligned block from one location to another with constant load offset,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 

Naive question:
Is there a real benefit of using a macro compared to a static inline
function optimized by a modern compiler?

Anyway, if you are doing a new version, please reduce lines length
and fix the indent from spaces to tabs.

Thank you

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v7 0/3] run-time Linking support
  2017-10-09 17:40         ` Thomas Monjalon
@ 2017-10-13  0:58           ` Li, Xiaoyun
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-13  0:58 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Ananyev, Konstantin, Richardson, Bruce, Lu, Wenzhuo, Zhang, Helin

Hi
Sorry for the late reply.
The original rte_memcpy.h is in include directory. So I just add them there.
But you are right, I should move them to common/arch/x86/ directory.
I will modify that in next version.


Best Regards
Xiaoyun Li



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Tuesday, October 10, 2017 01:41
> To: Li, Xiaoyun <xiaoyun.li@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v7 0/3] run-time Linking support
> 
> Hi,
> 
> 05/10/2017 14:33, Xiaoyun Li:
> >  create mode 100644
> lib/librte_eal/common/include/arch/x86/rte_memcpy.c
> >  create mode 100644
> lib/librte_eal/common/include/arch/x86/rte_memcpy_avx2.c
> >  create mode 100644
> lib/librte_eal/common/include/arch/x86/rte_memcpy_avx512f.c
> >  create mode 100644
> lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
> >  create mode 100644
> lib/librte_eal/common/include/arch/x86/rte_memcpy_sse.c
> 
> Why are you adding some .c files in the include directory?
> I think it should be located in lib/librte_eal/common/arch/x86/

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-09 17:47           ` Thomas Monjalon
@ 2017-10-13  1:06             ` Li, Xiaoyun
  2017-10-13  7:21               ` Thomas Monjalon
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-13  1:06 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Ananyev, Konstantin, Richardson, Bruce, Lu, Wenzhuo, Zhang, Helin

Hi
Sorry for the late reply. I took AL last 3 days.

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Tuesday, October 10, 2017 01:47
> To: Li, Xiaoyun <xiaoyun.li@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 05/10/2017 14:33, Xiaoyun Li:
> > +/**
> > + * Macro for copying unaligned block from one location to another
> > +with constant load offset,
> > + * 47 bytes leftover maximum,
> > + * locations should not overlap.
> > + * Requirements:
> > + * - Store is aligned
> > + * - Load offset is <offset>, which must be immediate value within
> > +[1, 15]
> > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit
> > +forwards are available for loading
> > + * - <dst>, <src>, <len> must be variables
> > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined  */ #define
> > +MOVEUNALIGNED_LEFT47_IMM(dst, src, len,
> 
> Naive question:
> Is there a real benefit of using a macro compared to a static inline function
> optimized by a modern compiler?
> 
The macro is in the existing DPDK code. I didn't touch it. I only changed the file name and renamed the function to rte_memcpy_internal.
So I am not sure whether there is a real benefit.
In my opinion, it is the same as a static inline function.

Do I need to change them to inline function?

> Anyway, if you are doing a new version, please reduce lines length and fix
> the indent from spaces to tabs.
> 
They are original DPDK codes so I didn't touch them.
But in next version, I will fix them.

Best Regards
Xiaoyun Li



> Thank you

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-13  1:06             ` Li, Xiaoyun
@ 2017-10-13  7:21               ` Thomas Monjalon
  2017-10-13  7:30                 ` Li, Xiaoyun
  2017-10-13  7:31                 ` Ananyev, Konstantin
  0 siblings, 2 replies; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-13  7:21 UTC (permalink / raw)
  To: Li, Xiaoyun
  Cc: dev, Ananyev, Konstantin, Richardson, Bruce, Lu, Wenzhuo, Zhang, Helin

13/10/2017 03:06, Li, Xiaoyun:
> Hi
> Sorry for the late reply. I took AL last 3 days.
> 
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > 05/10/2017 14:33, Xiaoyun Li:
> > > +/**
> > > + * Macro for copying unaligned block from one location to another
> > > +with constant load offset,
> > > + * 47 bytes leftover maximum,
> > > + * locations should not overlap.
> > > + * Requirements:
> > > + * - Store is aligned
> > > + * - Load offset is <offset>, which must be immediate value within
> > > +[1, 15]
> > > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit
> > > +forwards are available for loading
> > > + * - <dst>, <src>, <len> must be variables
> > > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined  */ #define
> > > +MOVEUNALIGNED_LEFT47_IMM(dst, src, len,
> > 
> > Naive question:
> > Is there a real benefit of using a macro compared to a static inline function
> > optimized by a modern compiler?
> > 
> The macro is in the existing DPDK codes. I didn't touch it. I just change the file name and the function name to rte_memcpy_internal.
> So I am not clear about if there is real benefit.
> In my opinion, I think it is the same as static inline function.
> 
> Do I need to change them to inline function?

In this patch, it appears as a new macro.
If you can, inline function is cleaner for the new one.

> > Anyway, if you are doing a new version, please reduce lines length and fix
> > the indent from spaces to tabs.
> > 
> They are original DPDK codes so I didn't touch them.
> But in next version, I will fix them.

Just to be sure: we are talking about fixing checkpatch warnings
only for the code added, changed or moved.

Thanks

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-13  7:21               ` Thomas Monjalon
@ 2017-10-13  7:30                 ` Li, Xiaoyun
  2017-10-13  7:31                 ` Ananyev, Konstantin
  1 sibling, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-13  7:30 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Ananyev, Konstantin, Richardson, Bruce, Lu, Wenzhuo, Zhang, Helin

OK. Would send new patchset later.
Thanks!


Best Regards
Xiaoyun Li




> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Friday, October 13, 2017 15:22
> To: Li, Xiaoyun <xiaoyun.li@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 13/10/2017 03:06, Li, Xiaoyun:
> > Hi
> > Sorry for the late reply. I took AL last 3 days.
> >
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > 05/10/2017 14:33, Xiaoyun Li:
> > > > +/**
> > > > + * Macro for copying unaligned block from one location to another
> > > > +with constant load offset,
> > > > + * 47 bytes leftover maximum,
> > > > + * locations should not overlap.
> > > > + * Requirements:
> > > > + * - Store is aligned
> > > > + * - Load offset is <offset>, which must be immediate value
> > > > +within [1, 15]
> > > > + * - For <src>, make sure <offset> bit backwards & <16 - offset>
> > > > +bit forwards are available for loading
> > > > + * - <dst>, <src>, <len> must be variables
> > > > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined  */ #define
> > > > +MOVEUNALIGNED_LEFT47_IMM(dst, src, len,
> > >
> > > Naive question:
> > > Is there a real benefit of using a macro compared to a static inline
> > > function optimized by a modern compiler?
> > >
> > The macro is in the existing DPDK codes. I didn't touch it. I just change the
> file name and the function name to rte_memcpy_internal.
> > So I am not clear about if there is real benefit.
> > In my opinion, I think it is the same as static inline function.
> >
> > Do I need to change them to inline function?
> 
> In this patch, it appears as a new macro.
> If you can, inline function is cleaner for the new one.
> 
> > > Anyway, if you are doing a new version, please reduce lines length
> > > and fix the indent from spaces to tabs.
> > >
> > They are original DPDK codes so I didn't touch them.
> > But in next version, I will fix them.
> 
> Just to be sure: we are talking about fixing checkpatch warnings only for the
> code added, changed or moved.
> 
> Thanks

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-13  7:21               ` Thomas Monjalon
  2017-10-13  7:30                 ` Li, Xiaoyun
@ 2017-10-13  7:31                 ` Ananyev, Konstantin
  2017-10-13  7:36                   ` Thomas Monjalon
  1 sibling, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-13  7:31 UTC (permalink / raw)
  To: Thomas Monjalon, Li, Xiaoyun
  Cc: dev, Richardson, Bruce, Lu, Wenzhuo, Zhang, Helin



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Friday, October 13, 2017 8:22 AM
> To: Li, Xiaoyun <xiaoyun.li@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy
> 
> 13/10/2017 03:06, Li, Xiaoyun:
> > Hi
> > Sorry for the late reply. I took AL last 3 days.
> >
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > 05/10/2017 14:33, Xiaoyun Li:
> > > > +/**
> > > > + * Macro for copying unaligned block from one location to another
> > > > +with constant load offset,
> > > > + * 47 bytes leftover maximum,
> > > > + * locations should not overlap.
> > > > + * Requirements:
> > > > + * - Store is aligned
> > > > + * - Load offset is <offset>, which must be immediate value within
> > > > +[1, 15]
> > > > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit
> > > > +forwards are available for loading
> > > > + * - <dst>, <src>, <len> must be variables
> > > > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined  */ #define
> > > > +MOVEUNALIGNED_LEFT47_IMM(dst, src, len,
> > >
> > > Naive question:
> > > Is there a real benefit of using a macro compared to a static inline function
> > > optimized by a modern compiler?
> > >
> > The macro is in the existing DPDK codes. I didn't touch it. I just change the file name and the function name to rte_memcpy_internal.
> > So I am not clear about if there is real benefit.
> > In my opinion, I think it is the same as static inline function.
> >
> > Do I need to change them to inline function?
> 
> In this patch, it appears as a new macro.

Ah no, it has definitely been there before.
All we did here was git mv rte_memcpy.h rte_memcpy_internal.h,
and then in rte_memcpy_internal.h we renamed rte_memcpy() to rte_memcpy_internal().

> If you can, inline function is cleaner for the new one.

I don't think it will be straightforward - one of the parameters is a constant value.
My preference would be to keep the original rte_memcpy() code intact as much as we can here
(except probably cosmetic changes - indentation, line length fixing etc.).
After all, this patch is only for adding architecture function selection at runtime.
If we want to improve our rte_memcpy() any further - NP with that, but let it be a
separate patch.
Konstantin
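
To illustrate the point about the constant parameter: the macro's doc comment says <offset> must be an immediate value, and a function parameter is not a constant expression at the language level. The sketch below is only a portable analogy, not the DPDK code. JOIN_SHIFT_IMM and join_shift_3 are hypothetical names, and 64-bit integers stand in for the __m128i registers. The _Static_assert compiles only because the macro argument is expanded as a literal constant.

```c
#include <stdint.h>

/*
 * Hypothetical analogy: concatenate two 64-bit words and shift right by
 * <offset> bytes. <offset> must be an integer constant in [1, 7]; the
 * _Static_assert enforces this at compile time, which would not be
 * possible with an ordinary function parameter.
 */
#define JOIN_SHIFT_IMM(out, lo, hi, offset)                             \
do {                                                                    \
	_Static_assert((offset) >= 1 && (offset) <= 7,                  \
		       "offset must be a constant in [1, 7]");          \
	(out) = ((lo) >> (8 * (offset))) |                              \
		((hi) << (64 - 8 * (offset)));                          \
} while (0)

/* One instantiation per constant offset, as a macro-based design implies. */
uint64_t
join_shift_3(uint64_t lo, uint64_t hi)
{
	uint64_t out;

	JOIN_SHIFT_IMM(out, lo, hi, 3);
	return out;
}
```

With an always-inline function and optimization enabled, compilers can often still fold the argument into an immediate, but that depends on the optimizer rather than on the language, which may be why it is not considered straightforward here.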

> 
> > > Anyway, if you are doing a new version, please reduce lines length and fix
> > > the indent from spaces to tabs.
> > >
> > They are original DPDK codes so I didn't touch them.
> > But in next version, I will fix them.
> 
> Just to be sure: we are talking about fixing checkpatch warnings
> only for the code added, changed or moved.
> 
> Thanks


* Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-13  7:31                 ` Ananyev, Konstantin
@ 2017-10-13  7:36                   ` Thomas Monjalon
  2017-10-13  7:41                     ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-13  7:36 UTC (permalink / raw)
  To: Ananyev, Konstantin, Li, Xiaoyun
  Cc: dev, Richardson, Bruce, Lu, Wenzhuo, Zhang, Helin

13/10/2017 09:31, Ananyev, Konstantin:
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > 13/10/2017 03:06, Li, Xiaoyun:
> > > Hi
> > > Sorry for the late reply. I took AL last 3 days.
> > >
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > 05/10/2017 14:33, Xiaoyun Li:
> > > > > +/**
> > > > > + * Macro for copying unaligned block from one location to another
> > > > > +with constant load offset,
> > > > > + * 47 bytes leftover maximum,
> > > > > + * locations should not overlap.
> > > > > + * Requirements:
> > > > > + * - Store is aligned
> > > > > + * - Load offset is <offset>, which must be immediate value within
> > > > > +[1, 15]
> > > > > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit
> > > > > +forwards are available for loading
> > > > > + * - <dst>, <src>, <len> must be variables
> > > > > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined  */ #define
> > > > > +MOVEUNALIGNED_LEFT47_IMM(dst, src, len,
> > > >
> > > > Naive question:
> > > > Is there a real benefit of using a macro compared to a static inline function
> > > > optimized by a modern compiler?
> > > >
> > > The macro is in the existing DPDK codes. I didn't touch it. I just change the file name and the function name to rte_memcpy_internal.
> > > So I am not clear about if there is real benefit.
> > > In my opinion, I think it is the same as static inline function.
> > >
> > > Do I need to change them to inline function?
> > 
> > In this patch, it appears as a new macro.
> 
> Ah no, it definitely been there before.
> All we did here - git mv rte_memcpy.h rte_memcpy_internal.h
> and then in rte_memcpy_internal.h renamed rte_memcpy() to rte_memcpy_internal().
> 
> > If you can, inline function is cleaner for the new one.
> 
> I don't think it will be straightforward - one of the parameters is a constant value.
> My preference would be to keep original rte_memcpy() code intact as much as we can here
> (except probably cosmetic changes - indentation, line length fixing etc.).
> After all that patch is for adding architecture function selection at runtime only.
> If we like to improve our rte_memcpy() any further - NP with that, but let it be a
> separate patch.

OK

I am waiting for this patch to close RC1 today.


* Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-13  7:36                   ` Thomas Monjalon
@ 2017-10-13  7:41                     ` Li, Xiaoyun
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-13  7:41 UTC (permalink / raw)
  To: Thomas Monjalon, Ananyev, Konstantin
  Cc: dev, Richardson, Bruce, Lu, Wenzhuo, Zhang, Helin



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Friday, October 13, 2017 15:36
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Li, Xiaoyun
> <xiaoyun.li@intel.com>
> Cc: dev@dpdk.org; Richardson, Bruce <bruce.richardson@intel.com>; Lu,
> Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 13/10/2017 09:31, Ananyev, Konstantin:
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > 13/10/2017 03:06, Li, Xiaoyun:
> > > > Hi
> > > > Sorry for the late reply. I took AL last 3 days.
> > > >
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > 05/10/2017 14:33, Xiaoyun Li:
> > > > > > +/**
> > > > > > + * Macro for copying unaligned block from one location to
> > > > > > +another with constant load offset,
> > > > > > + * 47 bytes leftover maximum,
> > > > > > + * locations should not overlap.
> > > > > > + * Requirements:
> > > > > > + * - Store is aligned
> > > > > > + * - Load offset is <offset>, which must be immediate value
> > > > > > +within [1, 15]
> > > > > > + * - For <src>, make sure <offset> bit backwards & <16 -
> > > > > > +offset> bit forwards are available for loading
> > > > > > + * - <dst>, <src>, <len> must be variables
> > > > > > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined  */ #define
> > > > > > +MOVEUNALIGNED_LEFT47_IMM(dst, src, len,
> > > > >
> > > > > Naive question:
> > > > > Is there a real benefit of using a macro compared to a static
> > > > > inline function optimized by a modern compiler?
> > > > >
> > > > The macro is in the existing DPDK codes. I didn't touch it. I just change
> the file name and the function name to rte_memcpy_internal.
> > > > So I am not clear about if there is real benefit.
> > > > In my opinion, I think it is the same as static inline function.
> > > >
> > > > Do I need to change them to inline function?
> > >
> > > In this patch, it appears as a new macro.
> >
> > Ah no, it definitely been there before.
> > > All we did here - git mv rte_memcpy.h rte_memcpy_internal.h and then
> > in rte_memcpy_internal.h renamed rte_memcpy() to
> rte_memcpy_internal().
> >
> > > If you can, inline function is cleaner for the new one.
> >
> > I don't think it will be straightforward - one of the parameters is a constant
> value.
> > My preference would be to keep original rte_memcpy() code intact as
> > much as we can here (except probably cosmetic changes - indentation, line
> length fixing etc.).
> > After all that patch is for adding architecture function selection at runtime
> only.
> > > If we like to improve our rte_memcpy() any further - NP with that, but
> > let it be a separate patch.
> 
> OK
> 
Then I will just fix the indentation and line length and keep the original macro.

> I am waiting this patch to close RC1 today.
I will do it ASAP.

Best Regards
Xiaoyun Li


* [dpdk-dev] [PATCH v8 0/3] run-time Linking support
  2017-10-05 12:33       ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Xiaoyun Li
                           ` (4 preceding siblings ...)
  2017-10-09 17:40         ` Thomas Monjalon
@ 2017-10-13  9:01         ` Xiaoyun Li
  2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
                             ` (3 more replies)
  5 siblings, 4 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-13  9:01 UTC (permalink / raw)
  To: thomas, konstantin.ananyev
  Cc: dev, bruce.richardson, wenzhuo.lu, helin.zhang, Xiaoyun Li

This patchset dynamically selects functions at run-time based on the CPU flags
that the current machine supports. It modifies memcpy, the memcpy perf
test and x86 EFD to use function pointers that are bound at constructor time.
In a cloud environment, users can then compile once for the minimum target,
such as 'haswell' (not 'native'), run on different platforms (equal to or above
haswell), and still get ISA optimization based on the running CPU.
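
The binding described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this patchset: copy_baseline, copy_avx2, copy_ptr and copy_impl_name are hypothetical names, both variants simply call memcpy(), and the GCC/clang builtin __builtin_cpu_supports() stands in for rte_cpu_get_flag_enabled().

```c
#include <stddef.h>
#include <string.h>

typedef void *(*copy_fn)(void *dst, const void *src, size_t n);

/* Hypothetical baseline variant; a real build would use SSE/AVX code. */
static void *
copy_baseline(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

#if defined(__x86_64__) || defined(__i386__)
/* Hypothetical stand-in for a variant built in its own file with -mavx2. */
static void *
copy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}
#endif

/* Dispatch pointer, valid even before the constructor has run. */
static copy_fn copy_ptr = copy_baseline;
static const char *copy_impl = "baseline";

/* Runs before main(): pick the best variant the running CPU supports. */
static void __attribute__((constructor))
copy_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2")) {
		copy_ptr = copy_avx2;
		copy_impl = "avx2";
	}
#endif
}

/* Report which variant the constructor selected. */
const char *
copy_impl_name(void)
{
	return copy_impl;
}
```

Callers then invoke copy_ptr(dst, src, n) without knowing which variant was bound, so one binary built for the minimum target can still use the wider instructions where they exist.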

Xiaoyun Li (3):
  eal/x86: run-time dispatch over memcpy
  app/test: run-time dispatch over memcpy perf test
  efd: run-time dispatch over x86 EFD functions

---
v2
* Use gcc function multi-versioning to avoid compilation issues.
* Add macros for AVX512 and AVX2. The AVX512 code is compiled only if users
enable AVX512 and the compiler supports it. The AVX2 code is compiled only if
the compiler supports AVX2.

v3
* Reduce function calls by keeping only rte_memcpy_xxx.
* Add a condition: when the copy size is small, use the inline code path.
Otherwise, use the dynamic code path.
* To support attribute target, the clang version must be greater than 3.7.
Otherwise, the SSE/AVX code path is chosen, the same as before.
* Move two macro functions to the top of the code since they are used in
both the inline SSE/AVX and the dynamic SSE/AVX code.

v4
* Split rte_memcpy.h into several .c files and modify makefiles to compile
the AVX2 and AVX512 files.

v5
* Delete redundant repeated code of rte_memcpy_xxx.
* Modify makefiles to enable reuse of the existing rte_memcpy.
* Delete redundant code of rte_efd_x86.h from v4. Move it into a .c file and
compile it with -mavx2 in the makefile since it is already chosen at run-time.

v6
* Fix shared target build failure.
* Safely remove redundant efd x86 avx2 codes since the file is compiled
with -mavx2.

v7
* Modify the added version map code in v6 to be more reasonable.
* Safely remove redundant efd x86 avx2 codes since the file is compiled
with -mavx2.

v8
* Move added .c files to .../common/arch/x86 directory.
* Fix patchset warnings.
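
The v3 note about small copies (inline path) versus larger copies (dynamic path) can be sketched as below. This is a hedged illustration, not the patch's wrapper: copy, copy_inline_small, copy_out_of_line and dispatched_calls are hypothetical names, the threshold value 128 matches RTE_X86_MEMCPY_THRESH in this series, and treating the boundary as inclusive (n <= threshold) is an assumption.

```c
#include <stddef.h>
#include <string.h>

#define COPY_THRESH 128	/* same value as RTE_X86_MEMCPY_THRESH */

/* Hypothetical counter, only here to make the chosen path observable. */
static unsigned long dispatched_calls;

/* Stand-in for the ISA-specific function reached through the pointer. */
static void *
copy_out_of_line(void *dst, const void *src, size_t n)
{
	dispatched_calls++;
	return memcpy(dst, src, n);
}

static void *(*copy_ptr)(void *, const void *, size_t) = copy_out_of_line;

/* Stand-in for the inline code path compiled for the minimum target. */
static inline void *
copy_inline_small(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Small copies avoid the indirect call; larger ones are dispatched. */
static inline void *
copy(void *dst, const void *src, size_t n)
{
	if (n <= COPY_THRESH)
		return copy_inline_small(dst, src, n);
	return copy_ptr(dst, src, n);
}
```

The idea is that an indirect call costs more relative to the work done for a short copy, while for a long copy the benefit of the wider instructions is likely to outweigh it.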

 lib/librte_eal/bsdapp/eal/Makefile                 |  18 +
 lib/librte_eal/bsdapp/eal/rte_eal_version.map      |   1 +
 lib/librte_eal/common/arch/x86/rte_memcpy.c        |  59 ++
 lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c   |  44 +
 .../common/arch/x86/rte_memcpy_avx512f.c           |  44 +
 lib/librte_eal/common/arch/x86/rte_memcpy_sse.c    |  40 +
 .../common/include/arch/x86/rte_memcpy.h           | 861 +-----------------
 .../common/include/arch/x86/rte_memcpy_internal.h  | 966 +++++++++++++++++++++
 lib/librte_eal/linuxapp/eal/Makefile               |  18 +
 lib/librte_eal/linuxapp/eal/rte_eal_version.map    |   1 +
 lib/librte_efd/Makefile                            |   6 +
 lib/librte_efd/rte_efd_x86.c                       |  77 ++
 lib/librte_efd/rte_efd_x86.h                       |  48 +-
 mk/rte.cpuflags.mk                                 |  14 +
 test/test/test_memcpy_perf.c                       |  50 +-
 15 files changed, 1342 insertions(+), 905 deletions(-)
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy_sse.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
 create mode 100644 lib/librte_efd/rte_efd_x86.c

-- 
2.7.4


* [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-13  9:01         ` [dpdk-dev] [PATCH v8 " Xiaoyun Li
@ 2017-10-13  9:01           ` Xiaoyun Li
  2017-10-13  9:28             ` Thomas Monjalon
  2017-10-17 21:24             ` Thomas Monjalon
  2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
                             ` (2 subsequent siblings)
  3 siblings, 2 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-13  9:01 UTC (permalink / raw)
  To: thomas, konstantin.ananyev
  Cc: dev, bruce.richardson, wenzhuo.lu, helin.zhang, Xiaoyun Li

This patch dynamically selects memcpy functions at run-time based
on the CPU flags that the current machine supports. This patch uses function
pointers which are bound to the relevant functions at constructor time.
In addition, the AVX512 instruction set is compiled only if users
enable it in the config and the compiler supports it.

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_eal/bsdapp/eal/Makefile                 |  18 +
 lib/librte_eal/bsdapp/eal/rte_eal_version.map      |   1 +
 lib/librte_eal/common/arch/x86/rte_memcpy.c        |  59 ++
 lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c   |  44 +
 .../common/arch/x86/rte_memcpy_avx512f.c           |  44 +
 lib/librte_eal/common/arch/x86/rte_memcpy_sse.c    |  40 +
 .../common/include/arch/x86/rte_memcpy.h           | 861 +-----------------
 .../common/include/arch/x86/rte_memcpy_internal.h  | 966 +++++++++++++++++++++
 lib/librte_eal/linuxapp/eal/Makefile               |  18 +
 lib/librte_eal/linuxapp/eal/rte_eal_version.map    |   1 +
 mk/rte.cpuflags.mk                                 |  14 +
 11 files changed, 1220 insertions(+), 846 deletions(-)
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy_sse.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h

diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile
index 317a75e..02091fb 100644
--- a/lib/librte_eal/bsdapp/eal/Makefile
+++ b/lib/librte_eal/bsdapp/eal/Makefile
@@ -93,6 +93,24 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/bsdapp/eal/rte_eal_version.map b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
index d19f264..080896f 100644
--- a/lib/librte_eal/bsdapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
@@ -243,6 +243,7 @@ DPDK_17.11 {
 	rte_eal_iova_mode;
 	rte_eal_mbuf_default_mempool_ops;
 	rte_lcore_has_role;
+	rte_memcpy_ptr;
 	rte_pci_get_iommu_class;
 	rte_pci_match;
 
diff --git a/lib/librte_eal/common/arch/x86/rte_memcpy.c b/lib/librte_eal/common/arch/x86/rte_memcpy.c
new file mode 100644
index 0000000..74ae702
--- /dev/null
+++ b/lib/librte_eal/common/arch/x86/rte_memcpy.c
@@ -0,0 +1,59 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+#include <rte_cpuflags.h>
+#include <rte_log.h>
+
+void *(*rte_memcpy_ptr)(void *dst, const void *src, size_t n) = NULL;
+
+static void __attribute__((constructor))
+rte_memcpy_init(void)
+{
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
+		rte_memcpy_ptr = rte_memcpy_avx512f;
+		RTE_LOG(DEBUG, EAL, "Using AVX512 memcpy\n");
+		return;
+	}
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
+		rte_memcpy_ptr = rte_memcpy_avx2;
+		RTE_LOG(DEBUG, EAL, "Using AVX2 memcpy\n");
+		return;
+	}
+#endif
+	rte_memcpy_ptr = rte_memcpy_sse;
+	RTE_LOG(DEBUG, EAL, "Using default SSE/AVX memcpy\n");
+}
diff --git a/lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c b/lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c
new file mode 100644
index 0000000..3ad229c
--- /dev/null
+++ b/lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX2
+#error RTE_MACHINE_CPUFLAG_AVX2 not defined
+#endif
+
+void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/arch/x86/rte_memcpy_avx512f.c b/lib/librte_eal/common/arch/x86/rte_memcpy_avx512f.c
new file mode 100644
index 0000000..be8d964
--- /dev/null
+++ b/lib/librte_eal/common/arch/x86/rte_memcpy_avx512f.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX512F
+#error RTE_MACHINE_CPUFLAG_AVX512F not defined
+#endif
+
+void *
+rte_memcpy_avx512f(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/arch/x86/rte_memcpy_sse.c b/lib/librte_eal/common/arch/x86/rte_memcpy_sse.c
new file mode 100644
index 0000000..55d6b41
--- /dev/null
+++ b/lib/librte_eal/common/arch/x86/rte_memcpy_sse.c
@@ -0,0 +1,40 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+void *
+rte_memcpy_sse(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 74c280c..460dcdb 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
  *   All rights reserved.
  *
  *   Redistribution and use in source and binary forms, with or without
@@ -34,867 +34,36 @@
 #ifndef _RTE_MEMCPY_X86_64_H_
 #define _RTE_MEMCPY_X86_64_H_
 
-/**
- * @file
- *
- * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
- */
-
-#include <stdio.h>
-#include <stdint.h>
-#include <string.h>
-#include <rte_vect.h>
-#include <rte_common.h>
+#include <rte_memcpy_internal.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-/**
- * Copy bytes from one location to another. The locations must not overlap.
- *
- * @note This is implemented as a macro, so it's address should not be taken
- * and care is needed as parameter expressions may be evaluated multiple times.
- *
- * @param dst
- *   Pointer to the destination of the data.
- * @param src
- *   Pointer to the source data.
- * @param n
- *   Number of bytes to copy.
- * @return
- *   Pointer to the destination data.
- */
-static __rte_always_inline void *
-rte_memcpy(void *dst, const void *src, size_t n);
-
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define RTE_X86_MEMCPY_THRESH 128
 
-#define ALIGNMENT_MASK 0x3F
+extern void *
+(*rte_memcpy_ptr)(void *dst, const void *src, size_t n);
 
 /**
- * AVX512 implementation below
+ * Different implementations of memcpy.
  */
+extern void *
+rte_memcpy_avx512f(void *dst, const void *src, size_t n);
 
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
+extern void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n);
 
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	__m512i zmm0;
-
-	zmm0 = _mm512_loadu_si512((const void *)src);
-	_mm512_storeu_si512((void *)dst, zmm0);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-	rte_mov64(dst + 2 * 64, src + 2 * 64);
-	rte_mov64(dst + 3 * 64, src + 3 * 64);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1;
-
-	while (n >= 128) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 128;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		src = src + 128;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		dst = dst + 128;
-	}
-}
-
-/**
- * Copy 512-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
-
-	while (n >= 512) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 512;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
-		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
-		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
-		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
-		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
-		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
-		src = src + 512;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
-		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
-		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
-		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
-		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
-		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
-		dst = dst + 512;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				  (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				  (const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK63:
-		if (n > 64) {
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-			return ret;
-		}
-		if (n > 0)
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes
-	 */
-	dstofss = ((uintptr_t)dst & 0x3F);
-	if (dstofss > 0) {
-		dstofss = 64 - dstofss;
-		n -= dstofss;
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 512-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 511;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy 128-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	if (n >= 128) {
-		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-		bits = n;
-		n = n & 127;
-		bits -= n;
-		src = (const uint8_t *)src + bits;
-		dst = (uint8_t *)dst + bits;
-	}
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK63;
-}
-
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-
-#define ALIGNMENT_MASK 0x1F
-
-/**
- * AVX2 implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
-
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
-	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m256i ymm0, ymm1, ymm2, ymm3;
-
-	while (n >= 128) {
-		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 128;
-		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
-		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
-		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
-		src = (const uint8_t *)src + 128;
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
-		dst = (uint8_t *)dst + 128;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 256 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 256) {
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK31:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-		if (n > 32) {
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 256 bytes
-	 */
-	dstofss = (uintptr_t)dst & 0x1F;
-	if (dstofss > 0) {
-		dstofss = 32 - dstofss;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 128-byte blocks
-	 */
-	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 127;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK31;
-}
-
-#else /* RTE_MACHINE_CPUFLAG */
-
-#define ALIGNMENT_MASK 0x0F
-
-/**
- * SSE & AVX implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
-	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
-	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
-	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
-	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
-	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
-	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
-	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
-}
-
-/**
- * Macro for copying unaligned block from one location to another with constant load offset,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be immediate value within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
-__extension__ ({                                                                                            \
-    int tmp;                                                                                                \
-    while (len >= 128 + 16 - offset) {                                                                      \
-        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
-        len -= 128;                                                                                         \
-        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
-        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
-        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
-        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
-        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
-        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
-        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
-        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
-        src = (const uint8_t *)src + 128;                                                                   \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
-        dst = (uint8_t *)dst + 128;                                                                         \
-    }                                                                                                       \
-    tmp = len;                                                                                              \
-    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
-    tmp -= len;                                                                                             \
-    src = (const uint8_t *)src + tmp;                                                                       \
-    dst = (uint8_t *)dst + tmp;                                                                             \
-    if (len >= 32 + 16 - offset) {                                                                          \
-        while (len >= 32 + 16 - offset) {                                                                   \
-            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
-            len -= 32;                                                                                      \
-            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
-            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
-            src = (const uint8_t *)src + 32;                                                                \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
-            dst = (uint8_t *)dst + 32;                                                                      \
-        }                                                                                                   \
-        tmp = len;                                                                                          \
-        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
-        tmp -= len;                                                                                         \
-        src = (const uint8_t *)src + tmp;                                                                   \
-        dst = (uint8_t *)dst + tmp;                                                                         \
-    }                                                                                                       \
-})
-
-/**
- * Macro for copying unaligned block from one location to another,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Use switch here because the aligning instruction requires immediate value for shift count.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
-__extension__ ({                                                      \
-    switch (offset) {                                                 \
-    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
-    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
-    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
-    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
-    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
-    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
-    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
-    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
-    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
-    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
-    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
-    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
-    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
-    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
-    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
-    default:;                                                         \
-    }                                                                 \
-})
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t srcofs;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 128) {
-		goto COPY_BLOCK_128_BACK15;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-COPY_BLOCK_255_BACK15:
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK15:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-COPY_BLOCK_64_BACK15:
-		if (n >= 32) {
-			n -= 32;
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 32;
-			dst = (uint8_t *)dst + 32;
-		}
-		if (n > 16) {
-			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes,
-	 * and make sure the first 15 bytes are copied, because
-	 * unaligned copy functions require up to 15 bytes
-	 * backwards access.
-	 */
-	dstofss = (uintptr_t)dst & 0x0F;
-	if (dstofss > 0) {
-		dstofss = 16 - dstofss + 16;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-	srcofs = ((uintptr_t)src & 0x0F);
-
-	/**
-	 * For aligned copy
-	 */
-	if (srcofs == 0) {
-		/**
-		 * Copy 256-byte blocks
-		 */
-		for (; n >= 256; n -= 256) {
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			dst = (uint8_t *)dst + 256;
-			src = (const uint8_t *)src + 256;
-		}
-
-		/**
-		 * Copy whatever left
-		 */
-		goto COPY_BLOCK_255_BACK15;
-	}
-
-	/**
-	 * For copy with unaligned load
-	 */
-	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_64_BACK15;
-}
-
-#endif /* RTE_MACHINE_CPUFLAG */
-
-static inline void *
-rte_memcpy_aligned(void *dst, const void *src, size_t n)
-{
-	void *ret = dst;
-
-	/* Copy size <= 16 bytes */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dst = *(const uint8_t *)src;
-			src = (const uint8_t *)src + 1;
-			dst = (uint8_t *)dst + 1;
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dst = *(const uint16_t *)src;
-			src = (const uint16_t *)src + 1;
-			dst = (uint16_t *)dst + 1;
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dst = *(const uint32_t *)src;
-			src = (const uint32_t *)src + 1;
-			dst = (uint32_t *)dst + 1;
-		}
-		if (n & 0x08)
-			*(uint64_t *)dst = *(const uint64_t *)src;
-
-		return ret;
-	}
-
-	/* Copy 16 <= size <= 32 bytes */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-
-		return ret;
-	}
-
-	/* Copy 32 < size <= 64 bytes */
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-
-		return ret;
-	}
-
-	/* Copy 64 bytes blocks */
-	for (; n >= 64; n -= 64) {
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;
-	}
-
-	/* Copy whatever left */
-	rte_mov64((uint8_t *)dst - 64 + n,
-			(const uint8_t *)src - 64 + n);
-
-	return ret;
-}
+extern void *
+rte_memcpy_sse(void *dst, const void *src, size_t n);
 
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
-	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-		return rte_memcpy_aligned(dst, src, n);
+	if (n <= RTE_X86_MEMCPY_THRESH)
+		return rte_memcpy_internal(dst, src, n);
 	else
-		return rte_memcpy_generic(dst, src, n);
+		return (*rte_memcpy_ptr)(dst, src, n);
 }
 
 #ifdef __cplusplus
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
new file mode 100644
index 0000000..63ba628
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
@@ -0,0 +1,966 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCPY_INTERNAL_X86_64_H_
+#define _RTE_MEMCPY_INTERNAL_X86_64_H_
+
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <rte_vect.h>
+#include <rte_common.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+#define ALIGNMENT_MASK 0x3F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_loadu_si512((const void *)src);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+	rte_mov64(dst + 2 * 64, src + 2 * 64);
+	rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1;
+
+	while (n >= 128) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 128;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		src = src + 128;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		dst = dst + 128;
+	}
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+	while (n >= 512) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 512;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+		src = src + 512;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+		dst = dst + 512;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast path when the copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				  (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				  (const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK63:
+		if (n > 64) {
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+			return ret;
+		}
+		if (n > 0)
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
+	 */
+	dstofss = ((uintptr_t)dst & 0x3F);
+	if (dstofss > 0) {
+		dstofss = 64 - dstofss;
+		n -= dstofss;
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 512-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 511;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 128-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 128) {
+		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+	}
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK63;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+#define ALIGNMENT_MASK 0x1F
+
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1, ymm2, ymm3;
+
+	while (n >= 128) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 0 * 32));
+		n -= 128;
+		ymm1 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 1 * 32));
+		ymm2 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 2 * 32));
+		ymm3 = _mm256_loadu_si256((const __m256i *)
+				((const uint8_t *)src + 3 * 32));
+		src = (const uint8_t *)src + 128;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
+		dst = (uint8_t *)dst + 128;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 256 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 256) {
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK31:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+		if (n > 32) {
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 256 bytes
+	 */
+	dstofss = (uintptr_t)dst & 0x1F;
+	if (dstofss > 0) {
+		dstofss = 32 - dstofss;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 128-byte blocks
+	 */
+	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 127;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK31;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+#define ALIGNMENT_MASK 0x0F
+
+/**
+ * SSE & AVX implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
+	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
+	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
+	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
+	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
+	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
+	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
+	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
+}
+
+/**
+ * Macro for copying unaligned block from one location to another with constant
+ * load offset, 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be immediate value within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes
+ *   forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)(		      \
+__extension__ ({							      \
+	int tmp;							      \
+	while (len >= 128 + 16 - offset) {				      \
+		xmm0 = _mm_loadu_si128((const __m128i *)		      \
+			((const uint8_t *)src - offset + 0 * 16));	      \
+		len -= 128;						      \
+		xmm1 = _mm_loadu_si128((const __m128i *)		      \
+			((const uint8_t *)src - offset + 1 * 16));	      \
+		xmm2 = _mm_loadu_si128((const __m128i *)		      \
+			((const uint8_t *)src - offset + 2 * 16));	      \
+		xmm3 = _mm_loadu_si128((const __m128i *)		      \
+			((const uint8_t *)src - offset + 3 * 16));	      \
+		xmm4 = _mm_loadu_si128((const __m128i *)		      \
+			((const uint8_t *)src - offset + 4 * 16));	      \
+		xmm5 = _mm_loadu_si128((const __m128i *)		      \
+			((const uint8_t *)src - offset + 5 * 16));	      \
+		xmm6 = _mm_loadu_si128((const __m128i *)		      \
+			((const uint8_t *)src - offset + 6 * 16));	      \
+		xmm7 = _mm_loadu_si128((const __m128i *)		      \
+			((const uint8_t *)src - offset + 7 * 16));	      \
+		xmm8 = _mm_loadu_si128((const __m128i *)		      \
+			((const uint8_t *)src - offset + 8 * 16));	      \
+		src = (const uint8_t *)src + 128;			      \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16),        \
+			_mm_alignr_epi8(xmm1, xmm0, offset));		      \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16),        \
+			_mm_alignr_epi8(xmm2, xmm1, offset));		      \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16),        \
+			_mm_alignr_epi8(xmm3, xmm2, offset));		      \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16),        \
+			_mm_alignr_epi8(xmm4, xmm3, offset));		      \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16),        \
+			_mm_alignr_epi8(xmm5, xmm4, offset));		      \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16),        \
+			_mm_alignr_epi8(xmm6, xmm5, offset));		      \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16),        \
+			_mm_alignr_epi8(xmm7, xmm6, offset));		      \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16),        \
+			_mm_alignr_epi8(xmm8, xmm7, offset));		      \
+		dst = (uint8_t *)dst + 128;				      \
+	}								      \
+	tmp = len;							      \
+	len = ((len - 16 + offset) & 127) + 16 - offset;		      \
+	tmp -= len;							      \
+	src = (const uint8_t *)src + tmp;				      \
+	dst = (uint8_t *)dst + tmp;					      \
+	if (len >= 32 + 16 - offset) {					      \
+		while (len >= 32 + 16 - offset) {			      \
+			xmm0 = _mm_loadu_si128((const __m128i *)	      \
+				((const uint8_t *)src - offset + 0 * 16));    \
+			len -= 32;					      \
+			xmm1 = _mm_loadu_si128((const __m128i *)	      \
+				((const uint8_t *)src - offset + 1 * 16));    \
+			xmm2 = _mm_loadu_si128((const __m128i *)	      \
+				((const uint8_t *)src - offset + 2 * 16));    \
+			src = (const uint8_t *)src + 32;		      \
+			_mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16),\
+				_mm_alignr_epi8(xmm1, xmm0, offset));	      \
+			_mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16),\
+				_mm_alignr_epi8(xmm2, xmm1, offset));	      \
+			dst = (uint8_t *)dst + 32;			      \
+		}							      \
+		tmp = len;						      \
+		len = ((len - 16 + offset) & 31) + 16 - offset;		      \
+		tmp -= len;						      \
+		src = (const uint8_t *)src + tmp;			      \
+		dst = (uint8_t *)dst + tmp;				      \
+	}								      \
+}))
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value
+ * for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes
+ *   forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be
+ *   pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)(			    \
+__extension__ ({							    \
+	switch (offset) {						    \
+	case 0x01:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01);		    \
+		break;							    \
+	case 0x02:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02);		    \
+		break;							    \
+	case 0x03:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03);		    \
+		break;							    \
+	case 0x04:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04);		    \
+		break;							    \
+	case 0x05:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05);		    \
+		break;							    \
+	case 0x06:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06);		    \
+		break;							    \
+	case 0x07:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07);		    \
+		break;							    \
+	case 0x08:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08);		    \
+		break;							    \
+	case 0x09:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09);		    \
+		break;							    \
+	case 0x0A:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A);		    \
+		break;							    \
+	case 0x0B:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B);		    \
+		break;							    \
+	case 0x0C:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C);		    \
+		break;							    \
+	case 0x0D:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D);		    \
+		break;							    \
+	case 0x0E:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E);		    \
+		break;							    \
+	case 0x0F:							    \
+		MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F);		    \
+		break;							    \
+	default:							    \
+		break;							    \
+	}								    \
+}))
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t srcofs;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 128)
+		goto COPY_BLOCK_128_BACK15;
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov128((uint8_t *)dst + 128,
+					(const uint8_t *)src + 128);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+COPY_BLOCK_255_BACK15:
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK15:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK15:
+		if (n >= 32) {
+			n -= 32;
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 32;
+			dst = (uint8_t *)dst + 32;
+		}
+		if (n > 16) {
+			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov16((uint8_t *)dst - 16 + n,
+					(const uint8_t *)src - 16 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes,
+	 * and make sure the first 15 bytes are copied, because
+	 * unaligned copy functions require up to 15 bytes
+	 * backwards access.
+	 */
+	dstofss = (uintptr_t)dst & 0x0F;
+	if (dstofss > 0) {
+		dstofss = 16 - dstofss + 16;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+	srcofs = ((uintptr_t)src & 0x0F);
+
+	/**
+	 * For aligned copy
+	 */
+	if (srcofs == 0) {
+		/**
+		 * Copy 256-byte blocks
+		 */
+		for (; n >= 256; n -= 256) {
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			dst = (uint8_t *)dst + 256;
+			src = (const uint8_t *)src + 256;
+		}
+
+		/**
+		 * Copy whatever is left
+		 */
+		goto COPY_BLOCK_255_BACK15;
+	}
+
+	/**
+	 * For copy with unaligned load
+	 */
+	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_64_BACK15;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+static inline void *
+rte_memcpy_aligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+
+	/* Copy size < 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return ret;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return ret;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return ret;
+	}
+
+	/* Copy 64-byte blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever is left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+
+	return ret;
+}
+
+static inline void *
+rte_memcpy_internal(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+		return rte_memcpy_aligned(dst, src, n);
+	else
+		return rte_memcpy_generic(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCPY_INTERNAL_X86_64_H_ */
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 21e0b4a..6ee7f23 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -102,6 +102,24 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# for run-time dispatch of memcpy
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+# if the compiler supports AVX512, add avx512 file
+ifneq ($(findstring CC_SUPPORT_AVX512F,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
+# if the compiler supports AVX2, add avx2 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/linuxapp/eal/rte_eal_version.map b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
index fe186cb..c173ccf 100644
--- a/lib/librte_eal/linuxapp/eal/rte_eal_version.map
+++ b/lib/librte_eal/linuxapp/eal/rte_eal_version.map
@@ -247,6 +247,7 @@ DPDK_17.11 {
 	rte_eal_iova_mode;
 	rte_eal_mbuf_default_mempool_ops;
 	rte_lcore_has_role;
+	rte_memcpy_ptr;
 	rte_pci_get_iommu_class;
 	rte_pci_match;
 
diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index a813c91..8a7a1e7 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -134,6 +134,20 @@ endif
 
 MACHINE_CFLAGS += $(addprefix -DRTE_MACHINE_CPUFLAG_,$(CPUFLAGS))
 
+# Check if the compiler supports AVX512
+CC_SUPPORT_AVX512F := $(shell $(CC) -mavx512f -dM -E - < /dev/null 2>&1 | grep -q AVX512 && echo 1)
+ifeq ($(CC_SUPPORT_AVX512F),1)
+ifeq ($(CONFIG_RTE_ENABLE_AVX512),y)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX512F
+endif
+endif
+
+# Check if the compiler supports AVX2
+CC_SUPPORT_AVX2 := $(shell $(CC) -mavx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
+ifeq ($(CC_SUPPORT_AVX2),1)
+MACHINE_CFLAGS += -DCC_SUPPORT_AVX2
+endif
+
 # To strip whitespace
 comma:= ,
 empty:=
-- 
2.7.4
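
A note on the small-copy paths in rte_memcpy_generic() and rte_memcpy_aligned() above: for 16 <= n <= 32 they issue two 16-byte moves, one at the start of the buffer and one ending exactly at byte n, and let the two stores overlap instead of branching on the remainder. The sketch below shows that idea in portable C. It is only an illustration under stated assumptions: mov16() and copy_16_to_32() are hypothetical names, and memcpy() through a temporary stands in for the SSE load/store intrinsics used by rte_mov16().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rte_mov16(): one unaligned 16-byte load
 * followed by one unaligned 16-byte store. */
static inline void
mov16(uint8_t *dst, const uint8_t *src)
{
	uint8_t tmp[16];

	memcpy(tmp, src, 16);
	memcpy(dst, tmp, 16);
}

/* Copy n bytes, valid only for 16 <= n <= 32 and non-overlapping
 * buffers. The second move ends at byte n, so it rewrites up to 16
 * bytes already written by the first move, with identical data. */
void *
copy_16_to_32(void *dst, const void *src, size_t n)
{
	mov16((uint8_t *)dst, (const uint8_t *)src);
	mov16((uint8_t *)dst + (n - 16), (const uint8_t *)src + (n - 16));
	return dst;
}
```

With n = 23, for example, the first move covers bytes 0..15 and the second covers bytes 7..22, so nothing past byte 22 is written. The larger size classes in the patch apply the same trick with 32- and 64-byte moves.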

* [dpdk-dev] [PATCH v8 2/3] app/test: run-time dispatch over memcpy perf test
  2017-10-13  9:01         ` [dpdk-dev] [PATCH v8 " Xiaoyun Li
  2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-13  9:01           ` Xiaoyun Li
  2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
  2017-10-13 13:13           ` [dpdk-dev] [PATCH v8 0/3] run-time Linking support Thomas Monjalon
  3 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-13  9:01 UTC (permalink / raw)
  To: thomas, konstantin.ananyev
  Cc: dev, bruce.richardson, wenzhuo.lu, helin.zhang, Xiaoyun Li

This patch moves the selection of the alignment unit from build time
to run time, based on the CPU flags that the machine supports.
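
For readers following the thread, here is a minimal sketch of the same selection logic outside DPDK. It assumes a GCC- or clang-compatible compiler and uses the compiler builtin __builtin_cpu_supports() in place of rte_cpu_get_flag_enabled(); the function name pick_alignment_unit() is hypothetical and does not appear in the patch.

```c
/* Hypothetical stand-in for init_alignment_unit() in the test: probe
 * the running CPU and return the widest vector width, in bytes, that
 * it supports. Falls back to 16 on non-x86 builds. */
unsigned int
pick_alignment_unit(void)
{
#if defined(__x86_64__) || defined(__i386__)
	/* Required before __builtin_cpu_supports() when called early. */
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f"))
		return 64;	/* 512-bit registers */
	if (__builtin_cpu_supports("avx2"))
		return 32;	/* 256-bit registers */
#endif
	return 16;		/* SSE baseline */
}
```

The result depends on the machine the binary runs on, not on the flags it was built with, which is the point of the change: a binary built for a minimum target still tests with the alignment that matches the memcpy variant chosen at run time.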

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 test/test/test_memcpy_perf.c | 50 ++++++++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 13 deletions(-)

diff --git a/test/test/test_memcpy_perf.c b/test/test/test_memcpy_perf.c
index ff3aaaa..37c4d4c 100644
--- a/test/test/test_memcpy_perf.c
+++ b/test/test/test_memcpy_perf.c
@@ -42,6 +42,7 @@
 #include <rte_malloc.h>
 
 #include <rte_memcpy.h>
+#include <rte_cpuflags.h>
 
 #include "test.h"
 
@@ -79,13 +80,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
-#define ALIGNMENT_UNIT          64
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-#define ALIGNMENT_UNIT          32
-#else /* RTE_MACHINE_CPUFLAG */
-#define ALIGNMENT_UNIT          16
-#endif /* RTE_MACHINE_CPUFLAG */
+static uint8_t alignment_unit = 16;
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
@@ -95,25 +90,53 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 static uint8_t *large_buf_read, *large_buf_write;
 static uint8_t *small_buf_read, *small_buf_write;
 
+/* Initialise alignment_unit based on machine at run-time. */
+static void
+init_alignment_unit(void){
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
+		alignment_unit = 64;
+		return;
+	}
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
+		alignment_unit = 32;
+		return;
+	}
+#endif
+	alignment_unit = 16;
+}
+
 /* Initialise data buffers. */
 static int
 init_buffers(void)
 {
 	unsigned i;
 
-	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	init_alignment_unit();
+
+	large_buf_read = rte_malloc("memcpy",
+				    LARGE_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (large_buf_read == NULL)
 		goto error_large_buf_read;
 
-	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	large_buf_write = rte_malloc("memcpy",
+				     LARGE_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (large_buf_write == NULL)
 		goto error_large_buf_write;
 
-	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_read = rte_malloc("memcpy",
+				    SMALL_BUFFER_SIZE + alignment_unit,
+				    alignment_unit);
 	if (small_buf_read == NULL)
 		goto error_small_buf_read;
 
-	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
+	small_buf_write = rte_malloc("memcpy",
+				     SMALL_BUFFER_SIZE + alignment_unit,
+				     alignment_unit);
 	if (small_buf_write == NULL)
 		goto error_small_buf_write;
 
@@ -153,7 +176,7 @@ static inline size_t
 get_rand_offset(size_t uoffset)
 {
 	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
-			~(ALIGNMENT_UNIT - 1)) + uoffset;
+			~(alignment_unit - 1)) + uoffset;
 }
 
 /* Fill in source and destination addresses. */
@@ -321,7 +344,8 @@ perf_test(void)
 		   "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
 		   "------- -------------- -------------- -------------- --------------");
 
-	printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT);
+	printf("\n========================= %2dB aligned ============================",
+		alignment_unit);
 	/* Do aligned tests where size is a variable */
 	perf_test_variable_aligned();
 	printf("\n------- -------------- -------------- -------------- --------------");
-- 
2.7.4

* [dpdk-dev] [PATCH v8 3/3] efd: run-time dispatch over x86 EFD functions
  2017-10-13  9:01         ` [dpdk-dev] [PATCH v8 " Xiaoyun Li
  2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
  2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
@ 2017-10-13  9:01           ` Xiaoyun Li
  2017-10-13 13:13           ` [dpdk-dev] [PATCH v8 0/3] run-time Linking support Thomas Monjalon
  3 siblings, 0 replies; 88+ messages in thread
From: Xiaoyun Li @ 2017-10-13  9:01 UTC (permalink / raw)
  To: thomas, konstantin.ananyev
  Cc: dev, bruce.richardson, wenzhuo.lu, helin.zhang, Xiaoyun Li

This patch compiles the x86 EFD file only if the compiler supports
AVX2, since the AVX2 lookup function is already chosen at run-time.
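
As an illustration of the general pattern only (the selection code in rte_efd.c is not part of this patch), the sketch below builds one function with AVX2 enabled in isolation and binds a function pointer to it at constructor time when the CPU reports AVX2. It assumes GCC or a clang version that supports the target attribute; every name in it (lookup_fn_t, lookup_scalar, lookup_avx2, lookup_ptr, lookup_init, lookup) is hypothetical, and the function bodies are placeholders rather than the EFD hash.

```c
#include <stdint.h>

typedef uint32_t (*lookup_fn_t)(uint32_t a, uint32_t b);

/* Baseline variant, built with the default machine flags. */
static uint32_t
lookup_scalar(uint32_t a, uint32_t b)
{
	return a + 3 * b;
}

#if defined(__x86_64__) || defined(__i386__)
/* Variant built with AVX2 enabled for this function only, similar in
 * effect to compiling a separate file with -mavx2. */
__attribute__((target("avx2")))
static uint32_t
lookup_avx2(uint32_t a, uint32_t b)
{
	return a + 3 * b;
}
#endif

/* Defaults to the baseline so calls are safe before the constructor. */
static lookup_fn_t lookup_ptr = lookup_scalar;

/* Runs before main(): pick the variant for the running CPU. */
__attribute__((constructor))
static void
lookup_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		lookup_ptr = lookup_avx2;
#endif
}

uint32_t
lookup(uint32_t a, uint32_t b)
{
	return lookup_ptr(a, b);
}
```

Because the AVX2 code is confined to one function (or, in the patch, one object file), the rest of the library can be built for a lower minimum target without the compiler emitting AVX2 instructions elsewhere.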

Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
---
 lib/librte_efd/Makefile      |  6 ++++
 lib/librte_efd/rte_efd_x86.c | 77 ++++++++++++++++++++++++++++++++++++++++++++
 lib/librte_efd/rte_efd_x86.h | 48 ++-------------------------
 3 files changed, 85 insertions(+), 46 deletions(-)
 create mode 100644 lib/librte_efd/rte_efd_x86.c

diff --git a/lib/librte_efd/Makefile b/lib/librte_efd/Makefile
index b9277bc..35bb2bd 100644
--- a/lib/librte_efd/Makefile
+++ b/lib/librte_efd/Makefile
@@ -44,6 +44,12 @@ LIBABIVER := 1
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_EFD) := rte_efd.c
 
+# if the compiler supports AVX2, add efd x86 file
+ifneq ($(findstring CC_SUPPORT_AVX2,$(MACHINE_CFLAGS)),)
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_efd_x86.c
+CFLAGS_rte_efd_x86.o += -mavx2
+endif
+
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_EFD)-include := rte_efd.h
 
diff --git a/lib/librte_efd/rte_efd_x86.c b/lib/librte_efd/rte_efd_x86.c
new file mode 100644
index 0000000..49677db
--- /dev/null
+++ b/lib/librte_efd/rte_efd_x86.c
@@ -0,0 +1,77 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2016-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+/* rte_efd_x86.c
+ * This file holds all x86 specific EFD functions
+ */
+#include <rte_efd.h>
+#include <rte_efd_x86.h>
+
+#if (RTE_EFD_VALUE_NUM_BITS == 8 || RTE_EFD_VALUE_NUM_BITS == 16 || \
+	RTE_EFD_VALUE_NUM_BITS == 24 || RTE_EFD_VALUE_NUM_BITS == 32)
+#define EFD_LOAD_SI128(val) _mm_load_si128(val)
+#else
+#define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
+#endif
+
+efd_value_t
+efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
+		const efd_lookuptbl_t *group_lookup_table,
+		const uint32_t hash_val_a, const uint32_t hash_val_b)
+{
+	efd_value_t value = 0;
+	uint32_t i = 0;
+	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
+	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
+
+	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
+		__m256i vhash_idx =
+				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
+				(__m128i const *) &group_hash_idx[i]));
+		__m256i vlookup_table = _mm256_cvtepu16_epi32(
+				EFD_LOAD_SI128((__m128i const *)
+				&group_lookup_table[i]));
+		__m256i vhash = _mm256_add_epi32(vhash_val_a,
+				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
+		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
+				EFD_LOOKUPTBL_SHIFT);
+		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
+				vbucket_idx);
+
+		value |= (_mm256_movemask_ps(
+			(__m256) _mm256_slli_epi32(vresult, 31))
+			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
+	}
+
+	return value;
+}
diff --git a/lib/librte_efd/rte_efd_x86.h b/lib/librte_efd/rte_efd_x86.h
index 34f37d7..7a082aa 100644
--- a/lib/librte_efd/rte_efd_x86.h
+++ b/lib/librte_efd/rte_efd_x86.h
@@ -36,51 +36,7 @@
  */
 #include <immintrin.h>
 
-#if (RTE_EFD_VALUE_NUM_BITS == 8 || RTE_EFD_VALUE_NUM_BITS == 16 || \
-	RTE_EFD_VALUE_NUM_BITS == 24 || RTE_EFD_VALUE_NUM_BITS == 32)
-#define EFD_LOAD_SI128(val) _mm_load_si128(val)
-#else
-#define EFD_LOAD_SI128(val) _mm_lddqu_si128(val)
-#endif
-
-static inline efd_value_t
+extern efd_value_t
 efd_lookup_internal_avx2(const efd_hashfunc_t *group_hash_idx,
 		const efd_lookuptbl_t *group_lookup_table,
-		const uint32_t hash_val_a, const uint32_t hash_val_b)
-{
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
-	efd_value_t value = 0;
-	uint32_t i = 0;
-	__m256i vhash_val_a = _mm256_set1_epi32(hash_val_a);
-	__m256i vhash_val_b = _mm256_set1_epi32(hash_val_b);
-
-	for (; i < RTE_EFD_VALUE_NUM_BITS; i += 8) {
-		__m256i vhash_idx =
-				_mm256_cvtepu16_epi32(EFD_LOAD_SI128(
-				(__m128i const *) &group_hash_idx[i]));
-		__m256i vlookup_table = _mm256_cvtepu16_epi32(
-				EFD_LOAD_SI128((__m128i const *)
-				&group_lookup_table[i]));
-		__m256i vhash = _mm256_add_epi32(vhash_val_a,
-				_mm256_mullo_epi32(vhash_idx, vhash_val_b));
-		__m256i vbucket_idx = _mm256_srli_epi32(vhash,
-				EFD_LOOKUPTBL_SHIFT);
-		__m256i vresult = _mm256_srlv_epi32(vlookup_table,
-				vbucket_idx);
-
-		value |= (_mm256_movemask_ps(
-			(__m256) _mm256_slli_epi32(vresult, 31))
-			& ((1 << (RTE_EFD_VALUE_NUM_BITS - i)) - 1)) << i;
-	}
-
-	return value;
-#else
-	RTE_SET_USED(group_hash_idx);
-	RTE_SET_USED(group_lookup_table);
-	RTE_SET_USED(hash_val_a);
-	RTE_SET_USED(hash_val_b);
-	/* Return dummy value, only to avoid compilation breakage */
-	return 0;
-#endif
-
-}
+		const uint32_t hash_val_a, const uint32_t hash_val_b);
-- 
2.7.4


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
@ 2017-10-13  9:28             ` Thomas Monjalon
  2017-10-13 10:26               ` Ananyev, Konstantin
  2017-10-17 21:24             ` Thomas Monjalon
  1 sibling, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-13  9:28 UTC (permalink / raw)
  To: Xiaoyun Li, konstantin.ananyev
  Cc: dev, bruce.richardson, wenzhuo.lu, helin.zhang

13/10/2017 11:01, Xiaoyun Li:
>  lib/librte_eal/common/arch/x86/rte_memcpy.c        |  59 ++
>  lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c   |  44 +
>  .../common/arch/x86/rte_memcpy_avx512f.c           |  44 +
>  lib/librte_eal/common/arch/x86/rte_memcpy_sse.c    |  40 +
>  .../common/include/arch/x86/rte_memcpy.h           | 861 +-----------------
>  .../common/include/arch/x86/rte_memcpy_internal.h  | 966 +++++++++++++++++++++

I think that rte_memcpy_internal.h should not be in the include directory.
Can it be moved to lib/librte_eal/common/arch/ ?

> --- a/lib/librte_eal/bsdapp/eal/rte_eal_version.map
> +++ b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
> @@ -243,6 +243,7 @@ DPDK_17.11 {
>  	rte_eal_iova_mode;
>  	rte_eal_mbuf_default_mempool_ops;
>  	rte_lcore_has_role;
> +	rte_memcpy_ptr;

I don't know what the consequence is of adding this symbol to the .map
file for architectures where it does not exist.


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-13  9:28             ` Thomas Monjalon
@ 2017-10-13 10:26               ` Ananyev, Konstantin
  0 siblings, 0 replies; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-13 10:26 UTC (permalink / raw)
  To: Thomas Monjalon, Li, Xiaoyun
  Cc: dev, Richardson, Bruce, Lu, Wenzhuo, Zhang, Helin



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Friday, October 13, 2017 10:29 AM
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Cc: dev@dpdk.org; Richardson, Bruce <bruce.richardson@intel.com>; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
> 
> 13/10/2017 11:01, Xiaoyun Li:
> >  lib/librte_eal/common/arch/x86/rte_memcpy.c        |  59 ++
> >  lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c   |  44 +
> >  .../common/arch/x86/rte_memcpy_avx512f.c           |  44 +
> >  lib/librte_eal/common/arch/x86/rte_memcpy_sse.c    |  40 +
> >  .../common/include/arch/x86/rte_memcpy.h           | 861 +-----------------
> >  .../common/include/arch/x86/rte_memcpy_internal.h  | 966 +++++++++++++++++++++
> 
> I think that rte_memcpy_internal.h should not be in the include directory.
> Can it be moved to lib/librte_eal/common/arch/ ?

I am afraid we can't: for sizes up to 128 bytes we still use the inline version of memcpy
to avoid a performance regression.
So that file still needs to stay in the include dir.

> 
> > --- a/lib/librte_eal/bsdapp/eal/rte_eal_version.map
> > +++ b/lib/librte_eal/bsdapp/eal/rte_eal_version.map
> > @@ -243,6 +243,7 @@ DPDK_17.11 {
> >  	rte_eal_iova_mode;
> >  	rte_eal_mbuf_default_mempool_ops;
> >  	rte_lcore_has_role;
> > +	rte_memcpy_ptr;
> 
> I don't know what is the consequence of adding this function in the .map
> file for architectures where it does not exist?

I don't have an arm/ppc box to try...
Though I tried adding a non-existent function name to
lib/librte_eal/linuxapp/eal/rte_eal_version.map
and didn't encounter any problems.
So my guess is that it is harmless.
Konstantin


* Re: [dpdk-dev] [PATCH v8 0/3] run-time Linking support
  2017-10-13  9:01         ` [dpdk-dev] [PATCH v8 " Xiaoyun Li
                             ` (2 preceding siblings ...)
  2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
@ 2017-10-13 13:13           ` Thomas Monjalon
  3 siblings, 0 replies; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-13 13:13 UTC (permalink / raw)
  To: Xiaoyun Li
  Cc: dev, konstantin.ananyev, bruce.richardson, wenzhuo.lu, helin.zhang

13/10/2017 11:01, Xiaoyun Li:
> This patchset dynamically selects functions at run-time based on CPU flags
> that current machine supports.This patchset modifies mempcy, memcpy perf
> test and x86 EFD, using function pointers and bind them at constructor time.
> Then in the cloud environment, users can compiler once for the minimum target
> such as 'haswell'(not 'native') and run on different platforms (equal or above
> haswell) and can get ISA optimization based on running CPU.
> 
> Xiaoyun Li (3):
>   eal/x86: run-time dispatch over memcpy
>   app/test: run-time dispatch over memcpy perf test
>   efd: run-time dispatch over x86 EFD functions

Applied, thanks


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
  2017-10-13  9:28             ` Thomas Monjalon
@ 2017-10-17 21:24             ` Thomas Monjalon
  2017-10-18  2:21               ` Li, Xiaoyun
  1 sibling, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-17 21:24 UTC (permalink / raw)
  To: Xiaoyun Li, konstantin.ananyev, bruce.richardson
  Cc: dev, wenzhuo.lu, helin.zhang, ophirmu

Hi,

13/10/2017 11:01, Xiaoyun Li:
> This patch dynamically selects functions of memcpy at run-time based
> on CPU flags that current machine supports. This patch uses function
> pointers which are bind to the relative functions at constrctor time.
> In addition, AVX512 instructions set would be compiled only if users
> config it enabled and the compiler supports it.
> 
> Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> ---
Keeping only the major changes of the patch for later discussions:
[...]
>  static inline void *
>  rte_memcpy(void *dst, const void *src, size_t n)
>  {
> -	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> -		return rte_memcpy_aligned(dst, src, n);
> +	if (n <= RTE_X86_MEMCPY_THRESH)
> +		return rte_memcpy_internal(dst, src, n);
>  	else
> -		return rte_memcpy_generic(dst, src, n);
> +		return (*rte_memcpy_ptr)(dst, src, n);
>  }
[...]
> +static inline void *
> +rte_memcpy_internal(void *dst, const void *src, size_t n)
> +{
> +	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> +		return rte_memcpy_aligned(dst, src, n);
> +	else
> +		return rte_memcpy_generic(dst, src, n);
> +}

The significant change of this patch is to call a function pointer
for packet size > 128 (RTE_X86_MEMCPY_THRESH).

Please could you provide some benchmark numbers?

From a test done at Mellanox, there might be a performance degradation
of about 15% in testpmd txonly with AVX2.
Is there someone else seeing a performance degradation?
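
For readers following the thread, the dispatch pattern under discussion can be sketched roughly as below. This is a minimal illustration, not the patch itself: the names demo_memcpy, demo_memcpy_ptr, demo_memcpy_sse, demo_memcpy_avx2 and DEMO_MEMCPY_THRESH are hypothetical stand-ins, and both "variants" simply call libc memcpy. It assumes GCC or clang, which provide __attribute__((constructor)) and __builtin_cpu_supports().

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical threshold, standing in for RTE_X86_MEMCPY_THRESH. */
#define DEMO_MEMCPY_THRESH 128

typedef void *(*demo_memcpy_t)(void *dst, const void *src, size_t n);

/*
 * Stand-ins for the ISA-specific variants. In a real build each variant
 * would live in its own file compiled with different -m flags; here both
 * just forward to libc memcpy.
 */
static void *
demo_memcpy_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *
demo_memcpy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Function pointer used for large copies; defaults to the baseline. */
static demo_memcpy_t demo_memcpy_ptr = demo_memcpy_sse;

/* Runs before main(): bind the pointer according to the running CPU. */
__attribute__((constructor))
static void
demo_memcpy_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		demo_memcpy_ptr = demo_memcpy_avx2;
#endif
}

static inline void *
demo_memcpy(void *dst, const void *src, size_t n)
{
	/* Small copies stay on the inline path (placeholder: libc memcpy). */
	if (n <= DEMO_MEMCPY_THRESH)
		return memcpy(dst, src, n);
	/* Larger copies pay for one indirect call. */
	return (*demo_memcpy_ptr)(dst, src, n);
}
```

The indirect call on the large-copy path is the cost being questioned here; moving DEMO_MEMCPY_THRESH up or down changes which sizes pay for it.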


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-17 21:24             ` Thomas Monjalon
@ 2017-10-18  2:21               ` Li, Xiaoyun
  2017-10-18  6:22                 ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-18  2:21 UTC (permalink / raw)
  To: Thomas Monjalon, Ananyev, Konstantin, Richardson, Bruce
  Cc: dev, Lu, Wenzhuo, Zhang, Helin, ophirmu

Hi

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, October 18, 2017 05:24
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; ophirmu@mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> Hi,
> 
> 13/10/2017 11:01, Xiaoyun Li:
> > This patch dynamically selects functions of memcpy at run-time based
> > on CPU flags that current machine supports. This patch uses function
> > pointers which are bind to the relative functions at constrctor time.
> > In addition, AVX512 instructions set would be compiled only if users
> > config it enabled and the compiler supports it.
> >
> > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > ---
> Keeping only the major changes of the patch for later discussions:
> [...]
> >  static inline void *
> >  rte_memcpy(void *dst, const void *src, size_t n)  {
> > -	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> > -		return rte_memcpy_aligned(dst, src, n);
> > +	if (n <= RTE_X86_MEMCPY_THRESH)
> > +		return rte_memcpy_internal(dst, src, n);
> >  	else
> > -		return rte_memcpy_generic(dst, src, n);
> > +		return (*rte_memcpy_ptr)(dst, src, n);
> >  }
> [...]
> > +static inline void *
> > +rte_memcpy_internal(void *dst, const void *src, size_t n) {
> > +	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> > +		return rte_memcpy_aligned(dst, src, n);
> > +	else
> > +		return rte_memcpy_generic(dst, src, n); }
> 
> The significant change of this patch is to call a function pointer for packet
> size > 128 (RTE_X86_MEMCPY_THRESH). 
The perf drop is due to a function call replacing the inline code.

> Please could you provide some benchmark numbers?
I ran memcpy_perf_test, which shows the time cost of memcpy, on Broadwell with SSE and AVX2.
At first I only plotted the results and looked at the trend; I did not compute exact percentages. Sorry about that.
The plot covers copy sizes of 2, 4, 6, 8, 9, 12, 16, 32, 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024, 1518, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192 bytes.
In my test, the drop shrinks as the size grows. (Copy time is used as the perf indicator.)
From the trend plot, when the size is smaller than 128 bytes, the perf drops a lot, almost 50%. Above 128 bytes, it approaches the original DPDK.
I have now computed it: for sizes greater than 128 bytes and smaller than 1024 bytes, the perf drops about 15%. Above 1024 bytes, the perf drops about 4%.
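
As a small aid for reading such figures, the percentage drop can be derived from two cycle counts as sketched below. The helper name slowdown_percent is hypothetical, and it assumes "perf drop" means the relative increase in copy cycles between the original and the patched build, which is one plausible reading of the numbers above.

```c
/*
 * Relative slowdown in percent between two cycle counts.
 * Positive means the patched build needs more cycles per copy.
 * Returns 0.0 when the baseline is not positive, to avoid dividing by zero.
 */
static double
slowdown_percent(double cycles_before, double cycles_after)
{
	if (cycles_before <= 0.0)
		return 0.0;
	return (cycles_after - cycles_before) * 100.0 / cycles_before;
}
```

With made-up inputs of 100 cycles before and 115 cycles after, the helper reports a slowdown of about 15%.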

> From a test done at Mellanox, there might be a performance degradation of
> about 15% in testpmd txonly with AVX2.
> Is there someone else seeing a performance degradation?


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-18  2:21               ` Li, Xiaoyun
@ 2017-10-18  6:22                 ` Li, Xiaoyun
  2017-10-19  2:45                   ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-18  6:22 UTC (permalink / raw)
  To: Li, Xiaoyun, Thomas Monjalon, Ananyev, Konstantin, Richardson, Bruce
  Cc: dev, Lu, Wenzhuo, Zhang, Helin, ophirmu



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Li, Xiaoyun
> Sent: Wednesday, October 18, 2017 10:22
> To: Thomas Monjalon <thomas@monjalon.net>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; ophirmu@mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> Hi
> 
> > -----Original Message-----
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Wednesday, October 18, 2017 05:24
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>
> > Cc: dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > memcpy
> >
> > Hi,
> >
> > 13/10/2017 11:01, Xiaoyun Li:
> > > This patch dynamically selects functions of memcpy at run-time based
> > > on CPU flags that current machine supports. This patch uses function
> > > pointers which are bind to the relative functions at constrctor time.
> > > In addition, AVX512 instructions set would be compiled only if users
> > > config it enabled and the compiler supports it.
> > >
> > > Signed-off-by: Xiaoyun Li <xiaoyun.li@intel.com>
> > > ---
> > Keeping only the major changes of the patch for later discussions:
> > [...]
> > >  static inline void *
> > >  rte_memcpy(void *dst, const void *src, size_t n)  {
> > > -	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> > > -		return rte_memcpy_aligned(dst, src, n);
> > > +	if (n <= RTE_X86_MEMCPY_THRESH)
> > > +		return rte_memcpy_internal(dst, src, n);
> > >  	else
> > > -		return rte_memcpy_generic(dst, src, n);
> > > +		return (*rte_memcpy_ptr)(dst, src, n);
> > >  }
> > [...]
> > > +static inline void *
> > > +rte_memcpy_internal(void *dst, const void *src, size_t n) {
> > > +	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
> > > +		return rte_memcpy_aligned(dst, src, n);
> > > +	else
> > > +		return rte_memcpy_generic(dst, src, n); }
> >
> > The significant change of this patch is to call a function pointer for
> > packet size > 128 (RTE_X86_MEMCPY_THRESH).
> The perf drop is due to function call replacing inline.
> 
> > Please could you provide some benchmark numbers?
> I ran memcpy_perf_test which would show the time cost of memcpy. I ran it
> on broadwell with sse and avx2.
> But I just draw pictures and looked at the trend not computed the exact
> percentage. Sorry about that.
> The picture shows results of copy size of 2, 4, 6, 8, 9, 12, 16, 32, 64, 128, 192,
> 256, 320, 384, 448, 512, 768, 1024, 1518, 1522, 1536, 1600, 2048, 2560, 3072,
> 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192.
> In my test, the size grows, the drop degrades. (Using copy time indicates the
> perf.) From the trend picture, when the size is smaller than 128 bytes, the
> perf drops a lot, almost 50%. And above 128 bytes, it approaches the original
> dpdk.
> I computed it right now, it shows that when greater than 128 bytes and
> smaller than 1024 bytes, the perf drops about 15%. When above 1024 bytes,
> the perf drops about 4%.
> 
> > From a test done at Mellanox, there might be a performance degradation
> > of about 15% in testpmd txonly with AVX2.

One more thing: I will test testpmd txonly with Intel and Mellanox NICs in the next few days,
and try adjusting RTE_X86_MEMCPY_THRESH to see if there is any improvement.

> > Is there someone else seeing a performance degradation?


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-18  6:22                 ` Li, Xiaoyun
@ 2017-10-19  2:45                   ` Li, Xiaoyun
  2017-10-19  6:58                     ` Thomas Monjalon
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-19  2:45 UTC (permalink / raw)
  To: Thomas Monjalon, Ananyev, Konstantin, Richardson, Bruce
  Cc: dev, Lu, Wenzhuo, Zhang, Helin, ophirmu

Hi
> > >
> > > The significant change of this patch is to call a function pointer
> > > for packet size > 128 (RTE_X86_MEMCPY_THRESH).
> > The perf drop is due to function call replacing inline.
> >
> > > Please could you provide some benchmark numbers?
> > I ran memcpy_perf_test which would show the time cost of memcpy. I ran
> > it on broadwell with sse and avx2.
> > But I just draw pictures and looked at the trend not computed the
> > exact percentage. Sorry about that.
> > The picture shows results of copy size of 2, 4, 6, 8, 9, 12, 16, 32,
> > 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024, 1518, 1522, 1536,
> > 1600, 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168,
> 7680, 8192.
> > In my test, the size grows, the drop degrades. (Using copy time
> > indicates the
> > perf.) From the trend picture, when the size is smaller than 128
> > bytes, the perf drops a lot, almost 50%. And above 128 bytes, it
> > approaches the original dpdk.
> > I computed it right now, it shows that when greater than 128 bytes and
> > smaller than 1024 bytes, the perf drops about 15%. When above 1024
> > bytes, the perf drops about 4%.
> >
> > > From a test done at Mellanox, there might be a performance
> > > degradation of about 15% in testpmd txonly with AVX2.
> 

I ran tests on X710, XXV710, X540 and MT27710 but didn't see any performance degradation.

I used the command "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -i" and set fwd txonly.
I tested on v17.11-rc1, then reverted my patch and tested again.
I ran "show port stats all" and compared the throughput in pps. The results are similar, with no drop.

Did I miss something?

> Another thing, I will test testpmd txonly with intel nics and mellanox these
> days.
> And try adjusting the RTE_X86_MEMCPY_THRESH to see if there is any
> improvement.
> 
> > > Is there someone else seeing a performance degradation?


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-19  2:45                   ` Li, Xiaoyun
@ 2017-10-19  6:58                     ` Thomas Monjalon
  2017-10-19  7:51                       ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-19  6:58 UTC (permalink / raw)
  To: Li, Xiaoyun
  Cc: Ananyev, Konstantin, Richardson, Bruce, dev, Lu, Wenzhuo, Zhang,
	Helin, ophirmu

19/10/2017 04:45, Li, Xiaoyun:
> Hi
> > > >
> > > > The significant change of this patch is to call a function pointer
> > > > for packet size > 128 (RTE_X86_MEMCPY_THRESH).
> > > The perf drop is due to function call replacing inline.
> > >
> > > > Please could you provide some benchmark numbers?
> > > I ran memcpy_perf_test which would show the time cost of memcpy. I ran
> > > it on broadwell with sse and avx2.
> > > But I just draw pictures and looked at the trend not computed the
> > > exact percentage. Sorry about that.
> > > The picture shows results of copy size of 2, 4, 6, 8, 9, 12, 16, 32,
> > > 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024, 1518, 1522, 1536,
> > > 1600, 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168,
> > 7680, 8192.
> > > In my test, the size grows, the drop degrades. (Using copy time
> > > indicates the
> > > perf.) From the trend picture, when the size is smaller than 128
> > > bytes, the perf drops a lot, almost 50%. And above 128 bytes, it
> > > approaches the original dpdk.
> > > I computed it right now, it shows that when greater than 128 bytes and
> > > smaller than 1024 bytes, the perf drops about 15%. When above 1024
> > > bytes, the perf drops about 4%.
> > >
> > > > From a test done at Mellanox, there might be a performance
> > > > degradation of about 15% in testpmd txonly with AVX2.
> > 
> 
> I did tests on X710, XXV710, X540 and MT27710 but didn't see performance degradation.
> 
> I used command "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -I" and set fwd txonly. 
> I tested it on v17.11-rc1, then revert my patch and tested it again.
> Show port stats all and see the throughput pps. But the results are similar and no drop.
> 
> Did I miss something?

I do not understand. Yesterday you confirmed a 15% drop with buffers between
128 and 1024 bytes.
But you do not see this drop in your txonly tests, right?

> > Another thing, I will test testpmd txonly with intel nics and mellanox these
> > days.
> > And try adjusting the RTE_X86_MEMCPY_THRESH to see if there is any
> > improvement.
> > 
> > > > Is there someone else seeing a performance degradation?


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-19  6:58                     ` Thomas Monjalon
@ 2017-10-19  7:51                       ` Li, Xiaoyun
  2017-10-19  8:33                         ` Thomas Monjalon
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-19  7:51 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Ananyev, Konstantin, Richardson, Bruce, dev, Lu, Wenzhuo, Zhang,
	Helin, ophirmu



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, October 19, 2017 14:59
> To: Li, Xiaoyun <xiaoyun.li@intel.com>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> ophirmu@mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 19/10/2017 04:45, Li, Xiaoyun:
> > Hi
> > > > >
> > > > > The significant change of this patch is to call a function
> > > > > pointer for packet size > 128 (RTE_X86_MEMCPY_THRESH).
> > > > The perf drop is due to function call replacing inline.
> > > >
> > > > > Please could you provide some benchmark numbers?
> > > > I ran memcpy_perf_test which would show the time cost of memcpy. I
> > > > ran it on broadwell with sse and avx2.
> > > > But I just draw pictures and looked at the trend not computed the
> > > > exact percentage. Sorry about that.
> > > > The picture shows results of copy size of 2, 4, 6, 8, 9, 12, 16,
> > > > 32, 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024, 1518, 1522,
> > > > 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144,
> > > > 6656, 7168,
> > > 7680, 8192.
> > > > In my test, the size grows, the drop degrades. (Using copy time
> > > > indicates the
> > > > perf.) From the trend picture, when the size is smaller than 128
> > > > bytes, the perf drops a lot, almost 50%. And above 128 bytes, it
> > > > approaches the original dpdk.
> > > > I computed it right now, it shows that when greater than 128 bytes
> > > > and smaller than 1024 bytes, the perf drops about 15%. When above
> > > > 1024 bytes, the perf drops about 4%.
> > > >
> > > > > From a test done at Mellanox, there might be a performance
> > > > > degradation of about 15% in testpmd txonly with AVX2.
> > >
> >
> > I did tests on X710, XXV710, X540 and MT27710 but didn't see
> performance degradation.
> >
> > I used command "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -
> I" and set fwd txonly.
> > I tested it on v17.11-rc1, then revert my patch and tested it again.
> > Show port stats all and see the throughput pps. But the results are similar
> and no drop.
> >
> > Did I miss something?
> 
> I do not understand. Yesterday you confirmed a 15% drop with buffers
> between
> 128 and 1024 bytes.
> But you do not see this drop in your txonly tests, right?
> 
Yes. The drop shows up in the unit test application.
I built it with "make test -j", started " ./build/app/test -c f -n 4 ",
and then ran "memcpy_perf_autotest".
The results are the cycles that a memory copy costs.
But I only use it to show the trend, because I heard that micro-benchmarks like test_memcpy_perf are not recommended for memcpy performance reports, as they are unlikely to reflect the performance of real-world applications.
Details can be seen at https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk

I didn't see a drop in the testpmd txonly test. Maybe that is because it does not make many memcpy calls.

> > > Another thing, I will test testpmd txonly with intel nics and
> > > mellanox these days.
> > > And try adjusting the RTE_X86_MEMCPY_THRESH to see if there is any
> > > improvement.
> > >
> > > > > Is there someone else seeing a performance degradation?
> 
> 


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-19  7:51                       ` Li, Xiaoyun
@ 2017-10-19  8:33                         ` Thomas Monjalon
  2017-10-19  8:50                           ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-19  8:33 UTC (permalink / raw)
  To: Li, Xiaoyun
  Cc: Ananyev, Konstantin, Richardson, Bruce, dev, Lu, Wenzhuo, Zhang,
	Helin, ophirmu

19/10/2017 09:51, Li, Xiaoyun:
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > 19/10/2017 04:45, Li, Xiaoyun:
> > > Hi
> > > > > >
> > > > > > The significant change of this patch is to call a function
> > > > > > pointer for packet size > 128 (RTE_X86_MEMCPY_THRESH).
> > > > > The perf drop is due to function call replacing inline.
> > > > >
> > > > > > Please could you provide some benchmark numbers?
> > > > > I ran memcpy_perf_test which would show the time cost of memcpy. I
> > > > > ran it on broadwell with sse and avx2.
> > > > > But I just draw pictures and looked at the trend not computed the
> > > > > exact percentage. Sorry about that.
> > > > > The picture shows results of copy size of 2, 4, 6, 8, 9, 12, 16,
> > > > > 32, 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024, 1518, 1522,
> > > > > 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144,
> > > > > 6656, 7168,
> > > > 7680, 8192.
> > > > > In my test, the size grows, the drop degrades. (Using copy time
> > > > > indicates the
> > > > > perf.) From the trend picture, when the size is smaller than 128
> > > > > bytes, the perf drops a lot, almost 50%. And above 128 bytes, it
> > > > > approaches the original dpdk.
> > > > > I computed it right now, it shows that when greater than 128 bytes
> > > > > and smaller than 1024 bytes, the perf drops about 15%. When above
> > > > > 1024 bytes, the perf drops about 4%.
> > > > >
> > > > > > From a test done at Mellanox, there might be a performance
> > > > > > degradation of about 15% in testpmd txonly with AVX2.
> > > >
> > >
> > > I did tests on X710, XXV710, X540 and MT27710 but didn't see
> > performance degradation.
> > >
> > > I used command "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -
> > I" and set fwd txonly.
> > > I tested it on v17.11-rc1, then revert my patch and tested it again.
> > > Show port stats all and see the throughput pps. But the results are similar
> > and no drop.
> > >
> > > Did I miss something?
> > 
> > I do not understand. Yesterday you confirmed a 15% drop with buffers
> > between
> > 128 and 1024 bytes.
> > But you do not see this drop in your txonly tests, right?
> > 
> Yes. The drop is using test.
> Using command "make test -j" and then " ./build/app/test -c f -n 4 " 
> Then run "memcpy_perf_autotest"
> The results are the cycles that memory copy costs.
> But I just use it to show the trend because I heard that it's not recommended to use micro benchmarks like test_memcpy_perf for memcpy performance report as they aren't likely able to reflect performance of real world applications.

Yes, real applications can hide the memcpy cost.
Sometimes, the cost appears for real :)

> Details can be seen at https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk
> 
> And I didn't see drop in testpmd txonly test. Maybe it's because not a lot memcpy calls.

It has been seen in a mlx4 use-case using more memcpy.
I think 15% in a micro-benchmark is too much.
What can we do? Raise the threshold?

> > > > Another thing, I will test testpmd txonly with intel nics and
> > > > mellanox these days.
> > > > And try adjusting the RTE_X86_MEMCPY_THRESH to see if there is any
> > > > improvement.
> > > >
> > > > > > Is there someone else seeing a performance degradation?


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-19  8:33                         ` Thomas Monjalon
@ 2017-10-19  8:50                           ` Li, Xiaoyun
  2017-10-19  8:59                             ` Ananyev, Konstantin
  2017-10-19  9:00                             ` Thomas Monjalon
  0 siblings, 2 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-19  8:50 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Ananyev, Konstantin, Richardson, Bruce, dev, Lu, Wenzhuo, Zhang,
	Helin, ophirmu



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, October 19, 2017 16:34
> To: Li, Xiaoyun <xiaoyun.li@intel.com>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> ophirmu@mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 19/10/2017 09:51, Li, Xiaoyun:
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > Hi
> > > > > > >
> > > > > > > The significant change of this patch is to call a function
> > > > > > > pointer for packet size > 128 (RTE_X86_MEMCPY_THRESH).
> > > > > > The perf drop is due to function call replacing inline.
> > > > > >
> > > > > > > Please could you provide some benchmark numbers?
> > > > > > I ran memcpy_perf_test which would show the time cost of
> > > > > > memcpy. I ran it on broadwell with sse and avx2.
> > > > > > But I just draw pictures and looked at the trend not computed
> > > > > > the exact percentage. Sorry about that.
> > > > > > The picture shows results of copy size of 2, 4, 6, 8, 9, 12,
> > > > > > 16, 32, 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024,
> > > > > > 1518, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> > > > > > 5120, 5632, 6144, 6656, 7168,
> > > > > 7680, 8192.
> > > > > > In my test, the size grows, the drop degrades. (Using copy
> > > > > > time indicates the
> > > > > > perf.) From the trend picture, when the size is smaller than
> > > > > > 128 bytes, the perf drops a lot, almost 50%. And above 128
> > > > > > bytes, it approaches the original dpdk.
> > > > > > I computed it right now, it shows that when greater than 128
> > > > > > bytes and smaller than 1024 bytes, the perf drops about 15%.
> > > > > > When above
> > > > > > 1024 bytes, the perf drops about 4%.
> > > > > >
> > > > > > > From a test done at Mellanox, there might be a performance
> > > > > > > degradation of about 15% in testpmd txonly with AVX2.
> > > > >
> > > >
> > > > I did tests on X710, XXV710, X540 and MT27710 but didn't see
> > > performance degradation.
> > > >
> > > > I used command "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n
> > > > 4 -- -
> > > I" and set fwd txonly.
> > > > I tested it on v17.11-rc1, then revert my patch and tested it again.
> > > > Show port stats all and see the throughput pps. But the results
> > > > are similar
> > > and no drop.
> > > >
> > > > Did I miss something?
> > >
> > > I do not understand. Yesterday you confirmed a 15% drop with buffers
> > > between
> > > 128 and 1024 bytes.
> > > But you do not see this drop in your txonly tests, right?
> > >
> > Yes. The drop is using test.
> > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > Then run "memcpy_perf_autotest"
> > The results are the cycles that memory copy costs.
> > But I just use it to show the trend because I heard that it's not
> recommended to use micro benchmarks like test_memcpy_perf for memcpy
> performance report as they aren't likely able to reflect performance of real
> world applications.
> 
> Yes real applications can hide the memcpy cost.
> Sometimes, the cost appear for real :)
> 
> > Details can be seen at
> > https://software.intel.com/en-us/articles/performance-optimization-of-
> > memcpy-in-dpdk
> >
> > And I didn't see drop in testpmd txonly test. Maybe it's because not a lot
> memcpy calls.
> 
> It has been seen in a mlx4 use-case using more memcpy.
> I think 15% in micro-benchmark is too much.
> What can we do? Raise the threshold?
> 
I think so. If there is a big drop, we can try raising the threshold, maybe to 1024, but I am not sure.
But I could not reproduce the 15% drop on Mellanox, and I am not sure how to verify it.
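
For readers following the threshold discussion, here is a minimal sketch of the size-based dispatch being debated. It is not the code from the patch: SKETCH_MEMCPY_THRESH, sketch_memcpy() and the helper names are hypothetical stand-ins, and the "large" path simply wraps memcpy() where the patch would bind an ISA-specific routine at constructor time.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for RTE_X86_MEMCPY_THRESH (128 in the thread). */
#define SKETCH_MEMCPY_THRESH 128

typedef void *(*sketch_copy_fn)(void *dst, const void *src, size_t n);

/* Placeholder for an ISA-specific routine (SSE/AVX2/AVX512 in the patch). */
static void *sketch_copy_large(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Bound once at start-up; the patch selects the target from CPU flags. */
static sketch_copy_fn sketch_copy_ptr = sketch_copy_large;

/* Small copies stay inline, so they pay no indirect call. */
static inline void *sketch_copy_small(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

static inline void *sketch_memcpy(void *dst, const void *src, size_t n)
{
	if (n <= SKETCH_MEMCPY_THRESH)
		return sketch_copy_small(dst, src, n);
	/* Sizes above the threshold go through the function pointer. */
	return sketch_copy_ptr(dst, src, n);
}
```

Raising SKETCH_MEMCPY_THRESH to 1024 would keep every copy up to 1024 bytes on the inline path, which is the change proposed above.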

> > > > > Another thing, I will test testpmd txonly with intel nics and
> > > > > mellanox these days.
> > > > > And try adjusting the RTE_X86_MEMCPY_THRESH to see if there is
> > > > > any improvement.
> > > > >
> > > > > > > Is there someone else seeing a performance degradation?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-19  8:50                           ` Li, Xiaoyun
@ 2017-10-19  8:59                             ` Ananyev, Konstantin
  2017-10-19  9:00                             ` Thomas Monjalon
  1 sibling, 0 replies; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-19  8:59 UTC (permalink / raw)
  To: Li, Xiaoyun, Thomas Monjalon
  Cc: Richardson, Bruce, dev, Lu, Wenzhuo, Zhang, Helin, ophirmu



> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Thursday, October 19, 2017 9:51 AM
> To: Thomas Monjalon <thomas@monjalon.net>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>; ophirmu@mellanox.com
> Subject: RE: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
> 
> 
> 
> > -----Original Message-----
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Thursday, October 19, 2017 16:34
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>
> > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> > Bruce <bruce.richardson@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> > <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> > ophirmu@mellanox.com
> > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > memcpy
> >
> > 19/10/2017 09:51, Li, Xiaoyun:
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > Hi
> > > > > > > >
> > > > > > > > The significant change of this patch is to call a function
> > > > > > > > pointer for packet size > 128 (RTE_X86_MEMCPY_THRESH).
> > > > > > > The perf drop is due to function call replacing inline.
> > > > > > >
> > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > I ran memcpy_perf_test which would show the time cost of
> > > > > > > memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > But I just draw pictures and looked at the trend not computed
> > > > > > > the exact percentage. Sorry about that.
> > > > > > > The picture shows results of copy size of 2, 4, 6, 8, 9, 12,
> > > > > > > 16, 32, 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024,
> > > > > > > 1518, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> > > > > > > 5120, 5632, 6144, 6656, 7168,
> > > > > > 7680, 8192.
> > > > > > > In my test, the size grows, the drop degrades. (Using copy
> > > > > > > time indicates the
> > > > > > > perf.) From the trend picture, when the size is smaller than
> > > > > > > 128 bytes, the perf drops a lot, almost 50%. And above 128
> > > > > > > bytes, it approaches the original dpdk.
> > > > > > > I computed it right now, it shows that when greater than 128
> > > > > > > bytes and smaller than 1024 bytes, the perf drops about 15%.
> > > > > > > When above
> > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > >
> > > > > > > > From a test done at Mellanox, there might be a performance
> > > > > > > > degradation of about 15% in testpmd txonly with AVX2.
> > > > > >
> > > > >
> > > > > I did tests on X710, XXV710, X540 and MT27710 but didn't see
> > > > performance degradation.
> > > > >
> > > > > I used command "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n
> > > > > 4 -- -
> > > > I" and set fwd txonly.
> > > > > I tested it on v17.11-rc1, then revert my patch and tested it again.
> > > > > Show port stats all and see the throughput pps. But the results
> > > > > are similar
> > > > and no drop.
> > > > >
> > > > > Did I miss something?
> > > >
> > > > I do not understand. Yesterday you confirmed a 15% drop with buffers
> > > > between
> > > > 128 and 1024 bytes.
> > > > But you do not see this drop in your txonly tests, right?
> > > >
> > > Yes. The drop is using test.
> > > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > > Then run "memcpy_perf_autotest"
> > > The results are the cycles that memory copy costs.
> > > But I just use it to show the trend because I heard that it's not
> > recommended to use micro benchmarks like test_memcpy_perf for memcpy
> > performance report as they aren't likely able to reflect performance of real
> > world applications.
> >
> > Yes real applications can hide the memcpy cost.
> > Sometimes, the cost appear for real :)
> >
> > > Details can be seen at
> > > https://software.intel.com/en-us/articles/performance-optimization-of-
> > > memcpy-in-dpdk
> > >
> > > And I didn't see drop in testpmd txonly test. Maybe it's because not a lot
> > memcpy calls.
> >
> > It has been seen in a mlx4 use-case using more memcpy.
> > I think 15% in micro-benchmark is too much.
> > What can we do? Raise the threshold?
> >
> I think so. If there is big drop, can try raise the threshold. Maybe 1024? but not sure.
> But I didn't reproduce the 15% drop on mellanox and not sure how to verify it.

Can we make it dynamically adjustable then?
A global variable initialized to some default value or so?
Unless you reckon that it would affect performance any further...
Konstantin
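
A rough sketch of what such a dynamically adjustable threshold could look like, under stated assumptions: every name below (adj_memcpy_thresh, adj_set_memcpy_thresh(), adj_memcpy()) is hypothetical and not part of the patch or of DPDK, and both copy paths simply wrap memcpy() for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical global; the default mirrors the 128-byte threshold. */
static size_t adj_memcpy_thresh = 128;

/* Counts calls that took the function-pointer path (illustration only). */
static unsigned long adj_dispatched_calls;

/* Lets an application or test change the threshold at run time. */
static void adj_set_memcpy_thresh(size_t thresh)
{
	adj_memcpy_thresh = thresh;
}

/* Placeholder for the run-time selected ISA-specific routine. */
static void *adj_copy_dispatched(void *dst, const void *src, size_t n)
{
	adj_dispatched_calls++;
	return memcpy(dst, src, n);
}

static void *(*adj_copy_ptr)(void *, const void *, size_t) = adj_copy_dispatched;

static inline void *adj_memcpy(void *dst, const void *src, size_t n)
{
	/* The threshold is now a memory load instead of a constant. */
	if (n <= adj_memcpy_thresh)
		return memcpy(dst, src, n); /* stand-in for the inline path */
	return adj_copy_ptr(dst, src, n);
}
```

One likely cost of this design: because the threshold is read from a variable, the compiler can probably no longer fold the comparison away when the copy size is a compile-time constant, which is the kind of further performance effect raised above.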

> 
> > > > > > Another thing, I will test testpmd txonly with intel nics and
> > > > > > mellanox these days.
> > > > > > And try adjusting the RTE_X86_MEMCPY_THRESH to see if there is
> > > > > > any improvement.
> > > > > >
> > > > > > > > Is there someone else seeing a performance degradation?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-19  8:50                           ` Li, Xiaoyun
  2017-10-19  8:59                             ` Ananyev, Konstantin
@ 2017-10-19  9:00                             ` Thomas Monjalon
  2017-10-19  9:29                               ` Bruce Richardson
  1 sibling, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-19  9:00 UTC (permalink / raw)
  To: Li, Xiaoyun
  Cc: Ananyev, Konstantin, Richardson, Bruce, dev, Lu, Wenzhuo, Zhang,
	Helin, ophirmu

19/10/2017 10:50, Li, Xiaoyun:
> 
> > -----Original Message-----
> > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > Sent: Thursday, October 19, 2017 16:34
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>
> > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> > Bruce <bruce.richardson@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> > <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> > ophirmu@mellanox.com
> > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > memcpy
> > 
> > 19/10/2017 09:51, Li, Xiaoyun:
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > Hi
> > > > > > > >
> > > > > > > > The significant change of this patch is to call a function
> > > > > > > > pointer for packet size > 128 (RTE_X86_MEMCPY_THRESH).
> > > > > > > The perf drop is due to function call replacing inline.
> > > > > > >
> > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > I ran memcpy_perf_test which would show the time cost of
> > > > > > > memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > But I just draw pictures and looked at the trend not computed
> > > > > > > the exact percentage. Sorry about that.
> > > > > > > The picture shows results of copy size of 2, 4, 6, 8, 9, 12,
> > > > > > > 16, 32, 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024,
> > > > > > > 1518, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> > > > > > > 5120, 5632, 6144, 6656, 7168,
> > > > > > 7680, 8192.
> > > > > > > In my test, the size grows, the drop degrades. (Using copy
> > > > > > > time indicates the
> > > > > > > perf.) From the trend picture, when the size is smaller than
> > > > > > > 128 bytes, the perf drops a lot, almost 50%. And above 128
> > > > > > > bytes, it approaches the original dpdk.
> > > > > > > I computed it right now, it shows that when greater than 128
> > > > > > > bytes and smaller than 1024 bytes, the perf drops about 15%.
> > > > > > > When above
> > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > >
> > > > > > > > From a test done at Mellanox, there might be a performance
> > > > > > > > degradation of about 15% in testpmd txonly with AVX2.
> > > > > >
> > > > >
> > > > > I did tests on X710, XXV710, X540 and MT27710 but didn't see
> > > > performance degradation.
> > > > >
> > > > > I used command "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n
> > > > > 4 -- -
> > > > I" and set fwd txonly.
> > > > > I tested it on v17.11-rc1, then revert my patch and tested it again.
> > > > > Show port stats all and see the throughput pps. But the results
> > > > > are similar
> > > > and no drop.
> > > > >
> > > > > Did I miss something?
> > > >
> > > > I do not understand. Yesterday you confirmed a 15% drop with buffers
> > > > between
> > > > 128 and 1024 bytes.
> > > > But you do not see this drop in your txonly tests, right?
> > > >
> > > Yes. The drop is using test.
> > > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > > Then run "memcpy_perf_autotest"
> > > The results are the cycles that memory copy costs.
> > > But I just use it to show the trend because I heard that it's not
> > recommended to use micro benchmarks like test_memcpy_perf for memcpy
> > performance report as they aren't likely able to reflect performance of real
> > world applications.
> > 
> > Yes real applications can hide the memcpy cost.
> > Sometimes, the cost appear for real :)
> > 
> > > Details can be seen at
> > > https://software.intel.com/en-us/articles/performance-optimization-of-
> > > memcpy-in-dpdk
> > >
> > > And I didn't see drop in testpmd txonly test. Maybe it's because not a lot
> > memcpy calls.
> > 
> > It has been seen in a mlx4 use-case using more memcpy.
> > I think 15% in micro-benchmark is too much.
> > What can we do? Raise the threshold?
> > 
> I think so. If there is big drop, can try raise the threshold. Maybe 1024? but not sure.
> But I didn't reproduce the 15% drop on mellanox and not sure how to verify it.

I think we should focus on the micro-benchmark and find a reasonable threshold
for a reasonable drop tradeoff.
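
To make the tradeoff concrete, here is a small sketch that picks the lowest threshold whose measured drop fits a chosen budget. The two bands are the figures reported earlier in this thread for memcpy_perf_autotest (about 15% between 128 and 1024 bytes, about 4% above 1024); the struct, the function name and the budget values are hypothetical, and the sketch assumes the bands are sorted by size with the drop not increasing as size grows.

```c
#include <stddef.h>
#include <stdint.h>

/* One measured band: copies larger than lower_bound see drop_pct slowdown. */
struct drop_band {
	size_t lower_bound;
	double drop_pct;
};

/* Figures quoted in the thread (micro-benchmark on Broadwell). */
static const struct drop_band reported_bands[] = {
	{ 128, 15.0 },
	{ 1024, 4.0 },
};

/*
 * Return the smallest candidate threshold whose band stays within the
 * budget, or SIZE_MAX (never dispatch) if no band qualifies.
 */
static size_t pick_threshold(const struct drop_band *bands, size_t count,
			     double budget_pct)
{
	for (size_t i = 0; i < count; i++) {
		if (bands[i].drop_pct <= budget_pct)
			return bands[i].lower_bound;
	}
	return SIZE_MAX;
}
```

With a 5% budget this selects 1024, matching the value floated above; it says nothing about how a real application would behave.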

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-19  9:00                             ` Thomas Monjalon
@ 2017-10-19  9:29                               ` Bruce Richardson
  2017-10-20  1:02                                 ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Bruce Richardson @ 2017-10-19  9:29 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Li, Xiaoyun, Ananyev, Konstantin, dev, Lu, Wenzhuo, Zhang, Helin,
	ophirmu

On Thu, Oct 19, 2017 at 11:00:33AM +0200, Thomas Monjalon wrote:
> 19/10/2017 10:50, Li, Xiaoyun:
> > 
> > > -----Original Message-----
> > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > Sent: Thursday, October 19, 2017 16:34
> > > To: Li, Xiaoyun <xiaoyun.li@intel.com>
> > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> > > Bruce <bruce.richardson@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> > > <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> > > ophirmu@mellanox.com
> > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > > memcpy
> > > 
> > > 19/10/2017 09:51, Li, Xiaoyun:
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > > Hi
> > > > > > > > >
> > > > > > > > > The significant change of this patch is to call a function
> > > > > > > > > pointer for packet size > 128 (RTE_X86_MEMCPY_THRESH).
> > > > > > > > The perf drop is due to function call replacing inline.
> > > > > > > >
> > > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > > I ran memcpy_perf_test which would show the time cost of
> > > > > > > > memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > > But I just draw pictures and looked at the trend not computed
> > > > > > > > the exact percentage. Sorry about that.
> > > > > > > > The picture shows results of copy size of 2, 4, 6, 8, 9, 12,
> > > > > > > > 16, 32, 64, 128, 192, 256, 320, 384, 448, 512, 768, 1024,
> > > > > > > > 1518, 1522, 1536, 1600, 2048, 2560, 3072, 3584, 4096, 4608,
> > > > > > > > 5120, 5632, 6144, 6656, 7168,
> > > > > > > 7680, 8192.
> > > > > > > > In my test, the size grows, the drop degrades. (Using copy
> > > > > > > > time indicates the
> > > > > > > > perf.) From the trend picture, when the size is smaller than
> > > > > > > > 128 bytes, the perf drops a lot, almost 50%. And above 128
> > > > > > > > bytes, it approaches the original dpdk.
> > > > > > > > I computed it right now, it shows that when greater than 128
> > > > > > > > bytes and smaller than 1024 bytes, the perf drops about 15%.
> > > > > > > > When above
> > > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > > >
> > > > > > > > > From a test done at Mellanox, there might be a performance
> > > > > > > > > degradation of about 15% in testpmd txonly with AVX2.
> > > > > > >
> > > > > >
> > > > > > I did tests on X710, XXV710, X540 and MT27710 but didn't see
> > > > > performance degradation.
> > > > > >
> > > > > > I used command "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n
> > > > > > 4 -- -
> > > > > I" and set fwd txonly.
> > > > > > I tested it on v17.11-rc1, then revert my patch and tested it again.
> > > > > > Show port stats all and see the throughput pps. But the results
> > > > > > are similar
> > > > > and no drop.
> > > > > >
> > > > > > Did I miss something?
> > > > >
> > > > > I do not understand. Yesterday you confirmed a 15% drop with buffers
> > > > > between
> > > > > 128 and 1024 bytes.
> > > > > But you do not see this drop in your txonly tests, right?
> > > > >
> > > > Yes. The drop is using test.
> > > > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > > > Then run "memcpy_perf_autotest"
> > > > The results are the cycles that memory copy costs.
> > > > But I just use it to show the trend because I heard that it's not
> > > recommended to use micro benchmarks like test_memcpy_perf for memcpy
> > > performance report as they aren't likely able to reflect performance of real
> > > world applications.
> > > 
> > > Yes real applications can hide the memcpy cost.
> > > Sometimes, the cost appear for real :)
> > > 
> > > > Details can be seen at
> > > > https://software.intel.com/en-us/articles/performance-optimization-of-
> > > > memcpy-in-dpdk
> > > >
> > > > And I didn't see drop in testpmd txonly test. Maybe it's because not a lot
> > > memcpy calls.
> > > 
> > > It has been seen in a mlx4 use-case using more memcpy.
> > > I think 15% in micro-benchmark is too much.
> > > What can we do? Raise the threshold?
> > > 
> > I think so. If there is big drop, can try raise the threshold. Maybe 1024? but not sure.
> > But I didn't reproduce the 15% drop on mellanox and not sure how to verify it.
> 
> I think we should focus on micro-benchmark and find a reasonnable threshold
> for a reasonnable drop tradeoff.
>
Sadly, it may not be that simple. What shows best performance for
micro-benchmarks may not show the same effect in a real application.

/Bruce

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-19  9:29                               ` Bruce Richardson
@ 2017-10-20  1:02                                 ` Li, Xiaoyun
  2017-10-25  6:55                                   ` Li, Xiaoyun
  0 siblings, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-20  1:02 UTC (permalink / raw)
  To: Richardson, Bruce, Thomas Monjalon
  Cc: Ananyev, Konstantin, dev, Lu, Wenzhuo, Zhang, Helin, ophirmu



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Thursday, October 19, 2017 17:30
> To: Thomas Monjalon <thomas@monjalon.net>
> Cc: Li, Xiaoyun <xiaoyun.li@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> ophirmu@mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> On Thu, Oct 19, 2017 at 11:00:33AM +0200, Thomas Monjalon wrote:
> > 19/10/2017 10:50, Li, Xiaoyun:
> > >
> > > > -----Original Message-----
> > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > Sent: Thursday, October 19, 2017 16:34
> > > > To: Li, Xiaoyun <xiaoyun.li@intel.com>
> > > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > Richardson, Bruce <bruce.richardson@intel.com>; dev@dpdk.org; Lu,
> > > > Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch
> > > > over memcpy
> > > >
> > > > 19/10/2017 09:51, Li, Xiaoyun:
> > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > > > Hi
> > > > > > > > > >
> > > > > > > > > > The significant change of this patch is to call a
> > > > > > > > > > function pointer for packet size > 128
> (RTE_X86_MEMCPY_THRESH).
> > > > > > > > > The perf drop is due to function call replacing inline.
> > > > > > > > >
> > > > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > > > I ran memcpy_perf_test which would show the time cost of
> > > > > > > > > memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > > > But I just draw pictures and looked at the trend not
> > > > > > > > > computed the exact percentage. Sorry about that.
> > > > > > > > > The picture shows results of copy size of 2, 4, 6, 8, 9,
> > > > > > > > > 12, 16, 32, 64, 128, 192, 256, 320, 384, 448, 512, 768,
> > > > > > > > > 1024, 1518, 1522, 1536, 1600, 2048, 2560, 3072, 3584,
> > > > > > > > > 4096, 4608, 5120, 5632, 6144, 6656, 7168,
> > > > > > > > 7680, 8192.
> > > > > > > > > In my test, the size grows, the drop degrades. (Using
> > > > > > > > > copy time indicates the
> > > > > > > > > perf.) From the trend picture, when the size is smaller
> > > > > > > > > than
> > > > > > > > > 128 bytes, the perf drops a lot, almost 50%. And above
> > > > > > > > > 128 bytes, it approaches the original dpdk.
> > > > > > > > > I computed it right now, it shows that when greater than
> > > > > > > > > 128 bytes and smaller than 1024 bytes, the perf drops about
> 15%.
> > > > > > > > > When above
> > > > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > > > >
> > > > > > > > > > From a test done at Mellanox, there might be a
> > > > > > > > > > performance degradation of about 15% in testpmd txonly
> with AVX2.
> > > > > > > >
> > > > > > >
> > > > > > > I did tests on X710, XXV710, X540 and MT27710 but didn't see
> > > > > > performance degradation.
> > > > > > >
> > > > > > > I used command "./x86_64-native-linuxapp-gcc/app/testpmd -c
> > > > > > > 0xf -n
> > > > > > > 4 -- -
> > > > > > I" and set fwd txonly.
> > > > > > > I tested it on v17.11-rc1, then revert my patch and tested it again.
> > > > > > > Show port stats all and see the throughput pps. But the
> > > > > > > results are similar
> > > > > > and no drop.
> > > > > > >
> > > > > > > Did I miss something?
> > > > > >
> > > > > > I do not understand. Yesterday you confirmed a 15% drop with
> > > > > > buffers between
> > > > > > 128 and 1024 bytes.
> > > > > > But you do not see this drop in your txonly tests, right?
> > > > > >
> > > > > Yes. The drop is using test.
> > > > > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > > > > Then run "memcpy_perf_autotest"
> > > > > The results are the cycles that memory copy costs.
> > > > > But I just use it to show the trend because I heard that it's
> > > > > not
> > > > recommended to use micro benchmarks like test_memcpy_perf for
> > > > memcpy performance report as they aren't likely able to reflect
> > > > performance of real world applications.
> > > >
> > > > Yes real applications can hide the memcpy cost.
> > > > Sometimes, the cost appear for real :)
> > > >
> > > > > Details can be seen at
> > > > > https://software.intel.com/en-us/articles/performance-optimizati
> > > > > on-of-
> > > > > memcpy-in-dpdk
> > > > >
> > > > > And I didn't see drop in testpmd txonly test. Maybe it's because
> > > > > not a lot
> > > > memcpy calls.
> > > >
> > > > It has been seen in a mlx4 use-case using more memcpy.
> > > > I think 15% in micro-benchmark is too much.
> > > > What can we do? Raise the threshold?
> > > >
> > > I think so. If there is big drop, can try raise the threshold. Maybe 1024?
> but not sure.
> > > But I didn't reproduce the 15% drop on mellanox and not sure how to
> verify it.
> >
> > I think we should focus on micro-benchmark and find a reasonnable
> > threshold for a reasonnable drop tradeoff.
> >
> Sadly, it may not be that simple. What shows best performance for micro-
> benchmarks may not show the same effect in a real application.
> 
> /Bruce

Then how should we measure the performance?

Also, I cannot reproduce the 15% drop on Mellanox.
Could the person who measured the 15% drop run the test again with a threshold of 1024 and see if there is any improvement?

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-20  1:02                                 ` Li, Xiaoyun
@ 2017-10-25  6:55                                   ` Li, Xiaoyun
  2017-10-25  7:25                                     ` Thomas Monjalon
  2017-10-25  8:50                                     ` Ananyev, Konstantin
  0 siblings, 2 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-25  6:55 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce, Thomas Monjalon
  Cc: Ananyev, Konstantin, dev, Lu, Wenzhuo, Zhang, Helin, ophirmu

Hi

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Li, Xiaoyun
> Sent: Friday, October 20, 2017 09:03
> To: Richardson, Bruce <bruce.richardson@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org;
> Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; ophirmu@mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 
> 
> > -----Original Message-----
> > From: Richardson, Bruce
> > Sent: Thursday, October 19, 2017 17:30
> > To: Thomas Monjalon <thomas@monjalon.net>
> > Cc: Li, Xiaoyun <xiaoyun.li@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> > <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> > ophirmu@mellanox.com
> > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > memcpy
> >
> > On Thu, Oct 19, 2017 at 11:00:33AM +0200, Thomas Monjalon wrote:
> > > 19/10/2017 10:50, Li, Xiaoyun:
> > > >
> > > > > -----Original Message-----
> > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > Sent: Thursday, October 19, 2017 16:34
> > > > > To: Li, Xiaoyun <xiaoyun.li@intel.com>
> > > > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > > Richardson, Bruce <bruce.richardson@intel.com>; dev@dpdk.org;
> > > > > Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > > > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > > > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time
> > > > > dispatch over memcpy
> > > > >
> > > > > 19/10/2017 09:51, Li, Xiaoyun:
> > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > > > > Hi
> > > > > > > > > > >
> > > > > > > > > > > The significant change of this patch is to call a
> > > > > > > > > > > function pointer for packet size > 128
> > (RTE_X86_MEMCPY_THRESH).
> > > > > > > > > > The perf drop is due to function call replacing inline.
> > > > > > > > > >
> > > > > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > > > > I ran memcpy_perf_test which would show the time cost
> > > > > > > > > > of memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > > > > But I just draw pictures and looked at the trend not
> > > > > > > > > > computed the exact percentage. Sorry about that.
> > > > > > > > > > The picture shows results of copy size of 2, 4, 6, 8,
> > > > > > > > > > 9, 12, 16, 32, 64, 128, 192, 256, 320, 384, 448, 512,
> > > > > > > > > > 768, 1024, 1518, 1522, 1536, 1600, 2048, 2560, 3072,
> > > > > > > > > > 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168,
> > > > > > > > > 7680, 8192.
> > > > > > > > > > In my test, the size grows, the drop degrades. (Using
> > > > > > > > > > copy time indicates the
> > > > > > > > > > perf.) From the trend picture, when the size is
> > > > > > > > > > smaller than
> > > > > > > > > > 128 bytes, the perf drops a lot, almost 50%. And above
> > > > > > > > > > 128 bytes, it approaches the original dpdk.
> > > > > > > > > > I computed it right now, it shows that when greater
> > > > > > > > > > than
> > > > > > > > > > 128 bytes and smaller than 1024 bytes, the perf drops
> > > > > > > > > > about
> > 15%.
> > > > > > > > > > When above
> > > > > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > > > > >
> > > > > > > > > > > From a test done at Mellanox, there might be a
> > > > > > > > > > > performance degradation of about 15% in testpmd
> > > > > > > > > > > txonly
> > with AVX2.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I did tests on X710, XXV710, X540 and MT27710 but didn't
> > > > > > > > see
> > > > > > > performance degradation.
> > > > > > > >
> > > > > > > > I used command "./x86_64-native-linuxapp-gcc/app/testpmd
> > > > > > > > -c 0xf -n
> > > > > > > > 4 -- -
> > > > > > > I" and set fwd txonly.
> > > > > > > > I tested it on v17.11-rc1, then revert my patch and tested it
> again.
> > > > > > > > Show port stats all and see the throughput pps. But the
> > > > > > > > results are similar
> > > > > > > and no drop.
> > > > > > > >
> > > > > > > > Did I miss something?
> > > > > > >
> > > > > > > I do not understand. Yesterday you confirmed a 15% drop with
> > > > > > > buffers between
> > > > > > > 128 and 1024 bytes.
> > > > > > > But you do not see this drop in your txonly tests, right?
> > > > > > >
> > > > > > Yes. The drop is using test.
> > > > > > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > > > > > Then run "memcpy_perf_autotest"
> > > > > > The results are the cycles that memory copy costs.
> > > > > > But I just use it to show the trend because I heard that it's
> > > > > > not
> > > > > recommended to use micro benchmarks like test_memcpy_perf for
> > > > > memcpy performance report as they aren't likely able to reflect
> > > > > performance of real world applications.
> > > > >
> > > > > Yes real applications can hide the memcpy cost.
> > > > > Sometimes, the cost appear for real :)
> > > > >
> > > > > > Details can be seen at
> > > > > > https://software.intel.com/en-us/articles/performance-optimiza
> > > > > > ti
> > > > > > on-of-
> > > > > > memcpy-in-dpdk
> > > > > >
> > > > > > And I didn't see drop in testpmd txonly test. Maybe it's
> > > > > > because not a lot
> > > > > memcpy calls.
> > > > >
> > > > > It has been seen in a mlx4 use-case using more memcpy.
> > > > > I think 15% in micro-benchmark is too much.
> > > > > What can we do? Raise the threshold?
> > > > >
> > > > I think so. If there is big drop, can try raise the threshold. Maybe 1024?
> > but not sure.
> > > > But I didn't reproduce the 15% drop on mellanox and not sure how
> > > > to
> > verify it.
> > >
> > > I think we should focus on micro-benchmark and find a reasonnable
> > > threshold for a reasonnable drop tradeoff.
> > >
> > Sadly, it may not be that simple. What shows best performance for
> > micro- benchmarks may not show the same effect in a real application.
> >
> > /Bruce
> 
> Then how to measure the performance?
> 
> And I cannot reproduce 15% drop on mellanox.
> Could the person who tested 15% drop help to do test again with 1024
> threshold and see if there is any improvement?

As Bruce said, the best performance in a micro-benchmark may not show the same effect in real applications.
I cannot reproduce the 15% drop, and I do not know whether raising the threshold would improve the performance.
Could the person who measured the 15% drop run the test again with a threshold of 1024 and see if there is any improvement?

Best Regards
Xiaoyun Li

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-25  6:55                                   ` Li, Xiaoyun
@ 2017-10-25  7:25                                     ` Thomas Monjalon
  2017-10-29  8:49                                       ` Thomas Monjalon
  2017-10-25  8:50                                     ` Ananyev, Konstantin
  1 sibling, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-25  7:25 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce
  Cc: dev, Ananyev, Konstantin, Lu, Wenzhuo, Zhang, Helin, ophirmu

25/10/2017 08:55, Li, Xiaoyun:
> From: Li, Xiaoyun
> > From: Richardson, Bruce
> > > On Thu, Oct 19, 2017 at 11:00:33AM +0200, Thomas Monjalon wrote:
> > > > 19/10/2017 10:50, Li, Xiaoyun:
> > > > > From: Thomas Monjalon
> > > > > > 19/10/2017 09:51, Li, Xiaoyun:
> > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > > > > > Hi
> > > > > > > > > > > >
> > > > > > > > > > > > The significant change of this patch is to call a
> > > > > > > > > > > > function pointer for packet size > 128
> > > (RTE_X86_MEMCPY_THRESH).
> > > > > > > > > > > The perf drop is due to function call replacing inline.
> > > > > > > > > > >
> > > > > > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > > > > > I ran memcpy_perf_test which would show the time cost
> > > > > > > > > > > of memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > > > > > But I just draw pictures and looked at the trend not
> > > > > > > > > > > computed the exact percentage. Sorry about that.
> > > > > > > > > > > The picture shows results of copy size of 2, 4, 6, 8,
> > > > > > > > > > > 9, 12, 16, 32, 64, 128, 192, 256, 320, 384, 448, 512,
> > > > > > > > > > > 768, 1024, 1518, 1522, 1536, 1600, 2048, 2560, 3072,
> > > > > > > > > > > 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168,
> > > > > > > > > > 7680, 8192.
> > > > > > > > > > > In my test, the size grows, the drop degrades. (Using
> > > > > > > > > > > copy time indicates the
> > > > > > > > > > > perf.) From the trend picture, when the size is
> > > > > > > > > > > smaller than
> > > > > > > > > > > 128 bytes, the perf drops a lot, almost 50%. And above
> > > > > > > > > > > 128 bytes, it approaches the original dpdk.
> > > > > > > > > > > I computed it right now, it shows that when greater
> > > > > > > > > > > than
> > > > > > > > > > > 128 bytes and smaller than 1024 bytes, the perf drops
> > > > > > > > > > > about
> > > 15%.
> > > > > > > > > > > When above
> > > > > > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > > > > > >
> > > > > > > > > > > > From a test done at Mellanox, there might be a
> > > > > > > > > > > > performance degradation of about 15% in testpmd
> > > > > > > > > > > > txonly
> > > with AVX2.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > I did tests on X710, XXV710, X540 and MT27710 but didn't
> > > > > > > > > > see performance degradation.
> > > > > > > > > >
> > > > > > > > > > I used command
> > > > > > > > > > "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -I"
> > > > > > > > > > and set fwd txonly.
> > > > > > > > > > I tested it on v17.11-rc1, then revert my patch and tested it
> > > > > > > > > > again.
> > > > > > > > > > Show port stats all and see the throughput pps. But the
> > > > > > > > > > results are similar and no drop.
> > > > > > > > >
> > > > > > > > > Did I miss something?
> > > > > > > >
> > > > > > > > I do not understand. Yesterday you confirmed a 15% drop with
> > > > > > > > buffers between
> > > > > > > > 128 and 1024 bytes.
> > > > > > > > But you do not see this drop in your txonly tests, right?
> > > > > > > >
> > > > > > > Yes. The drop is using test.
> > > > > > > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > > > > > > Then run "memcpy_perf_autotest"
> > > > > > > The results are the cycles that memory copy costs.
> > > > > > > But I just use it to show the trend because I heard that it's
> > > > > > > not
> > > > > > recommended to use micro benchmarks like test_memcpy_perf for
> > > > > > memcpy performance report as they aren't likely able to reflect
> > > > > > performance of real world applications.
> > > > > >
> > > > > > Yes real applications can hide the memcpy cost.
> > > > > > Sometimes, the cost appear for real :)
> > > > > >
> > > > > > > Details can be seen at
> > > > > > > https://software.intel.com/en-us/articles/performance-optimiza
> > > > > > > ti
> > > > > > > on-of-
> > > > > > > memcpy-in-dpdk
> > > > > > >
> > > > > > > And I didn't see drop in testpmd txonly test. Maybe it's
> > > > > > > because not a lot
> > > > > > memcpy calls.
> > > > > >
> > > > > > It has been seen in a mlx4 use-case using more memcpy.
> > > > > > I think 15% in micro-benchmark is too much.
> > > > > > What can we do? Raise the threshold?
> > > > > >
> > > > > I think so. If there is big drop, can try raise the threshold. Maybe 1024?
> > > but not sure.
> > > > > But I didn't reproduce the 15% drop on mellanox and not sure how
> > > > > to
> > > verify it.
> > > >
> > > > I think we should focus on micro-benchmark and find a reasonnable
> > > > threshold for a reasonnable drop tradeoff.
> > > >
> > > Sadly, it may not be that simple. What shows best performance for
> > > micro- benchmarks may not show the same effect in a real application.
> > >
> > > /Bruce
> > 
> > Then how to measure the performance?
> > 
> > And I cannot reproduce 15% drop on mellanox.
> > Could the person who tested 15% drop help to do test again with 1024
> > threshold and see if there is any improvement?
> 
> As Bruce said, best performance on micro-benchmark may not show the same effect in real applications.

Yes, real applications may hide the impact.
You keep saying that this is a reason to allow degrading raw memcpy perf.
But can you see better performance with buffers of 256 bytes in
any application thanks to your patch?
I am not sure there is a benefit in keeping code which implies
a significant drop in micro-benchmarks.

> And I cannot reproduce the 15% drop.
> And I don't know if raising the threshold can improve the perf or not.
> Could the person who tested 15% drop help to do test again with 1024 threshold and see if there is any improvement?

We will test an increased threshold today.

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-25  6:55                                   ` Li, Xiaoyun
  2017-10-25  7:25                                     ` Thomas Monjalon
@ 2017-10-25  8:50                                     ` Ananyev, Konstantin
  2017-10-25  8:54                                       ` Li, Xiaoyun
  1 sibling, 1 reply; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-25  8:50 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce, Thomas Monjalon
  Cc: dev, Lu, Wenzhuo, Zhang, Helin, ophirmu



> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Wednesday, October 25, 2017 7:55 AM
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; ophirmu@mellanox.com
> Subject: RE: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
> 
> Hi
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Li, Xiaoyun
> > Sent: Friday, October 20, 2017 09:03
> > To: Richardson, Bruce <bruce.richardson@intel.com>; Thomas Monjalon
> > <thomas@monjalon.net>
> > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org;
> > Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > memcpy
> >
> >
> >
> > > -----Original Message-----
> > > From: Richardson, Bruce
> > > Sent: Thursday, October 19, 2017 17:30
> > > To: Thomas Monjalon <thomas@monjalon.net>
> > > Cc: Li, Xiaoyun <xiaoyun.li@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> > > <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> > > ophirmu@mellanox.com
> > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > > memcpy
> > >
> > > On Thu, Oct 19, 2017 at 11:00:33AM +0200, Thomas Monjalon wrote:
> > > > 19/10/2017 10:50, Li, Xiaoyun:
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > Sent: Thursday, October 19, 2017 16:34
> > > > > > To: Li, Xiaoyun <xiaoyun.li@intel.com>
> > > > > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > > > Richardson, Bruce <bruce.richardson@intel.com>; dev@dpdk.org;
> > > > > > Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > > > > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > > > > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time
> > > > > > dispatch over memcpy
> > > > > >
> > > > > > 19/10/2017 09:51, Li, Xiaoyun:
> > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > > > > > Hi
> > > > > > > > > > > >
> > > > > > > > > > > > The significant change of this patch is to call a
> > > > > > > > > > > > function pointer for packet size > 128
> > > (RTE_X86_MEMCPY_THRESH).
> > > > > > > > > > > The perf drop is due to function call replacing inline.
> > > > > > > > > > >
> > > > > > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > > > > > I ran memcpy_perf_test which would show the time cost
> > > > > > > > > > > of memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > > > > > But I just draw pictures and looked at the trend not
> > > > > > > > > > > computed the exact percentage. Sorry about that.
> > > > > > > > > > > The picture shows results of copy size of 2, 4, 6, 8,
> > > > > > > > > > > 9, 12, 16, 32, 64, 128, 192, 256, 320, 384, 448, 512,
> > > > > > > > > > > 768, 1024, 1518, 1522, 1536, 1600, 2048, 2560, 3072,
> > > > > > > > > > > 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168,
> > > > > > > > > > 7680, 8192.
> > > > > > > > > > > In my test, the size grows, the drop degrades. (Using
> > > > > > > > > > > copy time indicates the
> > > > > > > > > > > perf.) From the trend picture, when the size is
> > > > > > > > > > > smaller than
> > > > > > > > > > > 128 bytes, the perf drops a lot, almost 50%. And above
> > > > > > > > > > > 128 bytes, it approaches the original dpdk.
> > > > > > > > > > > I computed it right now, it shows that when greater
> > > > > > > > > > > than
> > > > > > > > > > > 128 bytes and smaller than 1024 bytes, the perf drops
> > > > > > > > > > > about
> > > 15%.
> > > > > > > > > > > When above
> > > > > > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > > > > > >
> > > > > > > > > > > > From a test done at Mellanox, there might be a
> > > > > > > > > > > > performance degradation of about 15% in testpmd
> > > > > > > > > > > > txonly
> > > with AVX2.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > > > I did tests on X710, XXV710, X540 and MT27710 but didn't
> > > > > > > > > > > see performance degradation.
> > > > > > > > > > >
> > > > > > > > > > > I used command
> > > > > > > > > > > "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -I"
> > > > > > > > > > > and set fwd txonly.
> > > > > > > > > > > I tested it on v17.11-rc1, then revert my patch and tested it
> > > > > > > > > > > again.
> > > > > > > > > > > Show port stats all and see the throughput pps. But the
> > > > > > > > > > > results are similar and no drop.
> > > > > > > > >
> > > > > > > > > Did I miss something?
> > > > > > > >
> > > > > > > > I do not understand. Yesterday you confirmed a 15% drop with
> > > > > > > > buffers between
> > > > > > > > 128 and 1024 bytes.
> > > > > > > > But you do not see this drop in your txonly tests, right?
> > > > > > > >
> > > > > > > Yes. The drop is using test.
> > > > > > > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > > > > > > Then run "memcpy_perf_autotest"
> > > > > > > The results are the cycles that memory copy costs.
> > > > > > > But I just use it to show the trend because I heard that it's
> > > > > > > not
> > > > > > recommended to use micro benchmarks like test_memcpy_perf for
> > > > > > memcpy performance report as they aren't likely able to reflect
> > > > > > performance of real world applications.
> > > > > >
> > > > > > Yes real applications can hide the memcpy cost.
> > > > > > Sometimes, the cost appear for real :)
> > > > > >
> > > > > > > Details can be seen at
> > > > > > > https://software.intel.com/en-us/articles/performance-optimiza
> > > > > > > ti
> > > > > > > on-of-
> > > > > > > memcpy-in-dpdk
> > > > > > >
> > > > > > > And I didn't see drop in testpmd txonly test. Maybe it's
> > > > > > > because not a lot
> > > > > > memcpy calls.
> > > > > >
> > > > > > It has been seen in a mlx4 use-case using more memcpy.
> > > > > > I think 15% in micro-benchmark is too much.
> > > > > > What can we do? Raise the threshold?
> > > > > >
> > > > > I think so. If there is big drop, can try raise the threshold. Maybe 1024?
> > > but not sure.
> > > > > But I didn't reproduce the 15% drop on mellanox and not sure how
> > > > > to
> > > verify it.
> > > >
> > > > I think we should focus on micro-benchmark and find a reasonnable
> > > > threshold for a reasonnable drop tradeoff.
> > > >
> > > Sadly, it may not be that simple. What shows best performance for
> > > micro- benchmarks may not show the same effect in a real application.
> > >
> > > /Bruce
> >
> > Then how to measure the performance?
> >
> > And I cannot reproduce 15% drop on mellanox.
> > Could the person who tested 15% drop help to do test again with 1024
> > threshold and see if there is any improvement?
> 
> As Bruce said, best performance on micro-benchmark may not show the same effect in real applications.
> And I cannot reproduce the 15% drop.
> And I don't know if raising the threshold can improve the perf or not.
> Could the person who tested 15% drop help to do test again with 1024 threshold and see if there is any improvement?

As I already asked before - why not make that threshold dynamic?
Konstantin
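
For illustration, a dynamic threshold could look roughly like the sketch
below. It is not the code from the patch: every demo_ name is hypothetical,
the byte loop stands in for the inline SSE/AVX path, and plain memcpy()
stands in for the ISA-specific routine bound at constructor time. The only
point is that the cut-off between the inline path and the function-pointer
path becomes a variable instead of the RTE_X86_MEMCPY_THRESH constant.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names throughout; none of them come from the patch. */
typedef void *(*demo_copy_fn)(void *dst, const void *src, size_t n);

/* Counts how often the function-pointer path is taken (demo only). */
static unsigned long demo_dispatched_calls;

/* Stand-in for the ISA-specific routine chosen at constructor time. */
static void *demo_copy_dispatched(void *dst, const void *src, size_t n)
{
	demo_dispatched_calls++;
	return memcpy(dst, src, n);
}

static demo_copy_fn demo_copy_ptr = demo_copy_dispatched;

/* Run-time threshold; 128 mirrors the value discussed in the thread. */
static size_t demo_copy_thresh = 128;

static inline void demo_set_copy_thresh(size_t thresh)
{
	demo_copy_thresh = thresh;
}

/* Stand-in for the inline code path used for small copies. */
static inline void *demo_copy_inline(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Sizes up to the threshold stay inline, larger ones go via the pointer. */
static inline void *demo_copy(void *dst, const void *src, size_t n)
{
	if (n <= demo_copy_thresh)
		return demo_copy_inline(dst, src, n);
	return (*demo_copy_ptr)(dst, src, n);
}
```

With such a layout, raising the threshold at run time (for example
demo_set_copy_thresh(1024)) moves more copies back to the inline path
without a rebuild, at the cost of one extra load and compare per call.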

> 
> Best Regards
> Xiaoyun Li
> 
> 

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-25  8:50                                     ` Ananyev, Konstantin
@ 2017-10-25  8:54                                       ` Li, Xiaoyun
  2017-10-25  9:00                                         ` Thomas Monjalon
  2017-10-25  9:14                                         ` Ananyev, Konstantin
  0 siblings, 2 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-25  8:54 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce, Thomas Monjalon
  Cc: dev, Lu, Wenzhuo, Zhang, Helin, ophirmu



> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Wednesday, October 25, 2017 16:51
> To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>; Thomas Monjalon <thomas@monjalon.net>
> Cc: dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> <helin.zhang@intel.com>; ophirmu@mellanox.com
> Subject: RE: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 
> 
> > -----Original Message-----
> > From: Li, Xiaoyun
> > Sent: Wednesday, October 25, 2017 7:55 AM
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>; Thomas Monjalon <thomas@monjalon.net>
> > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org;
> > Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > Subject: RE: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > memcpy
> >
> > Hi
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Li, Xiaoyun
> > > Sent: Friday, October 20, 2017 09:03
> > > To: Richardson, Bruce <bruce.richardson@intel.com>; Thomas Monjalon
> > > <thomas@monjalon.net>
> > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch
> > > over memcpy
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Richardson, Bruce
> > > > Sent: Thursday, October 19, 2017 17:30
> > > > To: Thomas Monjalon <thomas@monjalon.net>
> > > > Cc: Li, Xiaoyun <xiaoyun.li@intel.com>; Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> > > > <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> > > > ophirmu@mellanox.com
> > > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch
> > > > over memcpy
> > > >
> > > > On Thu, Oct 19, 2017 at 11:00:33AM +0200, Thomas Monjalon wrote:
> > > > > 19/10/2017 10:50, Li, Xiaoyun:
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > Sent: Thursday, October 19, 2017 16:34
> > > > > > > To: Li, Xiaoyun <xiaoyun.li@intel.com>
> > > > > > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > > > > Richardson, Bruce <bruce.richardson@intel.com>;
> > > > > > > dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang,
> > > > > > > Helin <helin.zhang@intel.com>; ophirmu@mellanox.com
> > > > > > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time
> > > > > > > dispatch over memcpy
> > > > > > >
> > > > > > > 19/10/2017 09:51, Li, Xiaoyun:
> > > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > > > > > > Hi
> > > > > > > > > > > > >
> > > > > > > > > > > > > The significant change of this patch is to call
> > > > > > > > > > > > > a function pointer for packet size > 128
> > > > (RTE_X86_MEMCPY_THRESH).
> > > > > > > > > > > > The perf drop is due to function call replacing inline.
> > > > > > > > > > > >
> > > > > > > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > > > > > > I ran memcpy_perf_test which would show the time
> > > > > > > > > > > > cost of memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > > > > > > But I just draw pictures and looked at the trend
> > > > > > > > > > > > not computed the exact percentage. Sorry about that.
> > > > > > > > > > > > The picture shows results of copy size of 2, 4, 6,
> > > > > > > > > > > > 8, 9, 12, 16, 32, 64, 128, 192, 256, 320, 384,
> > > > > > > > > > > > 448, 512, 768, 1024, 1518, 1522, 1536, 1600, 2048,
> > > > > > > > > > > > 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144,
> > > > > > > > > > > > 6656, 7168,
> > > > > > > > > > > 7680, 8192.
> > > > > > > > > > > > In my test, the size grows, the drop degrades.
> > > > > > > > > > > > (Using copy time indicates the
> > > > > > > > > > > > perf.) From the trend picture, when the size is
> > > > > > > > > > > > smaller than
> > > > > > > > > > > > 128 bytes, the perf drops a lot, almost 50%. And
> > > > > > > > > > > > above
> > > > > > > > > > > > 128 bytes, it approaches the original dpdk.
> > > > > > > > > > > > I computed it right now, it shows that when
> > > > > > > > > > > > greater than
> > > > > > > > > > > > 128 bytes and smaller than 1024 bytes, the perf
> > > > > > > > > > > > drops about
> > > > 15%.
> > > > > > > > > > > > When above
> > > > > > > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > > > > > > >
> > > > > > > > > > > > > From a test done at Mellanox, there might be a
> > > > > > > > > > > > > performance degradation of about 15% in testpmd
> > > > > > > > > > > > > txonly
> > > > with AVX2.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > I did tests on X710, XXV710, X540 and MT27710 but
> > > > > > > > > > > > didn't see performance degradation.
> > > > > > > > > > > >
> > > > > > > > > > > > I used command
> > > > > > > > > > > > "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -I"
> > > > > > > > > > > > and set fwd txonly.
> > > > > > > > > > > > I tested it on v17.11-rc1, then revert my patch and
> > > > > > > > > > > > tested it again.
> > > > > > > > > > > > Show port stats all and see the throughput pps. But
> > > > > > > > > > > > the results are similar and no drop.
> > > > > > > > > >
> > > > > > > > > > Did I miss something?
> > > > > > > > >
> > > > > > > > > I do not understand. Yesterday you confirmed a 15% drop
> > > > > > > > > with buffers between
> > > > > > > > > 128 and 1024 bytes.
> > > > > > > > > But you do not see this drop in your txonly tests, right?
> > > > > > > > >
> > > > > > > > Yes. The drop is using test.
> > > > > > > > Using command "make test -j" and then " ./build/app/test -c f -n
> 4 "
> > > > > > > > Then run "memcpy_perf_autotest"
> > > > > > > > The results are the cycles that memory copy costs.
> > > > > > > > But I just use it to show the trend because I heard that
> > > > > > > > it's not
> > > > > > > recommended to use micro benchmarks like test_memcpy_perf
> > > > > > > for memcpy performance report as they aren't likely able to
> > > > > > > reflect performance of real world applications.
> > > > > > >
> > > > > > > Yes real applications can hide the memcpy cost.
> > > > > > > Sometimes, the cost appear for real :)
> > > > > > >
> > > > > > > > Details can be seen at
> > > > > > > > > > https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk
> > > > > > > >
> > > > > > > > And I didn't see drop in testpmd txonly test. Maybe it's
> > > > > > > > because not a lot
> > > > > > > memcpy calls.
> > > > > > >
> > > > > > > It has been seen in a mlx4 use-case using more memcpy.
> > > > > > > I think 15% in micro-benchmark is too much.
> > > > > > > What can we do? Raise the threshold?
> > > > > > >
> > > > > > I think so. If there is big drop, can try raise the threshold. Maybe
> 1024?
> > > > but not sure.
> > > > > > But I didn't reproduce the 15% drop on mellanox and not sure
> > > > > > how to
> > > > verify it.
> > > > >
> > > > > I think we should focus on micro-benchmark and find a
> > > > > reasonnable threshold for a reasonnable drop tradeoff.
> > > > >
> > > > Sadly, it may not be that simple. What shows best performance for
> > > > micro- benchmarks may not show the same effect in a real application.
> > > >
> > > > /Bruce
> > >
> > > Then how to measure the performance?
> > >
> > > And I cannot reproduce 15% drop on mellanox.
> > > Could the person who tested 15% drop help to do test again with 1024
> > > threshold and see if there is any improvement?
> >
> > As Bruce said, best performance on micro-benchmark may not show the
> same effect in real applications.
> > And I cannot reproduce the 15% drop.
> > And I don't know if raising the threshold can improve the perf or not.
> > Could the person who tested 15% drop help to do test again with 1024
> threshold and see if there is any improvement?
> 
> As I already asked before - why not to make that threshold dynamic?
> Konstantin
> 
I want to confirm first that raising the threshold is useful. Then we can make it dynamic and set it to a very large value by default.
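
As a rough sketch of the "very large default" idea, assuming the threshold
is a size_t that can be overridden at start-up: the helper below falls back
to the default unless it is given a clean decimal number. The demo_ names
and the DEMO_MEMCPY_THRESH environment variable are made up for this
example and are not part of the patch or of DPDK.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical default: large enough that every copy stays inline. */
#define DEMO_THRESH_DEFAULT SIZE_MAX

/* Parse a decimal threshold; return dflt on any malformed input. */
static size_t demo_parse_thresh(const char *text, size_t dflt)
{
	char *end;
	unsigned long long v;

	/* Reject NULL, empty strings, signs and leading whitespace. */
	if (text == NULL || *text < '0' || *text > '9')
		return dflt;

	errno = 0;
	v = strtoull(text, &end, 10);
	if (errno != 0 || *end != '\0' || v > SIZE_MAX)
		return dflt;
	return (size_t)v;
}

/* Hypothetical override source: an environment variable read once. */
static size_t demo_thresh_from_env(void)
{
	return demo_parse_thresh(getenv("DEMO_MEMCPY_THRESH"),
				 DEMO_THRESH_DEFAULT);
}
```

With SIZE_MAX as the default, the run-time dispatch path is effectively off
until someone lowers the value on purpose.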

> >
> > Best Regards
> > Xiaoyun Li
> >
> >

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-25  8:54                                       ` Li, Xiaoyun
@ 2017-10-25  9:00                                         ` Thomas Monjalon
  2017-10-25 10:32                                           ` Li, Xiaoyun
  2017-10-25  9:14                                         ` Ananyev, Konstantin
  1 sibling, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-25  9:00 UTC (permalink / raw)
  To: Li, Xiaoyun
  Cc: Ananyev, Konstantin, Richardson, Bruce, dev, Lu, Wenzhuo, Zhang,
	Helin, ophirmu

25/10/2017 10:54, Li, Xiaoyun:
> > > > > > I think we should focus on micro-benchmark and find a
> > > > > > reasonnable threshold for a reasonnable drop tradeoff.
> > > > > >
> > > > > Sadly, it may not be that simple. What shows best performance for
> > > > > micro- benchmarks may not show the same effect in a real application.
> > > > >
> > > > > /Bruce
> > > >
> > > > Then how to measure the performance?
> > > >
> > > > And I cannot reproduce 15% drop on mellanox.
> > > > Could the person who tested 15% drop help to do test again with 1024
> > > > threshold and see if there is any improvement?
> > >
> > > As Bruce said, best performance on micro-benchmark may not show the
> > same effect in real applications.
> > > And I cannot reproduce the 15% drop.
> > > And I don't know if raising the threshold can improve the perf or not.
> > > Could the person who tested 15% drop help to do test again with 1024
> > threshold and see if there is any improvement?
> > 
> > As I already asked before - why not to make that threshold dynamic?
> > Konstantin
> > 
> I want to confirm that raising threshold is useful. Then can make it dynamic and set it very large as default.

You can confirm it with micro-benchmarks.
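
A minimal sketch of such a micro-benchmark check, assuming copy time is
used as the metric as in the numbers quoted above. It is not
test_memcpy_perf.c: the demo_ names are hypothetical, clock() is used
instead of cycle counters, and the buffers are small enough to stay in
cache, so it only shows the shape of the comparison.

```c
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Hypothetical names; this is not the DPDK test harness. */
typedef void *(*demo_copy_fn)(void *dst, const void *src, size_t n);

/*
 * CPU seconds spent on `iters` copies of `size` bytes.
 * Returns a negative value if `size` is 0 or above 8192.
 */
static double demo_time_copy(demo_copy_fn fn, size_t size, unsigned long iters)
{
	static unsigned char src[8192], dst[8192];
	volatile unsigned char sink = 0;
	clock_t start, stop;
	unsigned long i;

	if (size == 0 || size > sizeof(src))
		return -1.0;

	start = clock();
	for (i = 0; i < iters; i++) {
		src[0] = (unsigned char)i;  /* vary the input each round */
		fn(dst, src, size);
		sink ^= dst[size - 1];      /* keep the copy observable */
	}
	stop = clock();
	(void)sink;

	return (double)(stop - start) / CLOCKS_PER_SEC;
}

/* Extra copy time of `patched` over `base`, in percent of `base`. */
static double demo_drop_pct(double base, double patched)
{
	return base > 0.0 ? (patched - base) / base * 100.0 : 0.0;
}
```

Running demo_time_copy() for each threshold candidate over sizes such as
192, 256, 512 and 1024, and feeding the results to demo_drop_pct(), gives
the per-size drop that the 15% figure refers to.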

^ permalink raw reply	[flat|nested] 88+ messages in thread

* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-25  8:54                                       ` Li, Xiaoyun
  2017-10-25  9:00                                         ` Thomas Monjalon
@ 2017-10-25  9:14                                         ` Ananyev, Konstantin
  1 sibling, 0 replies; 88+ messages in thread
From: Ananyev, Konstantin @ 2017-10-25  9:14 UTC (permalink / raw)
  To: Li, Xiaoyun, Richardson, Bruce, Thomas Monjalon
  Cc: dev, Lu, Wenzhuo, Zhang, Helin, ophirmu



> -----Original Message-----
> From: Li, Xiaoyun
> Sent: Wednesday, October 25, 2017 9:54 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; Thomas Monjalon
> <thomas@monjalon.net>
> Cc: dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>; ophirmu@mellanox.com
> Subject: RE: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
> 
> 
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Wednesday, October 25, 2017 16:51
> > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > <bruce.richardson@intel.com>; Thomas Monjalon <thomas@monjalon.net>
> > Cc: dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > Subject: RE: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > memcpy
> >
> >
> >
> > > -----Original Message-----
> > > From: Li, Xiaoyun
> > > Sent: Wednesday, October 25, 2017 7:55 AM
> > > To: Li, Xiaoyun <xiaoyun.li@intel.com>; Richardson, Bruce
> > > <bruce.richardson@intel.com>; Thomas Monjalon <thomas@monjalon.net>
> > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; dev@dpdk.org;
> > > Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > > Subject: RE: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> > > memcpy
> > >
> > > Hi
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Li, Xiaoyun
> > > > Sent: Friday, October 20, 2017 09:03
> > > > To: Richardson, Bruce <bruce.richardson@intel.com>; Thomas Monjalon
> > > > <thomas@monjalon.net>
> > > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang, Helin
> > > > <helin.zhang@intel.com>; ophirmu@mellanox.com
> > > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch
> > > > over memcpy
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Richardson, Bruce
> > > > > Sent: Thursday, October 19, 2017 17:30
> > > > > To: Thomas Monjalon <thomas@monjalon.net>
> > > > > Cc: Li, Xiaoyun <xiaoyun.li@intel.com>; Ananyev, Konstantin
> > > > > <konstantin.ananyev@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> > > > > <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> > > > > ophirmu@mellanox.com
> > > > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch
> > > > > over memcpy
> > > > >
> > > > > On Thu, Oct 19, 2017 at 11:00:33AM +0200, Thomas Monjalon wrote:
> > > > > > 19/10/2017 10:50, Li, Xiaoyun:
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > Sent: Thursday, October 19, 2017 16:34
> > > > > > > > To: Li, Xiaoyun <xiaoyun.li@intel.com>
> > > > > > > > Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > > > > > > > Richardson, Bruce <bruce.richardson@intel.com>;
> > > > > > > > dev@dpdk.org; Lu, Wenzhuo <wenzhuo.lu@intel.com>; Zhang,
> > > > > > > > Helin <helin.zhang@intel.com>; ophirmu@mellanox.com
> > > > > > > > Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time
> > > > > > > > dispatch over memcpy
> > > > > > > >
> > > > > > > > 19/10/2017 09:51, Li, Xiaoyun:
> > > > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > > > > > > > Hi
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > The significant change of this patch is to call
> > > > > > > > > > > > > > a function pointer for packet size > 128
> > > > > (RTE_X86_MEMCPY_THRESH).
> > > > > > > > > > > > > The perf drop is due to function call replacing inline.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > > > > > > > I ran memcpy_perf_test which would show the time
> > > > > > > > > > > > > cost of memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > > > > > > > But I just draw pictures and looked at the trend
> > > > > > > > > > > > > not computed the exact percentage. Sorry about that.
> > > > > > > > > > > > > The picture shows results of copy size of 2, 4, 6,
> > > > > > > > > > > > > 8, 9, 12, 16, 32, 64, 128, 192, 256, 320, 384,
> > > > > > > > > > > > > 448, 512, 768, 1024, 1518, 1522, 1536, 1600, 2048,
> > > > > > > > > > > > > 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144,
> > > > > > > > > > > > > 6656, 7168,
> > > > > > > > > > > > 7680, 8192.
> > > > > > > > > > > > > In my test, the size grows, the drop degrades.
> > > > > > > > > > > > > (Using copy time indicates the
> > > > > > > > > > > > > perf.) From the trend picture, when the size is
> > > > > > > > > > > > > smaller than
> > > > > > > > > > > > > 128 bytes, the perf drops a lot, almost 50%. And
> > > > > > > > > > > > > above
> > > > > > > > > > > > > 128 bytes, it approaches the original dpdk.
> > > > > > > > > > > > > I computed it right now, it shows that when
> > > > > > > > > > > > > greater than
> > > > > > > > > > > > > 128 bytes and smaller than 1024 bytes, the perf
> > > > > > > > > > > > > drops about
> > > > > 15%.
> > > > > > > > > > > > > When above
> > > > > > > > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > From a test done at Mellanox, there might be a
> > > > > > > > > > > > > > performance degradation of about 15% in testpmd
> > > > > > > > > > > > > > txonly
> > > > > with AVX2.
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > I did tests on X710, XXV710, X540 and MT27710 but
> > > > > > > > > > > > > didn't see performance degradation.
> > > > > > > > > > > > >
> > > > > > > > > > > > > I used command
> > > > > > > > > > > > > "./x86_64-native-linuxapp-gcc/app/testpmd -c 0xf -n 4 -- -I"
> > > > > > > > > > > > > and set fwd txonly.
> > > > > > > > > > > > > I tested it on v17.11-rc1, then revert my patch and
> > > > > > > > > > > > > tested it again.
> > > > > > > > > > > > > Show port stats all and see the throughput pps. But
> > > > > > > > > > > > > the results are similar and no drop.
> > > > > > > > > > >
> > > > > > > > > > > Did I miss something?
> > > > > > > > > >
> > > > > > > > > > I do not understand. Yesterday you confirmed a 15% drop
> > > > > > > > > > with buffers between
> > > > > > > > > > 128 and 1024 bytes.
> > > > > > > > > > But you do not see this drop in your txonly tests, right?
> > > > > > > > > >
> > > > > > > > > Yes. The drop is using test.
> > > > > > > > > Using command "make test -j" and then " ./build/app/test -c f -n
> > 4 "
> > > > > > > > > Then run "memcpy_perf_autotest"
> > > > > > > > > The results are the cycles that memory copy costs.
> > > > > > > > > But I just use it to show the trend because I heard that
> > > > > > > > > it's not
> > > > > > > > recommended to use micro benchmarks like test_memcpy_perf
> > > > > > > > for memcpy performance report as they aren't likely able to
> > > > > > > > reflect performance of real world applications.
> > > > > > > >
> > > > > > > > Yes real applications can hide the memcpy cost.
> > > > > > > > Sometimes, the cost appear for real :)
> > > > > > > >
> > > > > > > > > Details can be seen at
> > > > > > > > > > > https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk
> > > > > > > > >
> > > > > > > > > And I didn't see drop in testpmd txonly test. Maybe it's
> > > > > > > > > because not a lot
> > > > > > > > memcpy calls.
> > > > > > > >
> > > > > > > > It has been seen in a mlx4 use-case using more memcpy.
> > > > > > > > I think 15% in micro-benchmark is too much.
> > > > > > > > What can we do? Raise the threshold?
> > > > > > > >
> > > > > > > I think so. If there is big drop, can try raise the threshold. Maybe
> > 1024?
> > > > > but not sure.
> > > > > > > But I didn't reproduce the 15% drop on mellanox and not sure
> > > > > > > how to
> > > > > verify it.
> > > > > >
> > > > > > I think we should focus on micro-benchmark and find a
> > > > > > reasonnable threshold for a reasonnable drop tradeoff.
> > > > > >
> > > > > Sadly, it may not be that simple. What shows best performance for
> > > > > micro- benchmarks may not show the same effect in a real application.
> > > > >
> > > > > /Bruce
> > > >
> > > > Then how to measure the performance?
> > > >
> > > > And I cannot reproduce 15% drop on mellanox.
> > > > Could the person who tested 15% drop help to do test again with 1024
> > > > threshold and see if there is any improvement?
> > >
> > > As Bruce said, best performance on micro-benchmark may not show the
> > same effect in real applications.
> > > And I cannot reproduce the 15% drop.
> > > And I don't know if raising the threshold can improve the perf or not.
> > > Could the person who tested 15% drop help to do test again with 1024
> > threshold and see if there is any improvement?
> >
> > As I already asked before - why not to make that threshold dynamic?
> > Konstantin
> >
> I want to confirm that raising threshold is useful. Then can make it dynamic and set it very large as default.

Ok.

> 
> > >
> > > Best Regards
> > > Xiaoyun Li
> > >
> > >


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-25  9:00                                         ` Thomas Monjalon
@ 2017-10-25 10:32                                           ` Li, Xiaoyun
  0 siblings, 0 replies; 88+ messages in thread
From: Li, Xiaoyun @ 2017-10-25 10:32 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Ananyev, Konstantin, Richardson, Bruce, dev, Lu, Wenzhuo, Zhang,
	Helin, ophirmu



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Wednesday, October 25, 2017 17:00
> To: Li, Xiaoyun <xiaoyun.li@intel.com>
> Cc: Ananyev, Konstantin <konstantin.ananyev@intel.com>; Richardson,
> Bruce <bruce.richardson@intel.com>; dev@dpdk.org; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> ophirmu@mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 25/10/2017 10:54, Li, Xiaoyun:
> > > > > > > I think we should focus on micro-benchmark and find a
> > > > > > > reasonnable threshold for a reasonnable drop tradeoff.
> > > > > > >
> > > > > > Sadly, it may not be that simple. What shows best performance
> > > > > > for
> > > > > > micro- benchmarks may not show the same effect in a real
> application.
> > > > > >
> > > > > > /Bruce
> > > > >
> > > > > Then how to measure the performance?
> > > > >
> > > > > And I cannot reproduce 15% drop on mellanox.
> > > > > Could the person who tested 15% drop help to do test again with
> > > > > 1024 threshold and see if there is any improvement?
> > > >
> > > > As Bruce said, best performance on micro-benchmark may not show
> > > > the
> > > same effect in real applications.
> > > > And I cannot reproduce the 15% drop.
> > > > And I don't know if raising the threshold can improve the perf or not.
> > > > Could the person who tested 15% drop help to do test again with
> > > > 1024
> > > threshold and see if there is any improvement?
> > >
> > > As I already asked before - why not to make that threshold dynamic?
> > > Konstantin
> > >
> > I want to confirm that raising threshold is useful. Then can make it dynamic
> and set it very large as default.
> 
> You can confirm it with micro-benchmarks.

I ran memcpy_perf_test with the threshold set to 1024.
For copies smaller than 1024 bytes, it costs 2~4 cycles more than the original.
For example, where the original takes 10 cycles it now takes 12, so the drop is 2/12=16%.
I don't know whether this kind of drop matters much or not.
Above 1024 bytes, the drop is about 4%, as I said before.
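
To make the arithmetic explicit, here is a minimal sketch of the two ways
the 10 -> 12 cycle example can be expressed. The helper names are made up
for illustration and are not part of the test code. The 2/12 figure is the
extra cost relative to the new cycle count; relative to the original count
the same change is 2/10.

```c
/* Extra cycles as a percentage of the ORIGINAL cost (2/10 in the example). */
static inline double slowdown_pct(double orig_cycles, double new_cycles)
{
	return (new_cycles - orig_cycles) / orig_cycles * 100.0;
}

/* Extra cycles as a percentage of the NEW cost (the 2/12 figure above). */
static inline double drop_pct(double orig_cycles, double new_cycles)
{
	return (new_cycles - orig_cycles) / new_cycles * 100.0;
}
```

With 10 and 12 cycles, drop_pct() gives about 16.7 and slowdown_pct() gives
20, so it is worth stating which base is used when comparing numbers.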

/Xiaoyun


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-25  7:25                                     ` Thomas Monjalon
@ 2017-10-29  8:49                                       ` Thomas Monjalon
  2017-11-02 10:22                                         ` Wang, Zhihong
  0 siblings, 1 reply; 88+ messages in thread
From: Thomas Monjalon @ 2017-10-29  8:49 UTC (permalink / raw)
  To: Li, Xiaoyun
  Cc: dev, Richardson, Bruce, Ananyev, Konstantin, Lu, Wenzhuo, Zhang,
	Helin, ophirmu

25/10/2017 09:25, Thomas Monjalon:
> 25/10/2017 08:55, Li, Xiaoyun:
> > From: Li, Xiaoyun
> > > From: Richardson, Bruce
> > > > On Thu, Oct 19, 2017 at 11:00:33AM +0200, Thomas Monjalon wrote:
> > > > > 19/10/2017 10:50, Li, Xiaoyun:
> > > > > > From: Thomas Monjalon
> > > > > > > 19/10/2017 09:51, Li, Xiaoyun:
> > > > > > > > From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > > > > > > > > 19/10/2017 04:45, Li, Xiaoyun:
> > > > > > > > > > Hi
> > > > > > > > > > > > >
> > > > > > > > > > > > > The significant change of this patch is to call a
> > > > > > > > > > > > > function pointer for packet size > 128
> > > > (RTE_X86_MEMCPY_THRESH).
> > > > > > > > > > > > The perf drop is due to function call replacing inline.
> > > > > > > > > > > >
> > > > > > > > > > > > > Please could you provide some benchmark numbers?
> > > > > > > > > > > > I ran memcpy_perf_test which would show the time cost
> > > > > > > > > > > > of memcpy. I ran it on broadwell with sse and avx2.
> > > > > > > > > > > > But I just draw pictures and looked at the trend not
> > > > > > > > > > > > computed the exact percentage. Sorry about that.
> > > > > > > > > > > > The picture shows results of copy size of 2, 4, 6, 8,
> > > > > > > > > > > > 9, 12, 16, 32, 64, 128, 192, 256, 320, 384, 448, 512,
> > > > > > > > > > > > 768, 1024, 1518, 1522, 1536, 1600, 2048, 2560, 3072,
> > > > > > > > > > > > 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168,
> > > > > > > > > > > 7680, 8192.
> > > > > > > > > > > > In my test, the size grows, the drop degrades. (Using
> > > > > > > > > > > > copy time indicates the
> > > > > > > > > > > > perf.) From the trend picture, when the size is
> > > > > > > > > > > > smaller than
> > > > > > > > > > > > 128 bytes, the perf drops a lot, almost 50%. And above
> > > > > > > > > > > > 128 bytes, it approaches the original dpdk.
> > > > > > > > > > > > I computed it right now, it shows that when greater
> > > > > > > > > > > > than
> > > > > > > > > > > > 128 bytes and smaller than 1024 bytes, the perf drops
> > > > > > > > > > > > about
> > > > 15%.
> > > > > > > > > > > > When above
> > > > > > > > > > > > 1024 bytes, the perf drops about 4%.
> > > > > > > > > > > >
> > > > > > > > > > > > > From a test done at Mellanox, there might be a
> > > > > > > > > > > > > performance degradation of about 15% in testpmd
> > > > > > > > > > > > > txonly
> > > > with AVX2.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I did tests on X710, XXV710, X540 and MT27710 but didn't
> > > > > > > > > > see
> > > > > > > > > performance degradation.
> > > > > > > > > >
> > > > > > > > > > I used command "./x86_64-native-linuxapp-gcc/app/testpmd
> > > > > > > > > > -c 0xf -n
> > > > > > > > > > 4 -- -
> > > > > > > > > I" and set fwd txonly.
> > > > > > > > > > I tested it on v17.11-rc1, then revert my patch and tested it
> > > again.
> > > > > > > > > > Show port stats all and see the throughput pps. But the
> > > > > > > > > > results are similar
> > > > > > > > > and no drop.
> > > > > > > > > >
> > > > > > > > > > Did I miss something?
> > > > > > > > >
> > > > > > > > > I do not understand. Yesterday you confirmed a 15% drop with
> > > > > > > > > buffers between
> > > > > > > > > 128 and 1024 bytes.
> > > > > > > > > But you do not see this drop in your txonly tests, right?
> > > > > > > > >
> > > > > > > > Yes. The drop is using test.
> > > > > > > > Using command "make test -j" and then " ./build/app/test -c f -n 4 "
> > > > > > > > Then run "memcpy_perf_autotest"
> > > > > > > > The results are the cycles that memory copy costs.
> > > > > > > > But I just use it to show the trend because I heard that it's
> > > > > > > > not
> > > > > > > recommended to use micro benchmarks like test_memcpy_perf for
> > > > > > > memcpy performance report as they aren't likely able to reflect
> > > > > > > performance of real world applications.
> > > > > > >
> > > > > > > Yes real applications can hide the memcpy cost.
> > > > > > > Sometimes, the cost appear for real :)
> > > > > > >
> > > > > > > > Details can be seen at
> > > > > > > > https://software.intel.com/en-us/articles/performance-optimiza
> > > > > > > > ti
> > > > > > > > on-of-
> > > > > > > > memcpy-in-dpdk
> > > > > > > >
> > > > > > > > And I didn't see drop in testpmd txonly test. Maybe it's
> > > > > > > > because not a lot
> > > > > > > memcpy calls.
> > > > > > >
> > > > > > > It has been seen in a mlx4 use-case using more memcpy.
> > > > > > > I think 15% in micro-benchmark is too much.
> > > > > > > What can we do? Raise the threshold?
> > > > > > >
> > > > > > I think so. If there is big drop, can try raise the threshold. Maybe 1024?
> > > > but not sure.
> > > > > > But I didn't reproduce the 15% drop on mellanox and not sure how
> > > > > > to
> > > > verify it.
> > > > >
> > > > > I think we should focus on micro-benchmark and find a reasonnable
> > > > > threshold for a reasonnable drop tradeoff.
> > > > >
> > > > Sadly, it may not be that simple. What shows best performance for
> > > > micro- benchmarks may not show the same effect in a real application.
> > > >
> > > > /Bruce
> > > 
> > > Then how to measure the performance?
> > > 
> > > And I cannot reproduce 15% drop on mellanox.
> > > Could the person who tested 15% drop help to do test again with 1024
> > > threshold and see if there is any improvement?
> > 
> > As Bruce said, best performance on micro-benchmark may not show the same effect in real applications.
> 
> Yes real applications may hide the impact.
> You keep saying that it is a reason to allow degrading memcpy raw perf.
> But can you see better performance with buffers of 256 bytes with
> any application thanks to your patch?
> I am not sure whether there is a benefit keeping a code which imply
> a signicative drop in micro-benchmarks.
> 
> > And I cannot reproduce the 15% drop.
> > And I don't know if raising the threshold can improve the perf or not.
> > Could the person who tested 15% drop help to do test again with 1024 threshold and see if there is any improvement?
> 
> We will test a increased threshold today.

Sorry, I forgot to update.

It seems that increasing the threshold from 128 to 1024 has no impact.
We can recover the 15% drop only by reverting the patch.

I don't know what is creating this drop exactly.
When doing different tests on different environments, we do not see this drop.
If nobody else can see such an issue, I guess we can ignore it.
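
For reference, the dispatch being measured has roughly the following shape:
small copies stay inline, larger ones go through a function pointer. This is
only a sketch under assumptions, not the code from the patch. All names
(MEMCPY_THRESH, copy_large, copy_ptr, copy_dispatch, indirect_calls) are
hypothetical, and plain memcpy() stands in for the ISA-specific routines.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for RTE_X86_MEMCPY_THRESH. */
#define MEMCPY_THRESH 128

/* Counts how often the indirect path is taken (illustration only). */
static unsigned int indirect_calls;

/* Stand-in for the ISA-specific routine that would be selected at start-up. */
static void *copy_large(void *dst, const void *src, size_t n)
{
	indirect_calls++;
	return memcpy(dst, src, n);
}

/* In the patch this pointer is bound in a constructor; fixed here. */
static void *(*copy_ptr)(void *, const void *, size_t) = copy_large;

/* Inline path for small sizes: no function call overhead. */
static inline void *copy_small_inline(void *dst, const void *src, size_t n)
{
	unsigned char *d = dst;
	const unsigned char *s = src;

	while (n--)
		*d++ = *s++;
	return dst;
}

/* Sizes up to the threshold stay inline; larger ones pay an indirect call. */
static inline void *copy_dispatch(void *dst, const void *src, size_t n)
{
	if (n <= MEMCPY_THRESH)
		return copy_small_inline(dst, src, n);
	return (*copy_ptr)(dst, src, n);
}
```

Raising MEMCPY_THRESH only moves the point where the indirect call starts.
It does not change what the inline path itself compiles to.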


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-10-29  8:49                                       ` Thomas Monjalon
@ 2017-11-02 10:22                                         ` Wang, Zhihong
  2017-11-02 10:44                                           ` Thomas Monjalon
  0 siblings, 1 reply; 88+ messages in thread
From: Wang, Zhihong @ 2017-11-02 10:22 UTC (permalink / raw)
  To: Thomas Monjalon, Li, Xiaoyun
  Cc: dev, Richardson, Bruce, Ananyev, Konstantin, Lu, Wenzhuo, Zhang,
	Helin, ophirmu

> I don't know what is creating this drop exactly.
> When doing different tests on different environments, we do not see this
> drop.
> If nobody else can see such issue, I guess we can ignore it.

Hi Thomas, Xiaoyun,

With this patch (commit 84cc318424d49372dd2a5fbf3cf84426bf95acce) I see
a performance drop of more than 20% in the vhost loopback test with testpmd
macswap for 256-byte packets, which means it impacts actual vSwitching
performance.

I suggest we fix it or revert it for this release.

Thanks
Zhihong


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-11-02 10:22                                         ` Wang, Zhihong
@ 2017-11-02 10:44                                           ` Thomas Monjalon
  2017-11-02 10:58                                             ` Li, Xiaoyun
  2017-11-03  7:47                                             ` Yao, Lei A
  0 siblings, 2 replies; 88+ messages in thread
From: Thomas Monjalon @ 2017-11-02 10:44 UTC (permalink / raw)
  To: Wang, Zhihong, Li, Xiaoyun
  Cc: dev, Richardson, Bruce, Ananyev, Konstantin, Lu, Wenzhuo, Zhang,
	Helin, ophirmu

02/11/2017 11:22, Wang, Zhihong:
> > I don't know what is creating this drop exactly.
> > When doing different tests on different environments, we do not see this
> > drop.
> > If nobody else can see such issue, I guess we can ignore it.
> 
> Hi Thomas, Xiaoyun,
> 
> With this patch (commit 84cc318424d49372dd2a5fbf3cf84426bf95acce) I see
> more than 20% performance drop in vhost loopback test with testpmd
> macswap for 256 bytes packets, which means it impacts actual vSwitching
> performance.
> 
> Suggest we fix it or revert it for this release.

I think we need more numbers to make a decision.
What is the benefit of this patch? In which use-cases?
What are the drawbacks? In which use-cases?

Please take this as a call to test performance with and without this patch
in more environments (CPU, packet size, applications).


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-11-02 10:44                                           ` Thomas Monjalon
@ 2017-11-02 10:58                                             ` Li, Xiaoyun
  2017-11-02 12:15                                               ` Thomas Monjalon
  2017-11-03  7:47                                             ` Yao, Lei A
  1 sibling, 1 reply; 88+ messages in thread
From: Li, Xiaoyun @ 2017-11-02 10:58 UTC (permalink / raw)
  To: Thomas Monjalon, Wang, Zhihong
  Cc: dev, Richardson, Bruce, Ananyev, Konstantin, Lu, Wenzhuo, Zhang,
	Helin, ophirmu



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Thursday, November 2, 2017 18:45
> To: Wang, Zhihong <zhihong.wang@intel.com>; Li, Xiaoyun
> <xiaoyun.li@intel.com>
> Cc: dev@dpdk.org; Richardson, Bruce <bruce.richardson@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> ophirmu@mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 02/11/2017 11:22, Wang, Zhihong:
> > > I don't know what is creating this drop exactly.
> > > When doing different tests on different environments, we do not see
> > > this drop.
> > > If nobody else can see such issue, I guess we can ignore it.
> >
> > Hi Thomas, Xiaoyun,
> >
> > With this patch (commit 84cc318424d49372dd2a5fbf3cf84426bf95acce) I
> > see more than 20% performance drop in vhost loopback test with testpmd
> > macswap for 256 bytes packets, which means it impacts actual
> > vSwitching performance.
> >
> > Suggest we fix it or revert it for this release.
> 
> I think we need more numbers to take a decision.
> What is the benefit of this patch? In which use-cases?

The benefit is that a binary compiled for a lower target (such as one that only supports SSE)
can still get the ISA benefit (such as AVX2) when it runs on a higher platform (one with AVX2 or AVX512).
The use case seems to be that some customers want this in cloud environments and don't want to compile for every platform.
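
As a rough illustration of the idea (not the patch itself), the sketch
below binds a function pointer once at constructor time based on what the
running CPU reports. It assumes GCC or clang on x86 and uses the compiler
builtins __builtin_cpu_init() and __builtin_cpu_supports() in place of
rte_cpu_get_flag_enabled(). The copy_* names and copy_impl are hypothetical,
and each variant just wraps memcpy() here; in a real build each would be
compiled with its own -m flags.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical per-ISA variants; all wrap memcpy() for illustration. */
static void *copy_sse(void *d, const void *s, size_t n)
{
	return memcpy(d, s, n);
}

static void *copy_avx2(void *d, const void *s, size_t n)
{
	return memcpy(d, s, n);
}

static void *copy_avx512f(void *d, const void *s, size_t n)
{
	return memcpy(d, s, n);
}

/* Default to the baseline variant so the pointer is never NULL. */
static void *(*copy_ptr)(void *, const void *, size_t) = copy_sse;
static const char *copy_impl = "sse";

/* Runs before main(): pick the best variant the running CPU supports. */
__attribute__((constructor))
static void copy_init(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f")) {
		copy_ptr = copy_avx512f;
		copy_impl = "avx512f";
		return;
	}
	if (__builtin_cpu_supports("avx2")) {
		copy_ptr = copy_avx2;
		copy_impl = "avx2";
		return;
	}
#endif
	/* Otherwise keep the baseline default set above. */
}
```

Callers then use (*copy_ptr)(dst, src, n), so the same binary picks a
different variant depending on the machine it is started on.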

> What are the drawbacks? In which use-cases?

The drawback is the perf drop. So far, a large drop has been seen in the Mellanox case and the vhost case.

Should I send the revert patch, or will you revert it directly?

> 
> Please, it is a call to test performance with and without this patch in more
> environments (CPU, packet size, applications).


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-11-02 10:58                                             ` Li, Xiaoyun
@ 2017-11-02 12:15                                               ` Thomas Monjalon
  0 siblings, 0 replies; 88+ messages in thread
From: Thomas Monjalon @ 2017-11-02 12:15 UTC (permalink / raw)
  To: Li, Xiaoyun
  Cc: Wang, Zhihong, dev, Richardson, Bruce, Ananyev, Konstantin, Lu,
	Wenzhuo, Zhang, Helin, ophirmu

02/11/2017 11:58, Li, Xiaoyun:
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> > 02/11/2017 11:22, Wang, Zhihong:
> > > > I don't know what is creating this drop exactly.
> > > > When doing different tests on different environments, we do not see
> > > > this drop.
> > > > If nobody else can see such issue, I guess we can ignore it.
> > >
> > > Hi Thomas, Xiaoyun,
> > >
> > > With this patch (commit 84cc318424d49372dd2a5fbf3cf84426bf95acce) I
> > > see more than 20% performance drop in vhost loopback test with testpmd
> > > macswap for 256 bytes packets, which means it impacts actual
> > > vSwitching performance.
> > >
> > > Suggest we fix it or revert it for this release.
> > 
> > I think we need more numbers to take a decision.
> > What is the benefit of this patch? In which use-cases?
> 
>  The benefit is that if compile it on a lower platform (such as only supports SSE),
> when it run on higher platforms (such as AVX2 or AVX512). It would still can get ISA benefit (AVX2).

Yes, but you don't provide any numbers here.

> User case seems to be that some customers want it in cloud environment and don't want to compile on all platforms.
> 
> > What are the drawbacks? In which use-cases?
> 
> The drawback is perf drop. So far, see lot of drop in mellanox case and vhost case.
> 
> Should I send the revert patch or you revert it directly?

You should send the revert yourself with a good justification.
I did not ask for numbers when accepting the patch (my mistake).
Please provide the numbers for the revert.

> > Please, it is a call to test performance with and without this patch in more
> > environments (CPU, packet size, applications).

Who can test it in more environments?


* Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy
  2017-11-02 10:44                                           ` Thomas Monjalon
  2017-11-02 10:58                                             ` Li, Xiaoyun
@ 2017-11-03  7:47                                             ` Yao, Lei A
  1 sibling, 0 replies; 88+ messages in thread
From: Yao, Lei A @ 2017-11-03  7:47 UTC (permalink / raw)
  To: Thomas Monjalon, Wang, Zhihong, Li, Xiaoyun
  Cc: dev, Richardson, Bruce, Ananyev, Konstantin, Lu, Wenzhuo, Zhang,
	Helin, ophirmu

Hi, Thomas

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Thomas Monjalon
> Sent: Thursday, November 2, 2017 6:45 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>; Li, Xiaoyun
> <xiaoyun.li@intel.com>
> Cc: dev@dpdk.org; Richardson, Bruce <bruce.richardson@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>; Lu, Wenzhuo
> <wenzhuo.lu@intel.com>; Zhang, Helin <helin.zhang@intel.com>;
> ophirmu@mellanox.com
> Subject: Re: [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over
> memcpy
> 
> 02/11/2017 11:22, Wang, Zhihong:
> > > I don't know what is creating this drop exactly.
> > > When doing different tests on different environments, we do not see
> this
> > > drop.
> > > If nobody else can see such issue, I guess we can ignore it.
> >
> > Hi Thomas, Xiaoyun,
> >
> > With this patch (commit 84cc318424d49372dd2a5fbf3cf84426bf95acce) I see
> > more than 20% performance drop in vhost loopback test with testpmd
> > macswap for 256 bytes packets, which means it impacts actual vSwitching
> > performance.
> >
> > Suggest we fix it or revert it for this release.
> 
> I think we need more numbers to take a decision.
> What is the benefit of this patch? In which use-cases?
> What are the drawbacks? In which use-cases?
> 
> Please, it is a call to test performance with and without this patch
> in more environments (CPU, packet size, applications).

The following is the performance drop we observe in the vhost/virtio
loopback test with this patch, compared to without it.
Test application: testpmd
CPU info: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz
OS: Ubuntu 16.04

Mergeable Path
packet size     Performance Drop
64              -1.30%
128             0.81%
158             -19.17%
188             -19.18%
218             -16.29%
230             -16.57%
256             -16.77%
280             -3.07%
300             -3.22%
380             -2.44%
420             -1.65%
512             -0.99%
1024            0.00%
1518            -0.68%

Vector Path
packet size     Performance Drop
64              3.30%
128             7.18%
256             -12.77%
512             -0.98%
1024            0.27%
1518            0.68%


* Re: [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy
       [not found] <506411689-94690-2-git-send-email-xiaoyun.li@intel.com>
@ 2017-10-02 12:31 ` Konstantin Ananyev
  0 siblings, 0 replies; 88+ messages in thread
From: Konstantin Ananyev @ 2017-10-02 12:31 UTC (permalink / raw)
  To: dev, xiaoyun.li, konstantin.ananyev; +Cc: Konstantin Ananyev

Hi Xiaoyun,
Just to be a bit more specific about what I suggest -
here is a draft patch below.
It still needs more testing and probably polishing,
but I suppose gives you an idea.
Konstantin


---
 lib/librte_eal/bsdapp/eal/Makefile                 |  20 +
 lib/librte_eal/common/arch/x86/rte_memcpy.c        |  58 ++
 lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c   |  44 +
 .../common/arch/x86/rte_memcpy_avx512f.c           |  44 +
 lib/librte_eal/common/arch/x86/rte_memcpy_sse.c    |  40 +
 .../common/include/arch/x86/rte_memcpy.h           | 854 +------------------
 .../common/include/arch/x86/rte_memcpy_internal.h  | 904 +++++++++++++++++++++
 lib/librte_eal/linuxapp/eal/Makefile               |  20 +
 8 files changed, 1149 insertions(+), 835 deletions(-)
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy.c
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy_avx512f.c
 create mode 100644 lib/librte_eal/common/arch/x86/rte_memcpy_sse.c
 create mode 100644 lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h

diff --git a/lib/librte_eal/bsdapp/eal/Makefile b/lib/librte_eal/bsdapp/eal/Makefile
index 005019e..32d025b 100644
--- a/lib/librte_eal/bsdapp/eal/Makefile
+++ b/lib/librte_eal/bsdapp/eal/Makefile
@@ -93,6 +93,26 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_BSDAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+#memcpy dynamic stuff
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+CC_SUPPORT_AVX2 := $(shell $(CC) -march=core-avx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
+ifeq ($(CC_SUPPORT_AVX2),1)
+CFLAGS_rte_memcpy.o += -DCC_SUPPORT_AVX2
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
+CC_SUPPORT_AVX512F := $(shell $(CC) -mavx512f -dM -E - < /dev/null 2>&1 | grep -q AVX512F && echo 1)
+ifeq ($(CC_SUPPORT_AVX512F),1)
+CFLAGS_rte_memcpy.o += -DCC_SUPPORT_AVX512F
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
diff --git a/lib/librte_eal/common/arch/x86/rte_memcpy.c b/lib/librte_eal/common/arch/x86/rte_memcpy.c
new file mode 100644
index 0000000..9feb2b5
--- /dev/null
+++ b/lib/librte_eal/common/arch/x86/rte_memcpy.c
@@ -0,0 +1,58 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+#include <rte_cpuflags.h>
+
+void *(*rte_memcpy_ptr)(void *dst, const void *src, size_t n) = NULL;
+
+static void __attribute__((constructor))
+rte_memcpy_init(void)
+{
+#ifdef CC_SUPPORT_AVX512F
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX512F)) {
+		rte_memcpy_ptr = rte_memcpy_avx512f;
+		printf("%s: AVX512 is using!\n", __func__);
+		return;
+	}
+#endif
+#ifdef CC_SUPPORT_AVX2
+	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2)) {
+		rte_memcpy_ptr = rte_memcpy_avx2;
+		printf("%s: AVX2 is using!\n", __func__);
+		return;
+	}
+#endif
+	rte_memcpy_ptr = rte_memcpy_sse;
+	printf("%s:Default SSE/AVX is using!\n", __func__);
+}
diff --git a/lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c b/lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c
new file mode 100644
index 0000000..3ad229c
--- /dev/null
+++ b/lib/librte_eal/common/arch/x86/rte_memcpy_avx2.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX2
+#error RTE_MACHINE_CPUFLAG_AVX2 not defined
+#endif
+
+void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/arch/x86/rte_memcpy_avx512f.c b/lib/librte_eal/common/arch/x86/rte_memcpy_avx512f.c
new file mode 100644
index 0000000..be8d964
--- /dev/null
+++ b/lib/librte_eal/common/arch/x86/rte_memcpy_avx512f.c
@@ -0,0 +1,44 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+#ifndef RTE_MACHINE_CPUFLAG_AVX512F
+#error RTE_MACHINE_CPUFLAG_AVX512F not defined
+#endif
+
+void *
+rte_memcpy_avx512f(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
diff --git a/lib/librte_eal/common/arch/x86/rte_memcpy_sse.c b/lib/librte_eal/common/arch/x86/rte_memcpy_sse.c
new file mode 100644
index 0000000..55d6b41
--- /dev/null
+++ b/lib/librte_eal/common/arch/x86/rte_memcpy_sse.c
@@ -0,0 +1,40 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#include <rte_memcpy.h>
+
+void *
+rte_memcpy_sse(void *dst, const void *src, size_t n)
+{
+	return rte_memcpy_internal(dst, src, n);
+}
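Note on the per-ISA source files above: rte_memcpy_sse.c and rte_memcpy_avx512f.c both wrap the same rte_memcpy_internal(), so they presumably differ only in the instruction set each file is compiled for. The build rules are not part of this section, so treat that as an assumption. A minimal single-file sketch of the same idea, using the GCC/clang target attribute instead of per-file compiler flags; copy_core, copy_baseline and copy_avx2 are hypothetical names, not part of the patch:

```c
#include <stddef.h>
#include <stdint.h>

/* Shared body: plain C, so the compiler is free to pick instructions
 * according to the target of whichever function it is inlined into. */
static inline void *
copy_core(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;
	size_t i;

	for (i = 0; i < n; i++)
		d[i] = s[i];
	return dst;
}

/* Baseline variant: built for the default (minimum) target. */
void *
copy_baseline(void *dst, const void *src, size_t n)
{
	return copy_core(dst, src, n);
}

#if defined(__x86_64__) || defined(__i386__)
/* AVX2 variant: same body, but the compiler may emit AVX2 instructions
 * here. Call it only after checking that the running CPU supports AVX2. */
__attribute__((target("avx2")))
void *
copy_avx2(void *dst, const void *src, size_t n)
{
	return copy_core(dst, src, n);
}
#endif
```

Both variants have the same signature, so either can be stored in a single function pointer and chosen once at start-up.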
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 74c280c..9856d29 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -1,7 +1,7 @@
 /*-
  *   BSD LICENSE
  *
- *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   Copyright(c) 2010-2017 Intel Corporation. All rights reserved.
  *   All rights reserved.
  *
  *   Redistribution and use in source and binary forms, with or without
@@ -36,20 +36,27 @@
 
 /**
  * @file
- *
- * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
  */
 
-#include <stdio.h>
-#include <stdint.h>
-#include <string.h>
-#include <rte_vect.h>
-#include <rte_common.h>
+#include <rte_memcpy_internal.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
+#define RTE_X86_MEMCPY_THRESH 128
+
+extern void *(*rte_memcpy_ptr)(void *dst, const void *src, size_t n);
+
+extern void *
+rte_memcpy_sse(void *dst, const void *src, size_t n);
+
+extern void *
+rte_memcpy_avx2(void *dst, const void *src, size_t n);
+
+extern void *
+rte_memcpy_avx512f(void *dst, const void *src, size_t n);
+
 /**
  * Copy bytes from one location to another. The locations must not overlap.
  *
@@ -65,840 +72,17 @@ extern "C" {
  * @return
  *   Pointer to the destination data.
  */
-static __rte_always_inline void *
-rte_memcpy(void *dst, const void *src, size_t n);
-
-#ifdef RTE_MACHINE_CPUFLAG_AVX512F
-
-#define ALIGNMENT_MASK 0x3F
-
-/**
- * AVX512 implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
-
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	__m512i zmm0;
-
-	zmm0 = _mm512_loadu_si512((const void *)src);
-	_mm512_storeu_si512((void *)dst, zmm0);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov64(dst + 0 * 64, src + 0 * 64);
-	rte_mov64(dst + 1 * 64, src + 1 * 64);
-	rte_mov64(dst + 2 * 64, src + 2 * 64);
-	rte_mov64(dst + 3 * 64, src + 3 * 64);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1;
-
-	while (n >= 128) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 128;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		src = src + 128;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		dst = dst + 128;
-	}
-}
-
-/**
- * Copy 512-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
-
-	while (n >= 512) {
-		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
-		n -= 512;
-		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
-		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
-		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
-		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
-		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
-		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
-		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
-		src = src + 512;
-		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
-		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
-		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
-		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
-		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
-		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
-		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
-		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
-		dst = dst + 512;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08)
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				  (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				  (const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK63:
-		if (n > 64) {
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-			return ret;
-		}
-		if (n > 0)
-			rte_mov64((uint8_t *)dst - 64 + n,
-					  (const uint8_t *)src - 64 + n);
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes
-	 */
-	dstofss = ((uintptr_t)dst & 0x3F);
-	if (dstofss > 0) {
-		dstofss = 64 - dstofss;
-		n -= dstofss;
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 512-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 511;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy 128-byte blocks.
-	 * Use copy block function for better instruction order control,
-	 * which is important when load is unaligned.
-	 */
-	if (n >= 128) {
-		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-		bits = n;
-		n = n & 127;
-		bits -= n;
-		src = (const uint8_t *)src + bits;
-		dst = (uint8_t *)dst + bits;
-	}
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK63;
-}
-
-#elif defined RTE_MACHINE_CPUFLAG_AVX2
-
-#define ALIGNMENT_MASK 0x1F
-
-/**
- * AVX2 implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	__m256i ymm0;
-
-	ymm0 = _mm256_loadu_si256((const __m256i *)src);
-	_mm256_storeu_si256((__m256i *)dst, ymm0);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
-	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
-	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
-	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
-}
-
-/**
- * Copy 128-byte blocks from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
-{
-	__m256i ymm0, ymm1, ymm2, ymm3;
-
-	while (n >= 128) {
-		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
-		n -= 128;
-		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
-		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
-		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
-		src = (const uint8_t *)src + 128;
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
-		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
-		dst = (uint8_t *)dst + 128;
-	}
-}
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t bits;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 256 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-		return ret;
-	}
-	if (n <= 256) {
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK31:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-		if (n > 32) {
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov32((uint8_t *)dst - 32 + n,
-					(const uint8_t *)src - 32 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 256 bytes
-	 */
-	dstofss = (uintptr_t)dst & 0x1F;
-	if (dstofss > 0) {
-		dstofss = 32 - dstofss;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-
-	/**
-	 * Copy 128-byte blocks
-	 */
-	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
-	bits = n;
-	n = n & 127;
-	bits -= n;
-	src = (const uint8_t *)src + bits;
-	dst = (uint8_t *)dst + bits;
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_128_BACK31;
-}
-
-#else /* RTE_MACHINE_CPUFLAG */
-
-#define ALIGNMENT_MASK 0x0F
-
-/**
- * SSE & AVX implementation below
- */
-
-/**
- * Copy 16 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov16(uint8_t *dst, const uint8_t *src)
-{
-	__m128i xmm0;
-
-	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
-	_mm_storeu_si128((__m128i *)dst, xmm0);
-}
-
-/**
- * Copy 32 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov32(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-}
-
-/**
- * Copy 64 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov64(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-}
-
-/**
- * Copy 128 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov128(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-}
-
-/**
- * Copy 256 bytes from one location to another,
- * locations should not overlap.
- */
-static inline void
-rte_mov256(uint8_t *dst, const uint8_t *src)
-{
-	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
-	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
-	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
-	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
-	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
-	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
-	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
-	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
-	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
-	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
-	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
-	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
-	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
-	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
-	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
-	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
-}
-
-/**
- * Macro for copying unaligned block from one location to another with constant load offset,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be immediate value within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
-__extension__ ({                                                                                            \
-    int tmp;                                                                                                \
-    while (len >= 128 + 16 - offset) {                                                                      \
-        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
-        len -= 128;                                                                                         \
-        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
-        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
-        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
-        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
-        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
-        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
-        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
-        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
-        src = (const uint8_t *)src + 128;                                                                   \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
-        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
-        dst = (uint8_t *)dst + 128;                                                                         \
-    }                                                                                                       \
-    tmp = len;                                                                                              \
-    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
-    tmp -= len;                                                                                             \
-    src = (const uint8_t *)src + tmp;                                                                       \
-    dst = (uint8_t *)dst + tmp;                                                                             \
-    if (len >= 32 + 16 - offset) {                                                                          \
-        while (len >= 32 + 16 - offset) {                                                                   \
-            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
-            len -= 32;                                                                                      \
-            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
-            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
-            src = (const uint8_t *)src + 32;                                                                \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
-            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
-            dst = (uint8_t *)dst + 32;                                                                      \
-        }                                                                                                   \
-        tmp = len;                                                                                          \
-        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
-        tmp -= len;                                                                                         \
-        src = (const uint8_t *)src + tmp;                                                                   \
-        dst = (uint8_t *)dst + tmp;                                                                         \
-    }                                                                                                       \
-})
-
-/**
- * Macro for copying unaligned block from one location to another,
- * 47 bytes leftover maximum,
- * locations should not overlap.
- * Use switch here because the aligning instruction requires immediate value for shift count.
- * Requirements:
- * - Store is aligned
- * - Load offset is <offset>, which must be within [1, 15]
- * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
- * - <dst>, <src>, <len> must be variables
- * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
- */
-#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
-__extension__ ({                                                      \
-    switch (offset) {                                                 \
-    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
-    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
-    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
-    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
-    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
-    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
-    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
-    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
-    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
-    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
-    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
-    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
-    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
-    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
-    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
-    default:;                                                         \
-    }                                                                 \
-})
-
-static inline void *
-rte_memcpy_generic(void *dst, const void *src, size_t n)
-{
-	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
-	uintptr_t dstu = (uintptr_t)dst;
-	uintptr_t srcu = (uintptr_t)src;
-	void *ret = dst;
-	size_t dstofss;
-	size_t srcofs;
-
-	/**
-	 * Copy less than 16 bytes
-	 */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dstu = *(const uint8_t *)srcu;
-			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
-			dstu = (uintptr_t)((uint8_t *)dstu + 1);
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dstu = *(const uint16_t *)srcu;
-			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
-			dstu = (uintptr_t)((uint16_t *)dstu + 1);
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dstu = *(const uint32_t *)srcu;
-			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
-			dstu = (uintptr_t)((uint32_t *)dstu + 1);
-		}
-		if (n & 0x08) {
-			*(uint64_t *)dstu = *(const uint64_t *)srcu;
-		}
-		return ret;
-	}
-
-	/**
-	 * Fast way when copy size doesn't exceed 512 bytes
-	 */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 48) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
-		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		return ret;
-	}
-	if (n <= 128) {
-		goto COPY_BLOCK_128_BACK15;
-	}
-	if (n <= 512) {
-		if (n >= 256) {
-			n -= 256;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
-			src = (const uint8_t *)src + 256;
-			dst = (uint8_t *)dst + 256;
-		}
-COPY_BLOCK_255_BACK15:
-		if (n >= 128) {
-			n -= 128;
-			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 128;
-			dst = (uint8_t *)dst + 128;
-		}
-COPY_BLOCK_128_BACK15:
-		if (n >= 64) {
-			n -= 64;
-			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 64;
-			dst = (uint8_t *)dst + 64;
-		}
-COPY_BLOCK_64_BACK15:
-		if (n >= 32) {
-			n -= 32;
-			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-			src = (const uint8_t *)src + 32;
-			dst = (uint8_t *)dst + 32;
-		}
-		if (n > 16) {
-			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-			return ret;
-		}
-		if (n > 0) {
-			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
-		}
-		return ret;
-	}
-
-	/**
-	 * Make store aligned when copy size exceeds 512 bytes,
-	 * and make sure the first 15 bytes are copied, because
-	 * unaligned copy functions require up to 15 bytes
-	 * backwards access.
-	 */
-	dstofss = (uintptr_t)dst & 0x0F;
-	if (dstofss > 0) {
-		dstofss = 16 - dstofss + 16;
-		n -= dstofss;
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		src = (const uint8_t *)src + dstofss;
-		dst = (uint8_t *)dst + dstofss;
-	}
-	srcofs = ((uintptr_t)src & 0x0F);
-
-	/**
-	 * For aligned copy
-	 */
-	if (srcofs == 0) {
-		/**
-		 * Copy 256-byte blocks
-		 */
-		for (; n >= 256; n -= 256) {
-			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-			dst = (uint8_t *)dst + 256;
-			src = (const uint8_t *)src + 256;
-		}
-
-		/**
-		 * Copy whatever left
-		 */
-		goto COPY_BLOCK_255_BACK15;
-	}
-
-	/**
-	 * For copy with unaligned load
-	 */
-	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
-
-	/**
-	 * Copy whatever left
-	 */
-	goto COPY_BLOCK_64_BACK15;
-}
-
-#endif /* RTE_MACHINE_CPUFLAG */
-
-static inline void *
-rte_memcpy_aligned(void *dst, const void *src, size_t n)
-{
-	void *ret = dst;
-
-	/* Copy size <= 16 bytes */
-	if (n < 16) {
-		if (n & 0x01) {
-			*(uint8_t *)dst = *(const uint8_t *)src;
-			src = (const uint8_t *)src + 1;
-			dst = (uint8_t *)dst + 1;
-		}
-		if (n & 0x02) {
-			*(uint16_t *)dst = *(const uint16_t *)src;
-			src = (const uint16_t *)src + 1;
-			dst = (uint16_t *)dst + 1;
-		}
-		if (n & 0x04) {
-			*(uint32_t *)dst = *(const uint32_t *)src;
-			src = (const uint32_t *)src + 1;
-			dst = (uint32_t *)dst + 1;
-		}
-		if (n & 0x08)
-			*(uint64_t *)dst = *(const uint64_t *)src;
-
-		return ret;
-	}
-
-	/* Copy 16 <= size <= 32 bytes */
-	if (n <= 32) {
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov16((uint8_t *)dst - 16 + n,
-				(const uint8_t *)src - 16 + n);
-
-		return ret;
-	}
-
-	/* Copy 32 < size <= 64 bytes */
-	if (n <= 64) {
-		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov32((uint8_t *)dst - 32 + n,
-				(const uint8_t *)src - 32 + n);
-
-		return ret;
-	}
-
-	/* Copy 64 bytes blocks */
-	for (; n >= 64; n -= 64) {
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;
-	}
-
-	/* Copy whatever left */
-	rte_mov64((uint8_t *)dst - 64 + n,
-			(const uint8_t *)src - 64 + n);
-
-	return ret;
-}
-
 static inline void *
 rte_memcpy(void *dst, const void *src, size_t n)
 {
-	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
-		return rte_memcpy_aligned(dst, src, n);
+	if (n <= RTE_X86_MEMCPY_THRESH)
+		return rte_memcpy_internal(dst, src, n);
 	else
-		return rte_memcpy_generic(dst, src, n);
+		return (*rte_memcpy_ptr)(dst, src, n);
 }
 
 #ifdef __cplusplus
 }
 #endif
 
 #endif /* _RTE_MEMCPY_X86_64_H_ */
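Note on the new rte_memcpy() above: copies of up to RTE_X86_MEMCPY_THRESH (128) bytes take the inline rte_memcpy_internal() path, and anything larger goes through rte_memcpy_ptr. The code that assigns rte_memcpy_ptr is not in this section; the cover letter only says the pointers are bound at constructor time from the CPU flags. A rough sketch of that pattern, assuming GCC or clang and using the compiler builtin __builtin_cpu_supports() rather than DPDK's own CPU-flag API. All names below (COPY_THRESH, copy_sse, copy_avx2, copy_ptr, copy_select, copy_dispatch) are hypothetical, and the variants are stand-ins that simply call memcpy():

```c
#include <stddef.h>
#include <string.h>

#define COPY_THRESH 128	/* mirrors RTE_X86_MEMCPY_THRESH in the patch */

/* Stand-ins for the per-ISA variants; real ones would be built with
 * different instruction sets enabled. */
static void *
copy_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

#if defined(__x86_64__) || defined(__i386__)
static void *
copy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}
#endif

/* Defaults to the baseline so that a call made before the constructor
 * has run still works. */
static void *(*copy_ptr)(void *dst, const void *src, size_t n) = copy_sse;

/* Runs before main(): pick the best variant the running CPU supports. */
__attribute__((constructor))
static void
copy_select(void)
{
#if defined(__x86_64__) || defined(__i386__)
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx2"))
		copy_ptr = copy_avx2;
#endif
}

/* Small copies stay inline; larger ones pay one indirect call. */
static inline void *
copy_dispatch(void *dst, const void *src, size_t n)
{
	if (n <= COPY_THRESH)
		return memcpy(dst, src, n);
	return (*copy_ptr)(dst, src, n);
}
```

The threshold keeps the indirect call off the small-copy path, where its overhead would likely be a larger share of the total cost; this matches the v3 changelog entry about using the inline code path when the copy size is small.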
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
new file mode 100644
index 0000000..66e8398
--- /dev/null
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy_internal.h
@@ -0,0 +1,904 @@
+/*-
+ *   BSD LICENSE
+ *
+ *   Copyright(c) 2010-2014 Intel Corporation. All rights reserved.
+ *   All rights reserved.
+ *
+ *   Redistribution and use in source and binary forms, with or without
+ *   modification, are permitted provided that the following conditions
+ *   are met:
+ *
+ *     * Redistributions of source code must retain the above copyright
+ *       notice, this list of conditions and the following disclaimer.
+ *     * Redistributions in binary form must reproduce the above copyright
+ *       notice, this list of conditions and the following disclaimer in
+ *       the documentation and/or other materials provided with the
+ *       distribution.
+ *     * Neither the name of Intel Corporation nor the names of its
+ *       contributors may be used to endorse or promote products derived
+ *       from this software without specific prior written permission.
+ *
+ *   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ *   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ *   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ *   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ *   OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ *   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ *   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ *   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ *   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ *   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ *   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ */
+
+#ifndef _RTE_MEMCPY_INTERNAL_X86_64_H_
+#define _RTE_MEMCPY_INTERNAL_X86_64_H_
+
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <string.h>
+#include <rte_vect.h>
+#include <rte_common.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+static __rte_always_inline void *
+rte_memcpy(void *dst, const void *src, size_t n);
+
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+#define ALIGNMENT_MASK 0x3F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	__m512i zmm0;
+
+	zmm0 = _mm512_loadu_si512((const void *)src);
+	_mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov64(dst + 0 * 64, src + 0 * 64);
+	rte_mov64(dst + 1 * 64, src + 1 * 64);
+	rte_mov64(dst + 2 * 64, src + 2 * 64);
+	rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1;
+
+	while (n >= 128) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 128;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		src = src + 128;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		dst = dst + 128;
+	}
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+	while (n >= 512) {
+		zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+		n -= 512;
+		zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+		zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+		zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+		zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+		zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+		zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+		zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+		src = src + 512;
+		_mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+		_mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+		_mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+		_mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+		_mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+		_mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+		_mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+		_mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+		dst = dst + 512;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08)
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				  (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				  (const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK63:
+		if (n > 64) {
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+			return ret;
+		}
+		if (n > 0)
+			rte_mov64((uint8_t *)dst - 64 + n,
+					  (const uint8_t *)src - 64 + n);
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
+	 */
+	dstofss = ((uintptr_t)dst & 0x3F);
+	if (dstofss > 0) {
+		dstofss = 64 - dstofss;
+		n -= dstofss;
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 512-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 511;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 128-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 128) {
+		rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 127;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
+	}
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK63;
+}
+
+#elif defined RTE_MACHINE_CPUFLAG_AVX2
+
+#define ALIGNMENT_MASK 0x1F
+
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	__m256i ymm0;
+
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1, ymm2, ymm3;
+
+	while (n >= 128) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
+		n -= 128;
+		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
+		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
+		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
+		src = (const uint8_t *)src + 128;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
+		dst = (uint8_t *)dst + 128;
+	}
+}
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t bits;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08) {
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		}
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 256 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 16, (const uint8_t *)src + 16);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+		return ret;
+	}
+	if (n <= 256) {
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK31:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+		if (n > 32) {
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov32((uint8_t *)dst - 32 + n,
+					(const uint8_t *)src - 32 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 256 bytes
+	 */
+	dstofss = (uintptr_t)dst & 0x1F;
+	if (dstofss > 0) {
+		dstofss = 32 - dstofss;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+
+	/**
+	 * Copy 128-byte blocks
+	 */
+	rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 127;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_128_BACK31;
+}
+
+#else /* RTE_MACHINE_CPUFLAG */
+
+#define ALIGNMENT_MASK 0x0F
+
+/**
+ * SSE & AVX implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
+	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
+	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
+	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
+	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
+	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
+	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
+	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
+}
+
+/**
+ * Macro for copying unaligned block from one location to another with constant load offset,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be immediate value within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
+__extension__ ({                                                                                            \
+    int tmp;                                                                                                \
+    while (len >= 128 + 16 - offset) {                                                                      \
+        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
+        len -= 128;                                                                                         \
+        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
+        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
+        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
+        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
+        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
+        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
+        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
+        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
+        src = (const uint8_t *)src + 128;                                                                   \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
+        dst = (uint8_t *)dst + 128;                                                                         \
+    }                                                                                                       \
+    tmp = len;                                                                                              \
+    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
+    tmp -= len;                                                                                             \
+    src = (const uint8_t *)src + tmp;                                                                       \
+    dst = (uint8_t *)dst + tmp;                                                                             \
+    if (len >= 32 + 16 - offset) {                                                                          \
+        while (len >= 32 + 16 - offset) {                                                                   \
+            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
+            len -= 32;                                                                                      \
+            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
+            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
+            src = (const uint8_t *)src + 32;                                                                \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
+            dst = (uint8_t *)dst + 32;                                                                      \
+        }                                                                                                   \
+        tmp = len;                                                                                          \
+        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
+        tmp -= len;                                                                                         \
+        src = (const uint8_t *)src + tmp;                                                                   \
+        dst = (uint8_t *)dst + tmp;                                                                         \
+    }                                                                                                       \
+})
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
+__extension__ ({                                                      \
+    switch (offset) {                                                 \
+    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
+    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
+    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x03); break;  \
+    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x04); break;  \
+    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x05); break;  \
+    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x06); break;  \
+    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x07); break;  \
+    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x08); break;  \
+    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x09); break;  \
+    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0A); break;  \
+    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0B); break;  \
+    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0C); break;  \
+    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0D); break;  \
+    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0E); break;  \
+    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
+    default:;                                                         \
+    }                                                                 \
+})
+
+static inline void *
+rte_memcpy_generic(void *dst, const void *src, size_t n)
+{
+	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+	uintptr_t dstu = (uintptr_t)dst;
+	uintptr_t srcu = (uintptr_t)src;
+	void *ret = dst;
+	size_t dstofss;
+	size_t srcofs;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dstu = *(const uint8_t *)srcu;
+			srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+			dstu = (uintptr_t)((uint8_t *)dstu + 1);
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dstu = *(const uint16_t *)srcu;
+			srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+			dstu = (uintptr_t)((uint16_t *)dstu + 1);
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dstu = *(const uint32_t *)srcu;
+			srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+			dstu = (uintptr_t)((uint32_t *)dstu + 1);
+		}
+		if (n & 0x08) {
+			*(uint64_t *)dstu = *(const uint64_t *)srcu;
+		}
+		return ret;
+	}
+
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 48) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
+		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 128) {
+		goto COPY_BLOCK_128_BACK15;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+COPY_BLOCK_255_BACK15:
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK15:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK15:
+		if (n >= 32) {
+			n -= 32;
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 32;
+			dst = (uint8_t *)dst + 32;
+		}
+		if (n > 16) {
+			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes,
+	 * and make sure the first 15 bytes are copied, because
+	 * unaligned copy functions require up to 15 bytes
+	 * backwards access.
+	 */
+	dstofss = (uintptr_t)dst & 0x0F;
+	if (dstofss > 0) {
+		dstofss = 16 - dstofss + 16;
+		n -= dstofss;
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		src = (const uint8_t *)src + dstofss;
+		dst = (uint8_t *)dst + dstofss;
+	}
+	srcofs = ((uintptr_t)src & 0x0F);
+
+	/**
+	 * For aligned copy
+	 */
+	if (srcofs == 0) {
+		/**
+		 * Copy 256-byte blocks
+		 */
+		for (; n >= 256; n -= 256) {
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			dst = (uint8_t *)dst + 256;
+			src = (const uint8_t *)src + 256;
+		}
+
+		/**
+		 * Copy whatever is left
+		 */
+		goto COPY_BLOCK_255_BACK15;
+	}
+
+	/**
+	 * For copy with unaligned load
+	 */
+	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_64_BACK15;
+}
+
+#endif /* RTE_MACHINE_CPUFLAG */
+
+static inline void *
+rte_memcpy_aligned(void *dst, const void *src, size_t n)
+{
+	void *ret = dst;
+
+	/* Copy size < 16 bytes */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08)
+			*(uint64_t *)dst = *(const uint64_t *)src;
+
+		return ret;
+	}
+
+	/* Copy 16 <= size <= 32 bytes */
+	if (n <= 32) {
+		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n,
+				(const uint8_t *)src - 16 + n);
+
+		return ret;
+	}
+
+	/* Copy 32 < size <= 64 bytes */
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov32((uint8_t *)dst - 32 + n,
+				(const uint8_t *)src - 32 + n);
+
+		return ret;
+	}
+
+	/* Copy 64-byte blocks */
+	for (; n >= 64; n -= 64) {
+		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+		dst = (uint8_t *)dst + 64;
+		src = (const uint8_t *)src + 64;
+	}
+
+	/* Copy whatever is left */
+	rte_mov64((uint8_t *)dst - 64 + n,
+			(const uint8_t *)src - 64 + n);
+
+	return ret;
+}
+
+static inline void *
+rte_memcpy_internal(void *dst, const void *src, size_t n)
+{
+	if (!(((uintptr_t)dst | (uintptr_t)src) & ALIGNMENT_MASK))
+		return rte_memcpy_aligned(dst, src, n);
+	else
+		return rte_memcpy_generic(dst, src, n);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_MEMCPY_INTERNAL_X86_64_H_ */
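
Note on the header above: the small-size paths copy a range with two fixed-size moves, one anchored at the start and one anchored at the end, so the moves overlap instead of branching on the exact length. rte_memcpy_internal() then picks the aligned or generic routine by OR-ing both pointers against ALIGNMENT_MASK. The following is only a minimal, portable sketch of those two ideas. The sketch_* names are hypothetical and not part of the patch, and memcpy() of 16 bytes stands in for the SSE load/store pair in rte_mov16().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for rte_mov16(): copies exactly 16 bytes. */
static inline void
sketch_mov16(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 16);
}

/*
 * Copy n bytes, 16 <= n <= 32, with two fixed-size moves.
 * The second move ends exactly at byte n, so it overlaps the first
 * by (32 - n) bytes; the overlap is rewritten with identical data.
 */
static inline void *
sketch_copy_16_to_32(void *dst, const void *src, size_t n)
{
	assert(n >= 16 && n <= 32);
	sketch_mov16((uint8_t *)dst, (const uint8_t *)src);
	sketch_mov16((uint8_t *)dst + (n - 16),
			(const uint8_t *)src + (n - 16));
	return dst;
}

/*
 * Same test as in rte_memcpy_internal(): returns non-zero only when
 * both pointers are aligned to (mask + 1) bytes.
 */
static inline int
sketch_both_aligned(const void *dst, const void *src, uintptr_t mask)
{
	return !(((uintptr_t)dst | (uintptr_t)src) & mask);
}
```

The overlapping tail move only works because source and destination must not overlap each other, which is the precondition the header documents for every rte_mov* helper.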
diff --git a/lib/librte_eal/linuxapp/eal/Makefile b/lib/librte_eal/linuxapp/eal/Makefile
index 90bca4d..2e50dd8 100644
--- a/lib/librte_eal/linuxapp/eal/Makefile
+++ b/lib/librte_eal/linuxapp/eal/Makefile
@@ -105,6 +105,26 @@ SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_service.c
 SRCS-$(CONFIG_RTE_EXEC_ENV_LINUXAPP) += rte_cpuflags.c
 SRCS-$(CONFIG_RTE_ARCH_X86) += rte_spinlock.c
 
+# memcpy run-time dispatch sources
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy.c
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_sse.c
+
+CC_SUPPORT_AVX2 := $(shell $(CC) -march=core-avx2 -dM -E - < /dev/null 2>&1 | grep -q AVX2 && echo 1)
+ifeq ($(CC_SUPPORT_AVX2),1)
+CFLAGS_rte_memcpy.o += -DCC_SUPPORT_AVX2
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx2.c
+CFLAGS_rte_memcpy_avx2.o += -mavx2
+CFLAGS_rte_memcpy_avx2.o += -DRTE_MACHINE_CPUFLAG_AVX2
+endif
+
+CC_SUPPORT_AVX512F := $(shell $(CC) -mavx512f -dM -E - < /dev/null 2>&1 | grep -q AVX512F && echo 1)
+ifeq ($(CC_SUPPORT_AVX512F),1)
+CFLAGS_rte_memcpy.o += -DCC_SUPPORT_AVX512F
+SRCS-$(CONFIG_RTE_ARCH_X86) += rte_memcpy_avx512f.c
+CFLAGS_rte_memcpy_avx512f.o += -mavx512f
+CFLAGS_rte_memcpy_avx512f.o += -DRTE_MACHINE_CPUFLAG_AVX512F
+endif
+
 CFLAGS_eal_common_cpuflags.o := $(CPUFLAGS_LIST)
 
 CFLAGS_eal.o := -D_GNU_SOURCE
-- 
2.7.4
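
Note on the Makefile change above: each ISA variant is built as its own object with its own -m flag, and the cover letter says the function pointers are bound at constructor time. The sketch below shows one way that binding can look. It is an assumption-laden illustration and not the code in rte_memcpy.c: the sketch_* names are hypothetical, every variant simply wraps memcpy() so the example builds without per-file ISA flags, and the GCC/clang builtin __builtin_cpu_supports() stands in for whatever CPU flag query the patch actually uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef void *(*sketch_memcpy_t)(void *dst, const void *src, size_t n);

/*
 * Hypothetical per-ISA variants. In the patch these would live in
 * separate files compiled with different flags; here they all wrap
 * memcpy() so the sketch stays portable.
 */
static void *
sketch_memcpy_sse(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *
sketch_memcpy_avx2(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

static void *
sketch_memcpy_avx512f(void *dst, const void *src, size_t n)
{
	return memcpy(dst, src, n);
}

/* Defaults to the baseline so calls made before the constructor work. */
static sketch_memcpy_t sketch_memcpy_ptr = sketch_memcpy_sse;

/* Runs before main() and binds the best variant the running CPU supports. */
static void __attribute__((constructor))
sketch_memcpy_init(void)
{
	sketch_memcpy_t avx2 = sketch_memcpy_avx2;
	sketch_memcpy_t avx512f = sketch_memcpy_avx512f;

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
	__builtin_cpu_init();
	if (__builtin_cpu_supports("avx512f"))
		sketch_memcpy_ptr = avx512f;
	else if (__builtin_cpu_supports("avx2"))
		sketch_memcpy_ptr = avx2;
#else
	/* Non-x86 or unknown compiler: keep the baseline variant. */
	(void)avx2;
	(void)avx512f;
#endif
	assert(sketch_memcpy_ptr != NULL);
}

/* Callers go through the pointer and pay one indirect call per copy. */
static inline void *
sketch_memcpy(void *dst, const void *src, size_t n)
{
	return sketch_memcpy_ptr(dst, src, n);
}
```

The indirect call is the cost of this design, which is presumably why the v3 notes keep an inline path for small copies and use the dispatched path only for larger ones.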


end of thread, other threads:[~2017-11-03  7:47 UTC | newest]

Thread overview: 88+ messages
2017-09-26  7:41 [dpdk-dev] [PATCH v3 0/3] dynamic linking support Xiaoyun Li
2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
2017-10-01 23:41   ` Ananyev, Konstantin
2017-10-02  0:12     ` Li, Xiaoyun
2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
2017-09-26  7:41 ` [dpdk-dev] [PATCH v3 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
2017-10-02  0:08   ` Ananyev, Konstantin
2017-10-02  0:09     ` Li, Xiaoyun
2017-10-02  9:35     ` Ananyev, Konstantin
2017-10-02 16:13 ` [dpdk-dev] [PATCH v4 0/3] run-time Linking support Xiaoyun Li
2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
2017-10-02 16:39     ` Ananyev, Konstantin
2017-10-02 23:10       ` Li, Xiaoyun
2017-10-03 11:15         ` Ananyev, Konstantin
2017-10-03 11:39           ` Li, Xiaoyun
2017-10-03 12:12             ` Ananyev, Konstantin
2017-10-03 12:23               ` Li, Xiaoyun
2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
2017-10-02 16:13   ` [dpdk-dev] [PATCH v4 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
2017-10-02 16:52     ` Ananyev, Konstantin
2017-10-03  8:15       ` Li, Xiaoyun
2017-10-03 11:23         ` Ananyev, Konstantin
2017-10-03 11:27           ` Li, Xiaoyun
2017-10-03 14:59   ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Xiaoyun Li
2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
2017-10-03 14:59     ` [dpdk-dev] [PATCH v5 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
2017-10-04 17:56     ` [dpdk-dev] [PATCH v5 0/3] run-time Linking support Ananyev, Konstantin
2017-10-04 22:33       ` Li, Xiaoyun
2017-10-04 22:58     ` [dpdk-dev] [PATCH v6 " Xiaoyun Li
2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
2017-10-05  9:37         ` Ananyev, Konstantin
2017-10-05  9:38           ` Ananyev, Konstantin
2017-10-05 11:19           ` Li, Xiaoyun
2017-10-05 11:26             ` Richardson, Bruce
2017-10-05 11:26             ` Li, Xiaoyun
2017-10-05 12:12               ` Ananyev, Konstantin
2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
2017-10-04 22:58       ` [dpdk-dev] [PATCH v6 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
2017-10-05  9:40         ` Ananyev, Konstantin
2017-10-05 10:23           ` Li, Xiaoyun
2017-10-05 12:33       ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Xiaoyun Li
2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
2017-10-09 17:47           ` Thomas Monjalon
2017-10-13  1:06             ` Li, Xiaoyun
2017-10-13  7:21               ` Thomas Monjalon
2017-10-13  7:30                 ` Li, Xiaoyun
2017-10-13  7:31                 ` Ananyev, Konstantin
2017-10-13  7:36                   ` Thomas Monjalon
2017-10-13  7:41                     ` Li, Xiaoyun
2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
2017-10-05 12:33         ` [dpdk-dev] [PATCH v7 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
2017-10-05 13:24         ` [dpdk-dev] [PATCH v7 0/3] run-time Linking support Ananyev, Konstantin
2017-10-09 17:40         ` Thomas Monjalon
2017-10-13  0:58           ` Li, Xiaoyun
2017-10-13  9:01         ` [dpdk-dev] [PATCH v8 " Xiaoyun Li
2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 1/3] eal/x86: run-time dispatch over memcpy Xiaoyun Li
2017-10-13  9:28             ` Thomas Monjalon
2017-10-13 10:26               ` Ananyev, Konstantin
2017-10-17 21:24             ` Thomas Monjalon
2017-10-18  2:21               ` Li, Xiaoyun
2017-10-18  6:22                 ` Li, Xiaoyun
2017-10-19  2:45                   ` Li, Xiaoyun
2017-10-19  6:58                     ` Thomas Monjalon
2017-10-19  7:51                       ` Li, Xiaoyun
2017-10-19  8:33                         ` Thomas Monjalon
2017-10-19  8:50                           ` Li, Xiaoyun
2017-10-19  8:59                             ` Ananyev, Konstantin
2017-10-19  9:00                             ` Thomas Monjalon
2017-10-19  9:29                               ` Bruce Richardson
2017-10-20  1:02                                 ` Li, Xiaoyun
2017-10-25  6:55                                   ` Li, Xiaoyun
2017-10-25  7:25                                     ` Thomas Monjalon
2017-10-29  8:49                                       ` Thomas Monjalon
2017-11-02 10:22                                         ` Wang, Zhihong
2017-11-02 10:44                                           ` Thomas Monjalon
2017-11-02 10:58                                             ` Li, Xiaoyun
2017-11-02 12:15                                               ` Thomas Monjalon
2017-11-03  7:47                                             ` Yao, Lei A
2017-10-25  8:50                                     ` Ananyev, Konstantin
2017-10-25  8:54                                       ` Li, Xiaoyun
2017-10-25  9:00                                         ` Thomas Monjalon
2017-10-25 10:32                                           ` Li, Xiaoyun
2017-10-25  9:14                                         ` Ananyev, Konstantin
2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 2/3] app/test: run-time dispatch over memcpy perf test Xiaoyun Li
2017-10-13  9:01           ` [dpdk-dev] [PATCH v8 3/3] efd: run-time dispatch over x86 EFD functions Xiaoyun Li
2017-10-13 13:13           ` [dpdk-dev] [PATCH v8 0/3] run-time Linking support Thomas Monjalon
     [not found] <506411689-94690-2-git-send-email-xiaoyun.li@intel.com>
2017-10-02 12:31 ` [dpdk-dev] [PATCH v3 1/3] eal/x86: run-time dispatch over memcpy Konstantin Ananyev
