DPDK patches and discussions
* [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
@ 2015-01-29  2:38 Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29  2:38 UTC (permalink / raw)
  To: dev

This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
It also extends memcpy test coverage with unaligned cases and more test points.

Optimization techniques are summarized below:

1. Utilize full cache bandwidth

2. Enforce aligned stores (see the sketch after this list)

3. Apply load address alignment based on architecture features

4. Make load/store addresses available as early as possible

5. General optimization techniques such as inlining, branch reduction, and prefetch-friendly access patterns

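As a rough illustration of technique 2 above, the sketch below copies just
enough head bytes to bring the destination onto a 32-byte boundary before
entering the bulk loop, so that every subsequent store is aligned. This is
plain C with illustrative names, not code from the patch; a plain memcpy()
stands in for the vectorized bulk loop:

#include <stdint.h>
#include <string.h>

#define ALIGN 32   /* illustrative; the AVX2 path in this series aligns stores to 32 bytes */

static void
copy_with_aligned_stores(uint8_t *dst, const uint8_t *src, size_t n)
{
	/* Head: number of bytes needed to reach the next 32-byte boundary */
	size_t head = (ALIGN - ((uintptr_t)dst & (ALIGN - 1))) & (ALIGN - 1);

	if (head > n)
		head = n;
	memcpy(dst, src, head);   /* small unaligned head copy, at most ALIGN - 1 bytes */
	dst += head;
	src += head;
	n -= head;

	/* Bulk: every destination address is now 32-byte aligned, so the
	 * real implementation can use aligned stores from here on. */
	memcpy(dst, src, n);
}
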
--------------
Changes in v2:

1. Reduced constant test cases in app/test/test_memcpy_perf.c for a faster build

2. Modified macro definitions for better code readability & safety

Zhihong Wang (4):
  app/test: Disabled VTA for memcpy test in app/test/Makefile
  app/test: Removed unnecessary test cases in app/test/test_memcpy.c
  app/test: Extended test coverage in app/test/test_memcpy_perf.c
  lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
    and AVX platforms

 app/test/Makefile                                  |   6 +
 app/test/test_memcpy.c                             |  52 +-
 app/test/test_memcpy_perf.c                        | 220 ++++---
 .../common/include/arch/x86/rte_memcpy.h           | 680 +++++++++++++++------
 4 files changed, 654 insertions(+), 304 deletions(-)

-- 
1.9.3


* [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile
  2015-01-29  2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
@ 2015-01-29  2:38 ` Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c Zhihong Wang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29  2:38 UTC (permalink / raw)
  To: dev

VTA (variable tracking assignments) is for debugging only; it increases compile time and binary size, especially when there are a lot of inline functions.
So disable it for the memcpy tests, since they contain a lot of inline calls.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 app/test/Makefile | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/app/test/Makefile b/app/test/Makefile
index 4311f96..94dbadf 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -143,6 +143,12 @@ CFLAGS_test_kni.o += -Wno-deprecated-declarations
 endif
 CFLAGS += -D_GNU_SOURCE
 
+# Disable VTA for memcpy test
+ifeq ($(CC), gcc)
+CFLAGS_test_memcpy.o += -fno-var-tracking-assignments
+CFLAGS_test_memcpy_perf.o += -fno-var-tracking-assignments
+endif
+
 # this application needs libraries first
 DEPDIRS-y += lib
 
-- 
1.9.3


* [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c
  2015-01-29  2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang
@ 2015-01-29  2:38 ` Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c Zhihong Wang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29  2:38 UTC (permalink / raw)
  To: dev

Removed the unnecessary test cases for the base move functions, since func_test already covers them all.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 app/test/test_memcpy.c | 52 +-------------------------------------------------
 1 file changed, 1 insertion(+), 51 deletions(-)

diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
index 56b8e1e..b2bb4e0 100644
--- a/app/test/test_memcpy.c
+++ b/app/test/test_memcpy.c
@@ -78,56 +78,9 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#define ALIGNMENT_UNIT          16
+#define ALIGNMENT_UNIT          32
 
 
-
-/* Structure with base memcpy func pointer, and number of bytes it copies */
-struct base_memcpy_func {
-	void (*func)(uint8_t *dst, const uint8_t *src);
-	unsigned size;
-};
-
-/* To create base_memcpy_func structure entries */
-#define BASE_FUNC(n) {rte_mov##n, n}
-
-/* Max number of bytes that can be copies with a "base" memcpy functions */
-#define MAX_BASE_FUNC_SIZE 256
-
-/*
- * Test the "base" memcpy functions, that a copy fixed number of bytes.
- */
-static int
-base_func_test(void)
-{
-	const struct base_memcpy_func base_memcpy_funcs[6] = {
-		BASE_FUNC(16),
-		BASE_FUNC(32),
-		BASE_FUNC(48),
-		BASE_FUNC(64),
-		BASE_FUNC(128),
-		BASE_FUNC(256),
-	};
-	unsigned i, j;
-	unsigned num_funcs = sizeof(base_memcpy_funcs) / sizeof(base_memcpy_funcs[0]);
-	uint8_t dst[MAX_BASE_FUNC_SIZE];
-	uint8_t src[MAX_BASE_FUNC_SIZE];
-
-	for (i = 0; i < num_funcs; i++) {
-		unsigned size = base_memcpy_funcs[i].size;
-		for (j = 0; j < size; j++) {
-			dst[j] = 0;
-			src[j] = (uint8_t) rte_rand();
-		}
-		base_memcpy_funcs[i].func(dst, src);
-		for (j = 0; j < size; j++)
-			if (dst[j] != src[j])
-				return -1;
-	}
-
-	return 0;
-}
-
 /*
  * Create two buffers, and initialise one with random values. These are copied
  * to the second buffer and then compared to see if the copy was successful.
@@ -218,9 +171,6 @@ test_memcpy(void)
 	ret = func_test();
 	if (ret != 0)
 		return -1;
-	ret = base_func_test();
-	if (ret != 0)
-		return -1;
 	return 0;
 }
 
-- 
1.9.3


* [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c
  2015-01-29  2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c Zhihong Wang
@ 2015-01-29  2:38 ` Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29  2:38 UTC (permalink / raw)
  To: dev

Main code changes:

1. Added more typical data points for a more thorough performance test

2. Added unaligned test cases, since unaligned copies are common in DPDK usage (see the sketch after this list)

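The unaligned cases start each copy a few bytes past an aligned base
address, which is why the buffers are now also allocated ALIGNMENT_UNIT
extra bytes: the offset copy must stay inside the allocation. A minimal
stand-alone sketch of that setup (illustration only, not the test code;
aligned_alloc() stands in for rte_malloc()):

#include <stdint.h>
#include <stdlib.h>

#define ALIGNMENT_UNIT 32      /* matches the test's new alignment setting */
#define BUF_SIZE       8192    /* matches SMALL_BUFFER_SIZE */

int
main(void)
{
	/* One ALIGNMENT_UNIT of slack on top of the largest copy size */
	uint8_t *buf = aligned_alloc(ALIGNMENT_UNIT, BUF_SIZE + ALIGNMENT_UNIT);
	size_t uoffset = 5;   /* e.g. the 5-byte source offset used by the unaligned tests */

	if (buf == NULL)
		return 1;

	const uint8_t *src = buf + uoffset;   /* deliberately misaligned start address */

	/* A copy of up to BUF_SIZE bytes starting at src ends at
	 * buf + uoffset + BUF_SIZE, still within the allocation. */
	(void)src;
	free(buf);
	return 0;
}
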
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 app/test/test_memcpy_perf.c | 220 +++++++++++++++++++++++++++-----------------
 1 file changed, 138 insertions(+), 82 deletions(-)

diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 7809610..754828e 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -54,9 +54,10 @@
 /* List of buffer sizes to test */
 #if TEST_VALUE_RANGE == 0
 static size_t buf_sizes[] = {
-	0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255,
-	256, 257, 320, 384, 511, 512, 513, 1023, 1024, 1025, 1518, 1522, 1600,
-	2048, 3072, 4096, 5120, 6144, 7168, 8192
+	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
+	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
+	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
+	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
 };
 /* MUST be as large as largest packet size above */
 #define SMALL_BUFFER_SIZE       8192
@@ -78,7 +79,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#define ALIGNMENT_UNIT          16
+#define ALIGNMENT_UNIT          32
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
@@ -94,19 +95,19 @@ init_buffers(void)
 {
 	unsigned i;
 
-	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE, ALIGNMENT_UNIT);
+	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (large_buf_read == NULL)
 		goto error_large_buf_read;
 
-	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE, ALIGNMENT_UNIT);
+	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (large_buf_write == NULL)
 		goto error_large_buf_write;
 
-	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE, ALIGNMENT_UNIT);
+	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (small_buf_read == NULL)
 		goto error_small_buf_read;
 
-	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE, ALIGNMENT_UNIT);
+	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (small_buf_write == NULL)
 		goto error_small_buf_write;
 
@@ -140,25 +141,25 @@ free_buffers(void)
 
 /*
  * Get a random offset into large array, with enough space needed to perform
- * max copy size. Offset is aligned.
+ * max copy size. Offset is aligned, uoffset is used for unalignment setting.
  */
 static inline size_t
-get_rand_offset(void)
+get_rand_offset(size_t uoffset)
 {
-	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
-	                ~(ALIGNMENT_UNIT - 1));
+	return (((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
+			~(ALIGNMENT_UNIT - 1)) + uoffset);
 }
 
 /* Fill in source and destination addresses. */
 static inline void
-fill_addr_arrays(size_t *dst_addr, int is_dst_cached,
-		size_t *src_addr, int is_src_cached)
+fill_addr_arrays(size_t *dst_addr, int is_dst_cached, size_t dst_uoffset,
+				 size_t *src_addr, int is_src_cached, size_t src_uoffset)
 {
 	unsigned int i;
 
 	for (i = 0; i < TEST_BATCH_SIZE; i++) {
-		dst_addr[i] = (is_dst_cached) ? 0 : get_rand_offset();
-		src_addr[i] = (is_src_cached) ? 0 : get_rand_offset();
+		dst_addr[i] = (is_dst_cached) ? dst_uoffset : get_rand_offset(dst_uoffset);
+		src_addr[i] = (is_src_cached) ? src_uoffset : get_rand_offset(src_uoffset);
 	}
 }
 
@@ -169,16 +170,17 @@ fill_addr_arrays(size_t *dst_addr, int is_dst_cached,
  */
 static void
 do_uncached_write(uint8_t *dst, int is_dst_cached,
-		const uint8_t *src, int is_src_cached, size_t size)
+				  const uint8_t *src, int is_src_cached, size_t size)
 {
 	unsigned i, j;
 	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
 
 	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
-		fill_addr_arrays(dst_addrs, is_dst_cached,
-			 src_addrs, is_src_cached);
-		for (j = 0; j < TEST_BATCH_SIZE; j++)
+		fill_addr_arrays(dst_addrs, is_dst_cached, 0,
+						 src_addrs, is_src_cached, 0);
+		for (j = 0; j < TEST_BATCH_SIZE; j++) {
 			rte_memcpy(dst+dst_addrs[j], src+src_addrs[j], size);
+		}
 	}
 }
 
@@ -186,52 +188,111 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
  * Run a single memcpy performance test. This is a macro to ensure that if
  * the "size" parameter is a constant it won't be converted to a variable.
  */
-#define SINGLE_PERF_TEST(dst, is_dst_cached, src, is_src_cached, size) do {   \
-	unsigned int iter, t;                                                 \
-	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];        \
-	uint64_t start_time, total_time = 0;                                  \
-	uint64_t total_time2 = 0;                                             \
-	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {  \
-		fill_addr_arrays(dst_addrs, is_dst_cached,                    \
-		                 src_addrs, is_src_cached);                   \
-		start_time = rte_rdtsc();                                     \
-		for (t = 0; t < TEST_BATCH_SIZE; t++)                         \
-			rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \
-		total_time += rte_rdtsc() - start_time;                       \
-	}                                                                     \
-	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {  \
-		fill_addr_arrays(dst_addrs, is_dst_cached,                    \
-		                 src_addrs, is_src_cached);                   \
-		start_time = rte_rdtsc();                                     \
-		for (t = 0; t < TEST_BATCH_SIZE; t++)                         \
-			memcpy(dst+dst_addrs[t], src+src_addrs[t], size);     \
-		total_time2 += rte_rdtsc() - start_time;                      \
-	}                                                                     \
-	printf("%8.0f -",  (double)total_time /TEST_ITERATIONS);              \
-	printf("%5.0f",  (double)total_time2 / TEST_ITERATIONS);              \
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
+                         src, is_src_cached, src_uoffset, size)             \
+do {                                                                        \
+    unsigned int iter, t;                                                   \
+    size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
+    uint64_t start_time, total_time = 0;                                    \
+    uint64_t total_time2 = 0;                                               \
+    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
+        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
+                         src_addrs, is_src_cached, src_uoffset);            \
+        start_time = rte_rdtsc();                                           \
+        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
+            rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
+        total_time += rte_rdtsc() - start_time;                             \
+    }                                                                       \
+    for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
+        fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
+                         src_addrs, is_src_cached, src_uoffset);            \
+        start_time = rte_rdtsc();                                           \
+        for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
+            memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
+        total_time2 += rte_rdtsc() - start_time;                            \
+    }                                                                       \
+    printf("%8.0f -",  (double)total_time /TEST_ITERATIONS);                \
+    printf("%5.0f",  (double)total_time2 / TEST_ITERATIONS);                \
 } while (0)
 
-/* Run memcpy() tests for each cached/uncached permutation. */
-#define ALL_PERF_TESTS_FOR_SIZE(n) do {                             \
-	if (__builtin_constant_p(n))                                \
-		printf("\nC%6u", (unsigned)n);                      \
-	else                                                        \
-		printf("\n%7u", (unsigned)n);                       \
-	SINGLE_PERF_TEST(small_buf_write, 1, small_buf_read, 1, n); \
-	SINGLE_PERF_TEST(large_buf_write, 0, small_buf_read, 1, n); \
-	SINGLE_PERF_TEST(small_buf_write, 1, large_buf_read, 0, n); \
-	SINGLE_PERF_TEST(large_buf_write, 0, large_buf_read, 0, n); \
+/* Run aligned memcpy tests for each cached/uncached permutation */
+#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
+do {                                                                     \
+    if (__builtin_constant_p(n))                                         \
+        printf("\nC%6u", (unsigned)n);                                   \
+    else                                                                 \
+        printf("\n%7u", (unsigned)n);                                    \
+    SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
+    SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
+    SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
+    SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
 } while (0)
 
-/*
- * Run performance tests for a number of different sizes and cached/uncached
- * permutations.
- */
+/* Run unaligned memcpy tests for each cached/uncached permutation */
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
+do {                                                                     \
+    if (__builtin_constant_p(n))                                         \
+        printf("\nC%6u", (unsigned)n);                                   \
+    else                                                                 \
+        printf("\n%7u", (unsigned)n);                                    \
+    SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
+    SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
+    SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
+    SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
+} while (0)
+
+/* Run memcpy tests for constant length */
+#define ALL_PERF_TEST_FOR_CONSTANT                                      \
+do {                                                                    \
+    TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U);         \
+    TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U);      \
+    TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U);    \
+} while (0)
+
+/* Run all memcpy tests for aligned constant cases */
+static inline void
+perf_test_constant_aligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memcpy tests for unaligned constant cases */
+static inline void
+perf_test_constant_unaligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE_UNALIGNED
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memcpy tests for aligned variable cases */
+static inline void
+perf_test_variable_aligned(void)
+{
+	unsigned n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned i;
+	for (i = 0; i < n; i++) {
+		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
+	}
+}
+
+/* Run all memcpy tests for unaligned variable cases */
+static inline void
+perf_test_variable_unaligned(void)
+{
+	unsigned n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned i;
+	for (i = 0; i < n; i++) {
+		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
+	}
+}
+
+/* Run all memcpy tests */
 static int
 perf_test(void)
 {
-	const unsigned num_buf_sizes = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
-	unsigned i;
 	int ret;
 
 	ret = init_buffers();
@@ -239,7 +300,8 @@ perf_test(void)
 		return ret;
 
 #if TEST_VALUE_RANGE != 0
-	/* Setup buf_sizes array, if required */
+	/* Set up buf_sizes array, if required */
+	unsigned i;
 	for (i = 0; i < TEST_VALUE_RANGE; i++)
 		buf_sizes[i] = i;
 #endif
@@ -248,28 +310,23 @@ perf_test(void)
 	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
 
 	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
-	       "======= ============== ============== ============== ==============\n"
-	       "   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem\n"
-	       "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
-	       "------- -------------- -------------- -------------- --------------");
-
-	/* Do tests where size is a variable */
-	for (i = 0; i < num_buf_sizes; i++) {
-		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
-	}
+		   "======= ============== ============== ============== ==============\n"
+		   "   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem\n"
+		   "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
+		   "------- -------------- -------------- -------------- --------------");
+
+	printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT);
+	/* Do aligned tests where size is a variable */
+	perf_test_variable_aligned();
 	printf("\n------- -------------- -------------- -------------- --------------");
-	/* Do tests where size is a compile-time constant */
-	ALL_PERF_TESTS_FOR_SIZE(63U);
-	ALL_PERF_TESTS_FOR_SIZE(64U);
-	ALL_PERF_TESTS_FOR_SIZE(65U);
-	ALL_PERF_TESTS_FOR_SIZE(255U);
-	ALL_PERF_TESTS_FOR_SIZE(256U);
-	ALL_PERF_TESTS_FOR_SIZE(257U);
-	ALL_PERF_TESTS_FOR_SIZE(1023U);
-	ALL_PERF_TESTS_FOR_SIZE(1024U);
-	ALL_PERF_TESTS_FOR_SIZE(1025U);
-	ALL_PERF_TESTS_FOR_SIZE(1518U);
-
+	/* Do aligned tests where size is a compile-time constant */
+	perf_test_constant_aligned();
+	printf("\n=========================== Unaligned =============================");
+	/* Do unaligned tests where size is a variable */
+	perf_test_variable_unaligned();
+	printf("\n------- -------------- -------------- -------------- --------------");
+	/* Do unaligned tests where size is a compile-time constant */
+	perf_test_constant_unaligned();
 	printf("\n======= ============== ============== ============== ==============\n\n");
 
 	free_buffers();
@@ -277,7 +334,6 @@ perf_test(void)
 	return 0;
 }
 
-
 static int
 test_memcpy_perf(void)
 {
-- 
1.9.3


* [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-29  2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
                   ` (2 preceding siblings ...)
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c Zhihong Wang
@ 2015-01-29  2:38 ` Zhihong Wang
  2015-01-29 15:17   ` Ananyev, Konstantin
  2015-01-29  6:16 ` [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Fu, JingguoX
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29  2:38 UTC (permalink / raw)
  To: dev

Main code changes:

1. Differentiate architectural features based on CPU flags

    a. Implement separate move functions for SSE/AVX/AVX2 to make full use of cache bandwidth

    b. Implement a separate copy flow specifically optimized for the target architecture

2. Rewrite the memcpy function "rte_memcpy"

    a. Add store aligning

    b. Add load aligning based on architectural features (see the sketch after this list)

    c. Put block copy loop into inline move functions for better control of instruction order

    d. Eliminate unnecessary MOVs

3. Rewrite the inline move functions

    a. Add move functions for unaligned load cases

    b. Change instruction order in copy loops for better pipeline utilization

    c. Use intrinsics instead of assembly code

4. Remove slow glibc call for constant copies

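As a rough sketch of items 2.b and 3.a (illustration only, not the patch code): on SSE platforms, once the destination has been aligned, a source that is still misaligned by a known offset can be read with loads from the rounded-down 16-byte-aligned address and the requested bytes stitched back together with the SSSE3 PALIGNR instruction; the MOVEUNALIGNED_LEFT47 macros in the patch do this at full width. The fixed 5-byte offset and the function name here are illustrative:

#include <stdint.h>
#include <x86intrin.h>   /* _mm_alignr_epi8 is SSSE3; build with -mssse3 or higher */

#define OFF 5            /* example misalignment; PALIGNR takes an immediate shift count */

/*
 * Copy 32 bytes from src (assumed equal to some 16-byte-aligned base + OFF)
 * to dst. The loads touch OFF bytes before src and up to 16 - OFF bytes
 * past src + 31, mirroring the backwards/forwards readability requirement
 * documented for the unaligned macros in the patch.
 */
static inline void
copy32_unaligned_src(uint8_t *dst, const uint8_t *src)
{
	const uint8_t *base = src - OFF;
	__m128i a = _mm_loadu_si128((const __m128i *)(base + 0 * 16));
	__m128i b = _mm_loadu_si128((const __m128i *)(base + 1 * 16));
	__m128i c = _mm_loadu_si128((const __m128i *)(base + 2 * 16));

	/* Reassemble src[0..15] and src[16..31] from the aligned loads */
	_mm_storeu_si128((__m128i *)(dst + 0 * 16), _mm_alignr_epi8(b, a, OFF));
	_mm_storeu_si128((__m128i *)(dst + 1 * 16), _mm_alignr_epi8(c, b, OFF));
}
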
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 .../common/include/arch/x86/rte_memcpy.h           | 680 +++++++++++++++------
 1 file changed, 509 insertions(+), 171 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index fb9eba8..7b2d382 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -34,166 +34,189 @@
 #ifndef _RTE_MEMCPY_X86_64_H_
 #define _RTE_MEMCPY_X86_64_H_
 
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ */
+
+#include <stdio.h>
 #include <stdint.h>
 #include <string.h>
-#include <emmintrin.h>
+#include <x86intrin.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-#include "generic/rte_memcpy.h"
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so it's address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
 
-#ifdef __INTEL_COMPILER
-#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */
-#endif
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
 
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov16(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		: [reg_a] "=x" (reg_a)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
 }
 
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov32(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a, reg_b;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
-}
+	__m256i ymm0;
 
-static inline void
-rte_mov48(uint8_t *dst, const uint8_t *src)
-{
-	__m128i reg_a, reg_b, reg_c;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu 32(%[src]), %[reg_c]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		"movdqu %[reg_c], 32(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b),
-		  [reg_c] "=x" (reg_c)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
 }
 
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov64(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a, reg_b, reg_c, reg_d;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu 32(%[src]), %[reg_c]\n\t"
-		"movdqu 48(%[src]), %[reg_d]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		"movdqu %[reg_c], 32(%[dst])\n\t"
-		"movdqu %[reg_d], 48(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b),
-		  [reg_c] "=x" (reg_c),
-		  [reg_d] "=x" (reg_d)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
 }
 
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu 32(%[src]), %[reg_c]\n\t"
-		"movdqu 48(%[src]), %[reg_d]\n\t"
-		"movdqu 64(%[src]), %[reg_e]\n\t"
-		"movdqu 80(%[src]), %[reg_f]\n\t"
-		"movdqu 96(%[src]), %[reg_g]\n\t"
-		"movdqu 112(%[src]), %[reg_h]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		"movdqu %[reg_c], 32(%[dst])\n\t"
-		"movdqu %[reg_d], 48(%[dst])\n\t"
-		"movdqu %[reg_e], 64(%[dst])\n\t"
-		"movdqu %[reg_f], 80(%[dst])\n\t"
-		"movdqu %[reg_g], 96(%[dst])\n\t"
-		"movdqu %[reg_h], 112(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b),
-		  [reg_c] "=x" (reg_c),
-		  [reg_d] "=x" (reg_d),
-		  [reg_e] "=x" (reg_e),
-		  [reg_f] "=x" (reg_f),
-		  [reg_g] "=x" (reg_g),
-		  [reg_h] "=x" (reg_h)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
 }
 
-#ifdef __INTEL_COMPILER
-#pragma warning(enable:593)
-#endif
-
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov256(uint8_t *dst, const uint8_t *src)
 {
-	rte_mov128(dst, src);
-	rte_mov128(dst + 128, src + 128);
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+	rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
+	rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
+	rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
+	rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
 }
 
-#define rte_memcpy(dst, src, n)              \
-	({ (__builtin_constant_p(n)) ?       \
-	memcpy((dst), (src), (n)) :          \
-	rte_memcpy_func((dst), (src), (n)); })
+/**
+ * Copy 64-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1;
+
+	while (n >= 64) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
+		n -= 64;
+		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
+		src = (const uint8_t *)src + 64;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		dst = (uint8_t *)dst + 64;
+	}
+}
+
+/**
+ * Copy 256-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
+
+	while (n >= 256) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
+		n -= 256;
+		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
+		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
+		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
+		ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32));
+		ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32));
+		ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32));
+		ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32));
+		src = (const uint8_t *)src + 256;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7);
+		dst = (uint8_t *)dst + 256;
+	}
+}
 
 static inline void *
-rte_memcpy_func(void *dst, const void *src, size_t n)
+rte_memcpy(void *dst, const void *src, size_t n)
 {
 	void *ret = dst;
+	int dstofss;
+	int bits;
 
-	/* We can't copy < 16 bytes using XMM registers so do it manually. */
+	/**
+	 * Copy less than 16 bytes
+	 */
 	if (n < 16) {
 		if (n & 0x01) {
 			*(uint8_t *)dst = *(const uint8_t *)src;
-			dst = (uint8_t *)dst + 1;
 			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
 		}
 		if (n & 0x02) {
 			*(uint16_t *)dst = *(const uint16_t *)src;
-			dst = (uint16_t *)dst + 1;
 			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
 		}
 		if (n & 0x04) {
 			*(uint32_t *)dst = *(const uint32_t *)src;
-			dst = (uint32_t *)dst + 1;
 			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
 		}
 		if (n & 0x08) {
 			*(uint64_t *)dst = *(const uint64_t *)src;
@@ -201,95 +224,410 @@ rte_memcpy_func(void *dst, const void *src, size_t n)
 		return ret;
 	}
 
-	/* Special fast cases for <= 128 bytes */
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
 	if (n <= 32) {
 		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
 		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
 		return ret;
 	}
-
 	if (n <= 64) {
 		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
 		rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
 		return ret;
 	}
-
-	if (n <= 128) {
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK31:
+		if (n > 32) {
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+		}
 		return ret;
 	}
 
-	/*
-	 * For large copies > 128 bytes. This combination of 256, 64 and 16 byte
-	 * copies was found to be faster than doing 128 and 32 byte copies as
-	 * well.
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
 	 */
-	for ( ; n >= 256; n -= 256) {
-		rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-		dst = (uint8_t *)dst + 256;
-		src = (const uint8_t *)src + 256;
+	dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
+	n -= dstofss;
+	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+	src = (const uint8_t *)src + dstofss;
+	dst = (uint8_t *)dst + dstofss;
+
+	/**
+	 * Copy 256-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 255;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 64-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 64) {
+		rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 63;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
 	}
 
-	/*
-	 * We split the remaining bytes (which will be less than 256) into
-	 * 64byte (2^6) chunks.
-	 * Using incrementing integers in the case labels of a switch statement
-	 * enourages the compiler to use a jump table. To get incrementing
-	 * integers, we shift the 2 relevant bits to the LSB position to first
-	 * get decrementing integers, and then subtract.
+	/**
+	 * Copy whatever left
 	 */
-	switch (3 - (n >> 6)) {
-	case 0x00:
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		n -= 64;
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;      /* fallthrough */
-	case 0x01:
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		n -= 64;
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;      /* fallthrough */
-	case 0x02:
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		n -= 64;
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;      /* fallthrough */
-	default:
-		;
+	goto COPY_BLOCK_64_BACK31;
+}
+
+#else /* RTE_MACHINE_CPUFLAG_AVX2 */
+
+/**
+ * SSE & AVX implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
+	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
+	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
+	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
+	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
+	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
+	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
+	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
+}
+
+/**
+ * Macro for copying unaligned block from one location to another with constant load offset,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be immediate value within [1, 15]
+ * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
+({                                                                                                          \
+    int tmp;                                                                                                \
+    while (len >= 128 + 16 - offset) {                                                                      \
+        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
+        len -= 128;                                                                                         \
+        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
+        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
+        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
+        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
+        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
+        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
+        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
+        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
+        src = (const uint8_t *)src + 128;                                                                   \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
+        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
+        dst = (uint8_t *)dst + 128;                                                                         \
+    }                                                                                                       \
+    tmp = len;                                                                                              \
+    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
+    tmp -= len;                                                                                             \
+    src = (const uint8_t *)src + tmp;                                                                       \
+    dst = (uint8_t *)dst + tmp;                                                                             \
+    if (len >= 32 + 16 - offset) {                                                                          \
+        while (len >= 32 + 16 - offset) {                                                                   \
+            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
+            len -= 32;                                                                                      \
+            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
+            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
+            src = (const uint8_t *)src + 32;                                                                \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
+            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
+            dst = (uint8_t *)dst + 32;                                                                      \
+        }                                                                                                   \
+        tmp = len;                                                                                          \
+        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
+        tmp -= len;                                                                                         \
+        src = (const uint8_t *)src + tmp;                                                                   \
+        dst = (uint8_t *)dst + tmp;                                                                         \
+    }                                                                                                       \
+})
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
+({                                                                    \
+    switch (offset) {                                                 \
+    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
+    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
+    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
+    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
+    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
+    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
+    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
+    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
+    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
+    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
+    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
+    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
+    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
+    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
+    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
+    default:;                                                         \
+    }                                                                 \
+})
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+	void *ret = dst;
+	int dstofss;
+	int srcofs;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08) {
+			*(uint64_t *)dst = *(const uint64_t *)src;
+		}
+		return ret;
 	}
 
-	/*
-	 * We split the remaining bytes (which will be less than 64) into
-	 * 16byte (2^4) chunks, using the same switch structure as above.
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
 	 */
-	switch (3 - (n >> 4)) {
-	case 0x00:
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		n -= 16;
-		dst = (uint8_t *)dst + 16;
-		src = (const uint8_t *)src + 16;      /* fallthrough */
-	case 0x01:
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		n -= 16;
-		dst = (uint8_t *)dst + 16;
-		src = (const uint8_t *)src + 16;      /* fallthrough */
-	case 0x02:
+	if (n <= 32) {
 		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		n -= 16;
-		dst = (uint8_t *)dst + 16;
-		src = (const uint8_t *)src + 16;      /* fallthrough */
-	default:
-		;
+		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		return ret;
 	}
-
-	/* Copy any remaining bytes, without going beyond end of buffers */
-	if (n != 0) {
+	if (n <= 48) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
 		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 128) {
+		goto COPY_BLOCK_128_BACK15;
 	}
-	return ret;
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+COPY_BLOCK_255_BACK15:
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK15:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK15:
+		if (n >= 32) {
+			n -= 32;
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 32;
+			dst = (uint8_t *)dst + 32;
+		}
+		if (n > 16) {
+			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes,
+	 * and make sure the first 15 bytes are copied, because
+	 * unaligned copy functions require up to 15 bytes
+	 * backwards access.
+	 */
+	dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
+	n -= dstofss;
+	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+	src = (const uint8_t *)src + dstofss;
+	dst = (uint8_t *)dst + dstofss;
+	srcofs = (int)((long long)(const void *)src & 0x0F);
+
+	/**
+	 * For aligned copy
+	 */
+	if (srcofs == 0) {
+		/**
+		 * Copy 256-byte blocks
+		 */
+		for (; n >= 256; n -= 256) {
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			dst = (uint8_t *)dst + 256;
+			src = (const uint8_t *)src + 256;
+		}
+
+		/**
+		 * Copy whatever left
+		 */
+		goto COPY_BLOCK_255_BACK15;
+	}
+
+	/**
+	 * For copy with unaligned load
+	 */
+	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+	/**
+	 * Copy whatever left
+	 */
+	goto COPY_BLOCK_64_BACK15;
 }
 
+#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
+
 #ifdef __cplusplus
 }
 #endif
-- 
1.9.3


* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
  2015-01-29  2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
                   ` (3 preceding siblings ...)
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
@ 2015-01-29  6:16 ` Fu, JingguoX
  2015-02-10  3:06 ` Liang, Cunming
  2015-02-16 15:57 ` De Lara Guarch, Pablo
  6 siblings, 0 replies; 12+ messages in thread
From: Fu, JingguoX @ 2015-01-29  6:16 UTC (permalink / raw)
  To: Wang, Zhihong, dev

Basic Information

        Patch name        DPDK memcpy optimization v2
        Brief description about test purpose    Verify memory copy and memory copy performance cases on a variety of OSes
        Test Flag         Tested-by
        Tester name       jingguox.fu at intel.com

        Test Tool Chain information     N/A
        Commit ID         88fa98a60b34812bfed92e5b2706fcf7e1cbcbc8
        Test Result Summary     Total 6 cases, 6 passed, 0 failed
        
Test environment

        -   Environment 1:
            OS: Ubuntu12.04 3.2.0-23-generic X86_64
            GCC: gcc version 4.6.3
            CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
            NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)

        -   Environment 2: 
            OS: Ubuntu14.04 3.13.0-24-generic
            GCC: gcc version 4.8.2
            CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
            NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)

        -   Environment 3:
            OS: Fedora18 3.6.10-4.fc18.x86_64
            GCC: gcc version 4.7.2 20121109
            CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
            NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)

Detailed Testing information

        Test Case - name  test_memcpy
        Test Case - Description 
                  Create two buffers, and initialise one with random values. These are copied 
                  to the second buffer and then compared to see if the copy was successful. The 
                  bytes outside the copied area are also checked to make sure they were not changed (see the sketch below).
        Test Case - test sample/application
                  test application in app/test
        Test Case - command / instruction
                  # ./app/test/test -n 1 -c ffff
                  #RTE>> memcpy_autotest
        Test Case - expected
                  #RTE>> Test	OK
        Test Result - PASSED

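A minimal sketch of the check described for this case (plain C
illustration, not the actual app/test code; rand() stands in for
rte_rand(), memcpy() for rte_memcpy(), and check_copy is a hypothetical
helper name):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

static int
check_copy(size_t len)   /* len must be <= 256 in this sketch */
{
	uint8_t src[256], dst[256];
	size_t i;

	for (i = 0; i < sizeof(src); i++) {
		src[i] = (uint8_t)rand();
		dst[i] = 0;
	}
	memcpy(dst, src, len);              /* the copy under test */

	for (i = 0; i < len; i++)           /* copied area must match the source */
		if (dst[i] != src[i])
			return -1;
	for (i = len; i < sizeof(dst); i++) /* bytes outside the copied area must be untouched */
		if (dst[i] != 0)
			return -1;
	return 0;
}
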
        Test Case - name  test_memcpy_perf
        Test Case - Description
                  Measures copy performance for a number of different sizes and
                  cached/uncached permutations
        Test Case - test sample/application
                  test application in app/test
        Test Case - command / instruction
                  # ./app/test/test -n 1 -c ffff
                  #RTE>> memcpy_perf_autotest
        Test Case - expected
                  #RTE>> Test	OK
        Test Result - PASSED


-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
Sent: Thursday, January 29, 2015 10:39
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization

This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
It also extends memcpy test coverage with unaligned cases and more test points.

Optimization techniques are summarized below:

1. Utilize full cache bandwidth

2. Enforce aligned stores

3. Apply load address alignment based on architecture features

4. Make load/store address available as early as possible

5. General optimization techniques like inlining, branch reducing, prefetch pattern access

--------------
Changes in v2:

1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast build

2. Modified macro definition for better code readability & safety

Zhihong Wang (4):
  app/test: Disabled VTA for memcpy test in app/test/Makefile
  app/test: Removed unnecessary test cases in app/test/test_memcpy.c
  app/test: Extended test coverage in app/test/test_memcpy_perf.c
  lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
    and AVX platforms

 app/test/Makefile                                  |   6 +
 app/test/test_memcpy.c                             |  52 +-
 app/test/test_memcpy_perf.c                        | 220 ++++---
 .../common/include/arch/x86/rte_memcpy.h           | 680 +++++++++++++++------
 4 files changed, 654 insertions(+), 304 deletions(-)

-- 
1.9.3


* Re: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
@ 2015-01-29 15:17   ` Ananyev, Konstantin
  2015-01-30  5:57     ` Wang, Zhihong
  0 siblings, 1 reply; 12+ messages in thread
From: Ananyev, Konstantin @ 2015-01-29 15:17 UTC (permalink / raw)
  To: Wang, Zhihong, dev

Hi Zhihong,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
> Sent: Thursday, January 29, 2015 2:39 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> Main code changes:
> 
> 1. Differentiate architectural features based on CPU flags
> 
>     a. Implement separated move functions for SSE/AVX/AVX2 to make full utilization of cache bandwidth
> 
>     b. Implement separated copy flow specifically optimized for target architecture
> 
> 2. Rewrite the memcpy function "rte_memcpy"
> 
>     a. Add store aligning
> 
>     b. Add load aligning based on architectural features
> 
>     c. Put block copy loop into inline move functions for better control of instruction order
> 
>     d. Eliminate unnecessary MOVs
> 
> 3. Rewrite the inline move functions
> 
>     a. Add move functions for unaligned load cases
> 
>     b. Change instruction order in copy loops for better pipeline utilization
> 
>     c. Use intrinsics instead of assembly code
> 
> 4. Remove slow glibc call for constant copies
> 
> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> ---
>  .../common/include/arch/x86/rte_memcpy.h           | 680 +++++++++++++++------
>  1 file changed, 509 insertions(+), 171 deletions(-)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> index fb9eba8..7b2d382 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> @@ -34,166 +34,189 @@
>  #ifndef _RTE_MEMCPY_X86_64_H_
>  #define _RTE_MEMCPY_X86_64_H_
> 
> +/**
> + * @file
> + *
> + * Functions for SSE/AVX/AVX2 implementation of memcpy().
> + */
> +
> +#include <stdio.h>
>  #include <stdint.h>
>  #include <string.h>
> -#include <emmintrin.h>
> +#include <x86intrin.h>
> 
>  #ifdef __cplusplus
>  extern "C" {
>  #endif
> 
> -#include "generic/rte_memcpy.h"
> +/**
> + * Copy bytes from one location to another. The locations must not overlap.
> + *
> + * @note This is implemented as a macro, so it's address should not be taken
> + * and care is needed as parameter expressions may be evaluated multiple times.
> + *
> + * @param dst
> + *   Pointer to the destination of the data.
> + * @param src
> + *   Pointer to the source data.
> + * @param n
> + *   Number of bytes to copy.
> + * @return
> + *   Pointer to the destination data.
> + */
> +static inline void *
> +rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
> 
> -#ifdef __INTEL_COMPILER
> -#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */
> -#endif
> +#ifdef RTE_MACHINE_CPUFLAG_AVX2
> 
> +/**
> + * AVX2 implementation below
> + */
> +
> +/**
> + * Copy 16 bytes from one location to another,
> + * locations should not overlap.
> + */
>  static inline void
>  rte_mov16(uint8_t *dst, const uint8_t *src)
>  {
> -	__m128i reg_a;
> -	asm volatile (
> -		"movdqu (%[src]), %[reg_a]\n\t"
> -		"movdqu %[reg_a], (%[dst])\n\t"
> -		: [reg_a] "=x" (reg_a)
> -		: [src] "r" (src),
> -		  [dst] "r"(dst)
> -		: "memory"
> -	);
> +	__m128i xmm0;
> +
> +	xmm0 = _mm_loadu_si128((const __m128i *)src);
> +	_mm_storeu_si128((__m128i *)dst, xmm0);
>  }
> 
> +/**
> + * Copy 32 bytes from one location to another,
> + * locations should not overlap.
> + */
>  static inline void
>  rte_mov32(uint8_t *dst, const uint8_t *src)
>  {
> -	__m128i reg_a, reg_b;
> -	asm volatile (
> -		"movdqu (%[src]), %[reg_a]\n\t"
> -		"movdqu 16(%[src]), %[reg_b]\n\t"
> -		"movdqu %[reg_a], (%[dst])\n\t"
> -		"movdqu %[reg_b], 16(%[dst])\n\t"
> -		: [reg_a] "=x" (reg_a),
> -		  [reg_b] "=x" (reg_b)
> -		: [src] "r" (src),
> -		  [dst] "r"(dst)
> -		: "memory"
> -	);
> -}
> +	__m256i ymm0;
> 
> -static inline void
> -rte_mov48(uint8_t *dst, const uint8_t *src)
> -{
> -	__m128i reg_a, reg_b, reg_c;
> -	asm volatile (
> -		"movdqu (%[src]), %[reg_a]\n\t"
> -		"movdqu 16(%[src]), %[reg_b]\n\t"
> -		"movdqu 32(%[src]), %[reg_c]\n\t"
> -		"movdqu %[reg_a], (%[dst])\n\t"
> -		"movdqu %[reg_b], 16(%[dst])\n\t"
> -		"movdqu %[reg_c], 32(%[dst])\n\t"
> -		: [reg_a] "=x" (reg_a),
> -		  [reg_b] "=x" (reg_b),
> -		  [reg_c] "=x" (reg_c)
> -		: [src] "r" (src),
> -		  [dst] "r"(dst)
> -		: "memory"
> -	);
> +	ymm0 = _mm256_loadu_si256((const __m256i *)src);
> +	_mm256_storeu_si256((__m256i *)dst, ymm0);
>  }
> 
> +/**
> + * Copy 64 bytes from one location to another,
> + * locations should not overlap.
> + */
>  static inline void
>  rte_mov64(uint8_t *dst, const uint8_t *src)
>  {
> -	__m128i reg_a, reg_b, reg_c, reg_d;
> -	asm volatile (
> -		"movdqu (%[src]), %[reg_a]\n\t"
> -		"movdqu 16(%[src]), %[reg_b]\n\t"
> -		"movdqu 32(%[src]), %[reg_c]\n\t"
> -		"movdqu 48(%[src]), %[reg_d]\n\t"
> -		"movdqu %[reg_a], (%[dst])\n\t"
> -		"movdqu %[reg_b], 16(%[dst])\n\t"
> -		"movdqu %[reg_c], 32(%[dst])\n\t"
> -		"movdqu %[reg_d], 48(%[dst])\n\t"
> -		: [reg_a] "=x" (reg_a),
> -		  [reg_b] "=x" (reg_b),
> -		  [reg_c] "=x" (reg_c),
> -		  [reg_d] "=x" (reg_d)
> -		: [src] "r" (src),
> -		  [dst] "r"(dst)
> -		: "memory"
> -	);
> +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
>  }
> 
> +/**
> + * Copy 128 bytes from one location to another,
> + * locations should not overlap.
> + */
>  static inline void
>  rte_mov128(uint8_t *dst, const uint8_t *src)
>  {
> -	__m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h;
> -	asm volatile (
> -		"movdqu (%[src]), %[reg_a]\n\t"
> -		"movdqu 16(%[src]), %[reg_b]\n\t"
> -		"movdqu 32(%[src]), %[reg_c]\n\t"
> -		"movdqu 48(%[src]), %[reg_d]\n\t"
> -		"movdqu 64(%[src]), %[reg_e]\n\t"
> -		"movdqu 80(%[src]), %[reg_f]\n\t"
> -		"movdqu 96(%[src]), %[reg_g]\n\t"
> -		"movdqu 112(%[src]), %[reg_h]\n\t"
> -		"movdqu %[reg_a], (%[dst])\n\t"
> -		"movdqu %[reg_b], 16(%[dst])\n\t"
> -		"movdqu %[reg_c], 32(%[dst])\n\t"
> -		"movdqu %[reg_d], 48(%[dst])\n\t"
> -		"movdqu %[reg_e], 64(%[dst])\n\t"
> -		"movdqu %[reg_f], 80(%[dst])\n\t"
> -		"movdqu %[reg_g], 96(%[dst])\n\t"
> -		"movdqu %[reg_h], 112(%[dst])\n\t"
> -		: [reg_a] "=x" (reg_a),
> -		  [reg_b] "=x" (reg_b),
> -		  [reg_c] "=x" (reg_c),
> -		  [reg_d] "=x" (reg_d),
> -		  [reg_e] "=x" (reg_e),
> -		  [reg_f] "=x" (reg_f),
> -		  [reg_g] "=x" (reg_g),
> -		  [reg_h] "=x" (reg_h)
> -		: [src] "r" (src),
> -		  [dst] "r"(dst)
> -		: "memory"
> -	);
> +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> +	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> +	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
>  }
> 
> -#ifdef __INTEL_COMPILER
> -#pragma warning(enable:593)
> -#endif
> -
> +/**
> + * Copy 256 bytes from one location to another,
> + * locations should not overlap.
> + */
>  static inline void
>  rte_mov256(uint8_t *dst, const uint8_t *src)
>  {
> -	rte_mov128(dst, src);
> -	rte_mov128(dst + 128, src + 128);
> +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> +	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> +	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> +	rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
> +	rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
> +	rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
> +	rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
>  }
> 
> -#define rte_memcpy(dst, src, n)              \
> -	({ (__builtin_constant_p(n)) ?       \
> -	memcpy((dst), (src), (n)) :          \
> -	rte_memcpy_func((dst), (src), (n)); })
> +/**
> + * Copy 64-byte blocks from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n)
> +{
> +	__m256i ymm0, ymm1;
> +
> +	while (n >= 64) {
> +		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
> +		n -= 64;
> +		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
> +		src = (const uint8_t *)src + 64;
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
> +		dst = (uint8_t *)dst + 64;
> +	}
> +}
> +
> +/**
> + * Copy 256-byte blocks from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n)
> +{
> +	__m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
> +
> +	while (n >= 256) {
> +		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
> +		n -= 256;
> +		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
> +		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
> +		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
> +		ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32));
> +		ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32));
> +		ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32));
> +		ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32));
> +		src = (const uint8_t *)src + 256;
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6);
> +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7);
> +		dst = (uint8_t *)dst + 256;
> +	}
> +}
> 
>  static inline void *
> -rte_memcpy_func(void *dst, const void *src, size_t n)
> +rte_memcpy(void *dst, const void *src, size_t n)
>  {
>  	void *ret = dst;
> +	int dstofss;
> +	int bits;
> 
> -	/* We can't copy < 16 bytes using XMM registers so do it manually. */
> +	/**
> +	 * Copy less than 16 bytes
> +	 */
>  	if (n < 16) {
>  		if (n & 0x01) {
>  			*(uint8_t *)dst = *(const uint8_t *)src;
> -			dst = (uint8_t *)dst + 1;
>  			src = (const uint8_t *)src + 1;
> +			dst = (uint8_t *)dst + 1;
>  		}
>  		if (n & 0x02) {
>  			*(uint16_t *)dst = *(const uint16_t *)src;
> -			dst = (uint16_t *)dst + 1;
>  			src = (const uint16_t *)src + 1;
> +			dst = (uint16_t *)dst + 1;
>  		}
>  		if (n & 0x04) {
>  			*(uint32_t *)dst = *(const uint32_t *)src;
> -			dst = (uint32_t *)dst + 1;
>  			src = (const uint32_t *)src + 1;
> +			dst = (uint32_t *)dst + 1;
>  		}
>  		if (n & 0x08) {
>  			*(uint64_t *)dst = *(const uint64_t *)src;
> @@ -201,95 +224,410 @@ rte_memcpy_func(void *dst, const void *src, size_t n)
>  		return ret;
>  	}
> 
> -	/* Special fast cases for <= 128 bytes */
> +	/**
> +	 * Fast way when copy size doesn't exceed 512 bytes
> +	 */
>  	if (n <= 32) {
>  		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
>  		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
>  		return ret;
>  	}
> -
>  	if (n <= 64) {
>  		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
>  		rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
>  		return ret;
>  	}
> -
> -	if (n <= 128) {
> -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> -		rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);
> +	if (n <= 512) {
> +		if (n >= 256) {
> +			n -= 256;
> +			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 256;
> +			dst = (uint8_t *)dst + 256;
> +		}
> +		if (n >= 128) {
> +			n -= 128;
> +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 128;
> +			dst = (uint8_t *)dst + 128;
> +		}
> +		if (n >= 64) {
> +			n -= 64;
> +			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 64;
> +			dst = (uint8_t *)dst + 64;
> +		}
> +COPY_BLOCK_64_BACK31:
> +		if (n > 32) {
> +			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
> +			return ret;
> +		}
> +		if (n > 0) {
> +			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
> +		}
>  		return ret;
>  	}
> 
> -	/*
> -	 * For large copies > 128 bytes. This combination of 256, 64 and 16 byte
> -	 * copies was found to be faster than doing 128 and 32 byte copies as
> -	 * well.
> +	/**
> +	 * Make store aligned when copy size exceeds 512 bytes
>  	 */
> -	for ( ; n >= 256; n -= 256) {
> -		rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> -		dst = (uint8_t *)dst + 256;
> -		src = (const uint8_t *)src + 256;
> +	dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
> +	n -= dstofss;
> +	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +	src = (const uint8_t *)src + dstofss;
> +	dst = (uint8_t *)dst + dstofss;
> +
> +	/**
> +	 * Copy 256-byte blocks.
> +	 * Use copy block function for better instruction order control,
> +	 * which is important when load is unaligned.
> +	 */
> +	rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
> +	bits = n;
> +	n = n & 255;
> +	bits -= n;
> +	src = (const uint8_t *)src + bits;
> +	dst = (uint8_t *)dst + bits;
> +
> +	/**
> +	 * Copy 64-byte blocks.
> +	 * Use copy block function for better instruction order control,
> +	 * which is important when load is unaligned.
> +	 */
> +	if (n >= 64) {
> +		rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
> +		bits = n;
> +		n = n & 63;
> +		bits -= n;
> +		src = (const uint8_t *)src + bits;
> +		dst = (uint8_t *)dst + bits;
>  	}
> 
> -	/*
> -	 * We split the remaining bytes (which will be less than 256) into
> -	 * 64byte (2^6) chunks.
> -	 * Using incrementing integers in the case labels of a switch statement
> -	 * enourages the compiler to use a jump table. To get incrementing
> -	 * integers, we shift the 2 relevant bits to the LSB position to first
> -	 * get decrementing integers, and then subtract.
> +	/**
> +	 * Copy whatever left
>  	 */
> -	switch (3 - (n >> 6)) {
> -	case 0x00:
> -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> -		n -= 64;
> -		dst = (uint8_t *)dst + 64;
> -		src = (const uint8_t *)src + 64;      /* fallthrough */
> -	case 0x01:
> -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> -		n -= 64;
> -		dst = (uint8_t *)dst + 64;
> -		src = (const uint8_t *)src + 64;      /* fallthrough */
> -	case 0x02:
> -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> -		n -= 64;
> -		dst = (uint8_t *)dst + 64;
> -		src = (const uint8_t *)src + 64;      /* fallthrough */
> -	default:
> -		;
> +	goto COPY_BLOCK_64_BACK31;
> +}
> +
> +#else /* RTE_MACHINE_CPUFLAG_AVX2 */
> +
> +/**
> + * SSE & AVX implementation below
> + */
> +
> +/**
> + * Copy 16 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov16(uint8_t *dst, const uint8_t *src)
> +{
> +	__m128i xmm0;
> +
> +	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
> +	_mm_storeu_si128((__m128i *)dst, xmm0);
> +}
> +
> +/**
> + * Copy 32 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov32(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +}
> +
> +/**
> + * Copy 64 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov64(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> +}
> +
> +/**
> + * Copy 128 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov128(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> +	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> +	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> +	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> +	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> +}
> +
> +/**
> + * Copy 256 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov256(uint8_t *dst, const uint8_t *src)
> +{
> +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> +	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> +	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> +	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> +	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> +	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
> +	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
> +	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
> +	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
> +	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
> +	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
> +	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
> +	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
> +}
> +
> +/**
> + * Macro for copying unaligned block from one location to another with constant load offset,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)                                                     \
> +({                                                                                                          \
> +    int tmp;                                                                                                \
> +    while (len >= 128 + 16 - offset) {                                                                      \
> +        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
> +        len -= 128;                                                                                         \
> +        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
> +        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
> +        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
> +        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
> +        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
> +        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
> +        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
> +        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
> +        src = (const uint8_t *)src + 128;                                                                   \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
> +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
> +        dst = (uint8_t *)dst + 128;                                                                         \
> +    }                                                                                                       \
> +    tmp = len;                                                                                              \
> +    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
> +    tmp -= len;                                                                                             \
> +    src = (const uint8_t *)src + tmp;                                                                       \
> +    dst = (uint8_t *)dst + tmp;                                                                             \
> +    if (len >= 32 + 16 - offset) {                                                                          \
> +        while (len >= 32 + 16 - offset) {                                                                   \
> +            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
> +            len -= 32;                                                                                      \
> +            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
> +            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
> +            src = (const uint8_t *)src + 32;                                                                \
> +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
> +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
> +            dst = (uint8_t *)dst + 32;                                                                      \
> +        }                                                                                                   \
> +        tmp = len;                                                                                          \
> +        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
> +        tmp -= len;                                                                                         \
> +        src = (const uint8_t *)src + tmp;                                                                   \
> +        dst = (uint8_t *)dst + tmp;                                                                         \
> +    }                                                                                                       \
> +})
> +
> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Use switch here because the aligning instruction requires immediate value for shift count.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
> +({                                                                    \
> +    switch (offset) {                                                 \
> +    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
> +    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
> +    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
> +    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
> +    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
> +    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
> +    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
> +    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
> +    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
> +    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
> +    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
> +    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
> +    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
> +    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
> +    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
> +    default:;                                                         \
> +    }                                                                 \
> +})

We probably didn't understand each other properly.
My thought was to do something like:

#define MOVEUNALIGNED_128(offset)	do { \
 _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
 _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
 _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
 _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
 _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
 _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
 _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
 _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
} while (0)


Then at MOVEUNALIGNED_LEFT47_IMM:
....
xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
src = (const uint8_t *)src + 128;

switch(offset) {
    case 0x01: MOVEUNALIGNED_128(1); break;  
    ...
    case 0x0f: MOVEUNALIGNED_128(0x0f); break;
}
...
dst = (uint8_t *)dst + 128;

And then in rte_memcpy() you don't need a switch around MOVEUNALIGNED_LEFT47_IMM itself.
I thought it might help to generate smaller code for rte_memcpy() while keeping performance reasonably high.
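
As a rough illustration (a sketch only, assuming SSSE3; it is trimmed to two 16-byte lanes
instead of eight, and the names here are made up rather than taken from the patch):

#include <stdint.h>
#include <tmmintrin.h>          /* _mm_alignr_epi8() */

/* Store step only: needs the offset as an immediate for _mm_alignr_epi8(). */
#define MOVEUNALIGNED_32(offset) do {                                   \
	_mm_storeu_si128((__m128i *)(dst + 0 * 16),                     \
			 _mm_alignr_epi8(xmm1, xmm0, offset));          \
	_mm_storeu_si128((__m128i *)(dst + 1 * 16),                     \
			 _mm_alignr_epi8(xmm2, xmm1, offset));          \
} while (0)

static inline void
copy32_unaligned(uint8_t *dst, const uint8_t *src, int offset)
{
	__m128i xmm0, xmm1, xmm2;

	/* The loads are shared by all offsets, so they stay outside the switch. */
	xmm0 = _mm_loadu_si128((const __m128i *)(src - offset + 0 * 16));
	xmm1 = _mm_loadu_si128((const __m128i *)(src - offset + 1 * 16));
	xmm2 = _mm_loadu_si128((const __m128i *)(src - offset + 2 * 16));

	/* Only the align/store step is switched on the runtime offset. */
	switch (offset) {
	case 0x01: MOVEUNALIGNED_32(0x01); break;
	case 0x02: MOVEUNALIGNED_32(0x02); break;
	/* ... cases 0x03 through 0x0E elided ... */
	case 0x0F: MOVEUNALIGNED_32(0x0F); break;
	default: break;
	}
}

The switch then sits inside the copy loop at the store step, instead of instantiating the
whole MOVEUNALIGNED_LEFT47_IMM body once per offset.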

Konstantin


> +
> +static inline void *
> +rte_memcpy(void *dst, const void *src, size_t n)
> +{
> +	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> +	void *ret = dst;
> +	int dstofss;
> +	int srcofs;
> +
> +	/**
> +	 * Copy less than 16 bytes
> +	 */
> +	if (n < 16) {
> +		if (n & 0x01) {
> +			*(uint8_t *)dst = *(const uint8_t *)src;
> +			src = (const uint8_t *)src + 1;
> +			dst = (uint8_t *)dst + 1;
> +		}
> +		if (n & 0x02) {
> +			*(uint16_t *)dst = *(const uint16_t *)src;
> +			src = (const uint16_t *)src + 1;
> +			dst = (uint16_t *)dst + 1;
> +		}
> +		if (n & 0x04) {
> +			*(uint32_t *)dst = *(const uint32_t *)src;
> +			src = (const uint32_t *)src + 1;
> +			dst = (uint32_t *)dst + 1;
> +		}
> +		if (n & 0x08) {
> +			*(uint64_t *)dst = *(const uint64_t *)src;
> +		}
> +		return ret;
>  	}
> 
> -	/*
> -	 * We split the remaining bytes (which will be less than 64) into
> -	 * 16byte (2^4) chunks, using the same switch structure as above.
> +	/**
> +	 * Fast way when copy size doesn't exceed 512 bytes
>  	 */
> -	switch (3 - (n >> 4)) {
> -	case 0x00:
> -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> -		n -= 16;
> -		dst = (uint8_t *)dst + 16;
> -		src = (const uint8_t *)src + 16;      /* fallthrough */
> -	case 0x01:
> -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> -		n -= 16;
> -		dst = (uint8_t *)dst + 16;
> -		src = (const uint8_t *)src + 16;      /* fallthrough */
> -	case 0x02:
> +	if (n <= 32) {
>  		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> -		n -= 16;
> -		dst = (uint8_t *)dst + 16;
> -		src = (const uint8_t *)src + 16;      /* fallthrough */
> -	default:
> -		;
> +		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> +		return ret;
>  	}
> -
> -	/* Copy any remaining bytes, without going beyond end of buffers */
> -	if (n != 0) {
> +	if (n <= 48) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> +		return ret;
> +	}
> +	if (n <= 64) {
> +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
>  		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> +		return ret;
> +	}
> +	if (n <= 128) {
> +		goto COPY_BLOCK_128_BACK15;
>  	}
> -	return ret;
> +	if (n <= 512) {
> +		if (n >= 256) {
> +			n -= 256;
> +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> +			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
> +			src = (const uint8_t *)src + 256;
> +			dst = (uint8_t *)dst + 256;
> +		}
> +COPY_BLOCK_255_BACK15:
> +		if (n >= 128) {
> +			n -= 128;
> +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 128;
> +			dst = (uint8_t *)dst + 128;
> +		}
> +COPY_BLOCK_128_BACK15:
> +		if (n >= 64) {
> +			n -= 64;
> +			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 64;
> +			dst = (uint8_t *)dst + 64;
> +		}
> +COPY_BLOCK_64_BACK15:
> +		if (n >= 32) {
> +			n -= 32;
> +			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +			src = (const uint8_t *)src + 32;
> +			dst = (uint8_t *)dst + 32;
> +		}
> +		if (n > 16) {
> +			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> +			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> +			return ret;
> +		}
> +		if (n > 0) {
> +			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> +		}
> +		return ret;
> +	}
> +
> +	/**
> +	 * Make store aligned when copy size exceeds 512 bytes,
> +	 * and make sure the first 15 bytes are copied, because
> +	 * unaligned copy functions require up to 15 bytes
> +	 * backwards access.
> +	 */
> +	dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
> +	n -= dstofss;
> +	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> +	src = (const uint8_t *)src + dstofss;
> +	dst = (uint8_t *)dst + dstofss;
> +	srcofs = (int)((long long)(const void *)src & 0x0F);
> +
> +	/**
> +	 * For aligned copy
> +	 */
> +	if (srcofs == 0) {
> +		/**
> +		 * Copy 256-byte blocks
> +		 */
> +		for (; n >= 256; n -= 256) {
> +			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> +			dst = (uint8_t *)dst + 256;
> +			src = (const uint8_t *)src + 256;
> +		}
> +
> +		/**
> +		 * Copy whatever left
> +		 */
> +		goto COPY_BLOCK_255_BACK15;
> +	}
> +
> +	/**
> +	 * For copy with unaligned load
> +	 */
> +	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> +
> +	/**
> +	 * Copy whatever left
> +	 */
> +	goto COPY_BLOCK_64_BACK15;
>  }
> 
> +#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
> +
>  #ifdef __cplusplus
>  }
>  #endif
> --
> 1.9.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-29 15:17   ` Ananyev, Konstantin
@ 2015-01-30  5:57     ` Wang, Zhihong
  2015-01-30 10:44       ` Ananyev, Konstantin
  0 siblings, 1 reply; 12+ messages in thread
From: Wang, Zhihong @ 2015-01-30  5:57 UTC (permalink / raw)
  To: Ananyev, Konstantin, dev

Hey Konstantin,

This method does reduce code size but leads to a significant performance drop.
I think we need to keep the original code.


Thanks
Zhihong (John)


> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Thursday, January 29, 2015 11:18 PM
> To: Wang, Zhihong; dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> Hi Zhihong,
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
> > Sent: Thursday, January 29, 2015 2:39 AM
> > To: dev@dpdk.org
> > Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> > arch/x86/rte_memcpy.h for both SSE and AVX platforms
> >
> > Main code changes:
> >
> > 1. Differentiate architectural features based on CPU flags
> >
> >     a. Implement separated move functions for SSE/AVX/AVX2 to make
> > full utilization of cache bandwidth
> >
> >     b. Implement separated copy flow specifically optimized for target
> > architecture
> >
> > 2. Rewrite the memcpy function "rte_memcpy"
> >
> >     a. Add store aligning
> >
> >     b. Add load aligning based on architectural features
> >
> >     c. Put block copy loop into inline move functions for better
> > control of instruction order
> >
> >     d. Eliminate unnecessary MOVs
> >
> > 3. Rewrite the inline move functions
> >
> >     a. Add move functions for unaligned load cases
> >
> >     b. Change instruction order in copy loops for better pipeline
> > utilization
> >
> >     c. Use intrinsics instead of assembly code
> >
> > 4. Remove slow glibc call for constant copies
> >
> > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> > ---
> >  .../common/include/arch/x86/rte_memcpy.h           | 680
> +++++++++++++++------
> >  1 file changed, 509 insertions(+), 171 deletions(-)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > index fb9eba8..7b2d382 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > @@ -34,166 +34,189 @@
> >  #ifndef _RTE_MEMCPY_X86_64_H_
> >  #define _RTE_MEMCPY_X86_64_H_
> >
> > +/**
> > + * @file
> > + *
> > + * Functions for SSE/AVX/AVX2 implementation of memcpy().
> > + */
> > +
> > +#include <stdio.h>
> >  #include <stdint.h>
> >  #include <string.h>
> > -#include <emmintrin.h>
> > +#include <x86intrin.h>
> >
> >  #ifdef __cplusplus
> >  extern "C" {
> >  #endif
> >
> > -#include "generic/rte_memcpy.h"
> > +/**
> > + * Copy bytes from one location to another. The locations must not
> overlap.
> > + *
> > + * @note This is implemented as a macro, so it's address should not
> > +be taken
> > + * and care is needed as parameter expressions may be evaluated
> multiple times.
> > + *
> > + * @param dst
> > + *   Pointer to the destination of the data.
> > + * @param src
> > + *   Pointer to the source data.
> > + * @param n
> > + *   Number of bytes to copy.
> > + * @return
> > + *   Pointer to the destination data.
> > + */
> > +static inline void *
> > +rte_memcpy(void *dst, const void *src, size_t n)
> > +__attribute__((always_inline));
> >
> > -#ifdef __INTEL_COMPILER
> > -#pragma warning(disable:593) /* Stop unused variable warning (reg_a
> > etc). */ -#endif
> > +#ifdef RTE_MACHINE_CPUFLAG_AVX2
> >
> > +/**
> > + * AVX2 implementation below
> > + */
> > +
> > +/**
> > + * Copy 16 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> >  static inline void
> >  rte_mov16(uint8_t *dst, const uint8_t *src)  {
> > -	__m128i reg_a;
> > -	asm volatile (
> > -		"movdqu (%[src]), %[reg_a]\n\t"
> > -		"movdqu %[reg_a], (%[dst])\n\t"
> > -		: [reg_a] "=x" (reg_a)
> > -		: [src] "r" (src),
> > -		  [dst] "r"(dst)
> > -		: "memory"
> > -	);
> > +	__m128i xmm0;
> > +
> > +	xmm0 = _mm_loadu_si128((const __m128i *)src);
> > +	_mm_storeu_si128((__m128i *)dst, xmm0);
> >  }
> >
> > +/**
> > + * Copy 32 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> >  static inline void
> >  rte_mov32(uint8_t *dst, const uint8_t *src)  {
> > -	__m128i reg_a, reg_b;
> > -	asm volatile (
> > -		"movdqu (%[src]), %[reg_a]\n\t"
> > -		"movdqu 16(%[src]), %[reg_b]\n\t"
> > -		"movdqu %[reg_a], (%[dst])\n\t"
> > -		"movdqu %[reg_b], 16(%[dst])\n\t"
> > -		: [reg_a] "=x" (reg_a),
> > -		  [reg_b] "=x" (reg_b)
> > -		: [src] "r" (src),
> > -		  [dst] "r"(dst)
> > -		: "memory"
> > -	);
> > -}
> > +	__m256i ymm0;
> >
> > -static inline void
> > -rte_mov48(uint8_t *dst, const uint8_t *src) -{
> > -	__m128i reg_a, reg_b, reg_c;
> > -	asm volatile (
> > -		"movdqu (%[src]), %[reg_a]\n\t"
> > -		"movdqu 16(%[src]), %[reg_b]\n\t"
> > -		"movdqu 32(%[src]), %[reg_c]\n\t"
> > -		"movdqu %[reg_a], (%[dst])\n\t"
> > -		"movdqu %[reg_b], 16(%[dst])\n\t"
> > -		"movdqu %[reg_c], 32(%[dst])\n\t"
> > -		: [reg_a] "=x" (reg_a),
> > -		  [reg_b] "=x" (reg_b),
> > -		  [reg_c] "=x" (reg_c)
> > -		: [src] "r" (src),
> > -		  [dst] "r"(dst)
> > -		: "memory"
> > -	);
> > +	ymm0 = _mm256_loadu_si256((const __m256i *)src);
> > +	_mm256_storeu_si256((__m256i *)dst, ymm0);
> >  }
> >
> > +/**
> > + * Copy 64 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> >  static inline void
> >  rte_mov64(uint8_t *dst, const uint8_t *src)  {
> > -	__m128i reg_a, reg_b, reg_c, reg_d;
> > -	asm volatile (
> > -		"movdqu (%[src]), %[reg_a]\n\t"
> > -		"movdqu 16(%[src]), %[reg_b]\n\t"
> > -		"movdqu 32(%[src]), %[reg_c]\n\t"
> > -		"movdqu 48(%[src]), %[reg_d]\n\t"
> > -		"movdqu %[reg_a], (%[dst])\n\t"
> > -		"movdqu %[reg_b], 16(%[dst])\n\t"
> > -		"movdqu %[reg_c], 32(%[dst])\n\t"
> > -		"movdqu %[reg_d], 48(%[dst])\n\t"
> > -		: [reg_a] "=x" (reg_a),
> > -		  [reg_b] "=x" (reg_b),
> > -		  [reg_c] "=x" (reg_c),
> > -		  [reg_d] "=x" (reg_d)
> > -		: [src] "r" (src),
> > -		  [dst] "r"(dst)
> > -		: "memory"
> > -	);
> > +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> >  }
> >
> > +/**
> > + * Copy 128 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> >  static inline void
> >  rte_mov128(uint8_t *dst, const uint8_t *src)  {
> > -	__m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h;
> > -	asm volatile (
> > -		"movdqu (%[src]), %[reg_a]\n\t"
> > -		"movdqu 16(%[src]), %[reg_b]\n\t"
> > -		"movdqu 32(%[src]), %[reg_c]\n\t"
> > -		"movdqu 48(%[src]), %[reg_d]\n\t"
> > -		"movdqu 64(%[src]), %[reg_e]\n\t"
> > -		"movdqu 80(%[src]), %[reg_f]\n\t"
> > -		"movdqu 96(%[src]), %[reg_g]\n\t"
> > -		"movdqu 112(%[src]), %[reg_h]\n\t"
> > -		"movdqu %[reg_a], (%[dst])\n\t"
> > -		"movdqu %[reg_b], 16(%[dst])\n\t"
> > -		"movdqu %[reg_c], 32(%[dst])\n\t"
> > -		"movdqu %[reg_d], 48(%[dst])\n\t"
> > -		"movdqu %[reg_e], 64(%[dst])\n\t"
> > -		"movdqu %[reg_f], 80(%[dst])\n\t"
> > -		"movdqu %[reg_g], 96(%[dst])\n\t"
> > -		"movdqu %[reg_h], 112(%[dst])\n\t"
> > -		: [reg_a] "=x" (reg_a),
> > -		  [reg_b] "=x" (reg_b),
> > -		  [reg_c] "=x" (reg_c),
> > -		  [reg_d] "=x" (reg_d),
> > -		  [reg_e] "=x" (reg_e),
> > -		  [reg_f] "=x" (reg_f),
> > -		  [reg_g] "=x" (reg_g),
> > -		  [reg_h] "=x" (reg_h)
> > -		: [src] "r" (src),
> > -		  [dst] "r"(dst)
> > -		: "memory"
> > -	);
> > +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > +	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> > +	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> >  }
> >
> > -#ifdef __INTEL_COMPILER
> > -#pragma warning(enable:593)
> > -#endif
> > -
> > +/**
> > + * Copy 256 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> >  static inline void
> >  rte_mov256(uint8_t *dst, const uint8_t *src)  {
> > -	rte_mov128(dst, src);
> > -	rte_mov128(dst + 128, src + 128);
> > +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > +	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> > +	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> > +	rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
> > +	rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
> > +	rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
> > +	rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
> >  }
> >
> > -#define rte_memcpy(dst, src, n)              \
> > -	({ (__builtin_constant_p(n)) ?       \
> > -	memcpy((dst), (src), (n)) :          \
> > -	rte_memcpy_func((dst), (src), (n)); })
> > +/**
> > + * Copy 64-byte blocks from one location to another,
> > + * locations should not overlap.
> > + */
> > +static inline void
> > +rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n) {
> > +	__m256i ymm0, ymm1;
> > +
> > +	while (n >= 64) {
> > +		ymm0 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 0 * 32));
> > +		n -= 64;
> > +		ymm1 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 1 * 32));
> > +		src = (const uint8_t *)src + 64;
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32),
> ymm0);
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32),
> ymm1);
> > +		dst = (uint8_t *)dst + 64;
> > +	}
> > +}
> > +
> > +/**
> > + * Copy 256-byte blocks from one location to another,
> > + * locations should not overlap.
> > + */
> > +static inline void
> > +rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n) {
> > +	__m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
> > +
> > +	while (n >= 256) {
> > +		ymm0 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 0 * 32));
> > +		n -= 256;
> > +		ymm1 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 1 * 32));
> > +		ymm2 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 2 * 32));
> > +		ymm3 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 3 * 32));
> > +		ymm4 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 4 * 32));
> > +		ymm5 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 5 * 32));
> > +		ymm6 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 6 * 32));
> > +		ymm7 = _mm256_loadu_si256((const __m256i *)((const
> uint8_t *)src + 7 * 32));
> > +		src = (const uint8_t *)src + 256;
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32),
> ymm0);
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32),
> ymm1);
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32),
> ymm2);
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32),
> ymm3);
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32),
> ymm4);
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32),
> ymm5);
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32),
> ymm6);
> > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32),
> ymm7);
> > +		dst = (uint8_t *)dst + 256;
> > +	}
> > +}
> >
> >  static inline void *
> > -rte_memcpy_func(void *dst, const void *src, size_t n)
> > +rte_memcpy(void *dst, const void *src, size_t n)
> >  {
> >  	void *ret = dst;
> > +	int dstofss;
> > +	int bits;
> >
> > -	/* We can't copy < 16 bytes using XMM registers so do it manually. */
> > +	/**
> > +	 * Copy less than 16 bytes
> > +	 */
> >  	if (n < 16) {
> >  		if (n & 0x01) {
> >  			*(uint8_t *)dst = *(const uint8_t *)src;
> > -			dst = (uint8_t *)dst + 1;
> >  			src = (const uint8_t *)src + 1;
> > +			dst = (uint8_t *)dst + 1;
> >  		}
> >  		if (n & 0x02) {
> >  			*(uint16_t *)dst = *(const uint16_t *)src;
> > -			dst = (uint16_t *)dst + 1;
> >  			src = (const uint16_t *)src + 1;
> > +			dst = (uint16_t *)dst + 1;
> >  		}
> >  		if (n & 0x04) {
> >  			*(uint32_t *)dst = *(const uint32_t *)src;
> > -			dst = (uint32_t *)dst + 1;
> >  			src = (const uint32_t *)src + 1;
> > +			dst = (uint32_t *)dst + 1;
> >  		}
> >  		if (n & 0x08) {
> >  			*(uint64_t *)dst = *(const uint64_t *)src; @@ -201,95
> +224,410 @@
> > rte_memcpy_func(void *dst, const void *src, size_t n)
> >  		return ret;
> >  	}
> >
> > -	/* Special fast cases for <= 128 bytes */
> > +	/**
> > +	 * Fast way when copy size doesn't exceed 512 bytes
> > +	 */
> >  	if (n <= 32) {
> >  		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> >  		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 +
> n);
> >  		return ret;
> >  	}
> > -
> >  	if (n <= 64) {
> >  		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> >  		rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 +
> n);
> >  		return ret;
> >  	}
> > -
> > -	if (n <= 128) {
> > -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > -		rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 +
> n);
> > +	if (n <= 512) {
> > +		if (n >= 256) {
> > +			n -= 256;
> > +			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> > +			src = (const uint8_t *)src + 256;
> > +			dst = (uint8_t *)dst + 256;
> > +		}
> > +		if (n >= 128) {
> > +			n -= 128;
> > +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> > +			src = (const uint8_t *)src + 128;
> > +			dst = (uint8_t *)dst + 128;
> > +		}
> > +		if (n >= 64) {
> > +			n -= 64;
> > +			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > +			src = (const uint8_t *)src + 64;
> > +			dst = (uint8_t *)dst + 64;
> > +		}
> > +COPY_BLOCK_64_BACK31:
> > +		if (n > 32) {
> > +			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > +			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t
> *)src - 32 + n);
> > +			return ret;
> > +		}
> > +		if (n > 0) {
> > +			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t
> *)src - 32 + n);
> > +		}
> >  		return ret;
> >  	}
> >
> > -	/*
> > -	 * For large copies > 128 bytes. This combination of 256, 64 and 16
> byte
> > -	 * copies was found to be faster than doing 128 and 32 byte copies as
> > -	 * well.
> > +	/**
> > +	 * Make store aligned when copy size exceeds 512 bytes
> >  	 */
> > -	for ( ; n >= 256; n -= 256) {
> > -		rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> > -		dst = (uint8_t *)dst + 256;
> > -		src = (const uint8_t *)src + 256;
> > +	dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
> > +	n -= dstofss;
> > +	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > +	src = (const uint8_t *)src + dstofss;
> > +	dst = (uint8_t *)dst + dstofss;
> > +
> > +	/**
> > +	 * Copy 256-byte blocks.
> > +	 * Use copy block function for better instruction order control,
> > +	 * which is important when load is unaligned.
> > +	 */
> > +	rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
> > +	bits = n;
> > +	n = n & 255;
> > +	bits -= n;
> > +	src = (const uint8_t *)src + bits;
> > +	dst = (uint8_t *)dst + bits;
> > +
> > +	/**
> > +	 * Copy 64-byte blocks.
> > +	 * Use copy block function for better instruction order control,
> > +	 * which is important when load is unaligned.
> > +	 */
> > +	if (n >= 64) {
> > +		rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
> > +		bits = n;
> > +		n = n & 63;
> > +		bits -= n;
> > +		src = (const uint8_t *)src + bits;
> > +		dst = (uint8_t *)dst + bits;
> >  	}
> >
> > -	/*
> > -	 * We split the remaining bytes (which will be less than 256) into
> > -	 * 64byte (2^6) chunks.
> > -	 * Using incrementing integers in the case labels of a switch
> statement
> > -	 * enourages the compiler to use a jump table. To get incrementing
> > -	 * integers, we shift the 2 relevant bits to the LSB position to first
> > -	 * get decrementing integers, and then subtract.
> > +	/**
> > +	 * Copy whatever left
> >  	 */
> > -	switch (3 - (n >> 6)) {
> > -	case 0x00:
> > -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > -		n -= 64;
> > -		dst = (uint8_t *)dst + 64;
> > -		src = (const uint8_t *)src + 64;      /* fallthrough */
> > -	case 0x01:
> > -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > -		n -= 64;
> > -		dst = (uint8_t *)dst + 64;
> > -		src = (const uint8_t *)src + 64;      /* fallthrough */
> > -	case 0x02:
> > -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > -		n -= 64;
> > -		dst = (uint8_t *)dst + 64;
> > -		src = (const uint8_t *)src + 64;      /* fallthrough */
> > -	default:
> > -		;
> > +	goto COPY_BLOCK_64_BACK31;
> > +}
> > +
> > +#else /* RTE_MACHINE_CPUFLAG_AVX2 */
> > +
> > +/**
> > + * SSE & AVX implementation below
> > + */
> > +
> > +/**
> > + * Copy 16 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> > +static inline void
> > +rte_mov16(uint8_t *dst, const uint8_t *src) {
> > +	__m128i xmm0;
> > +
> > +	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
> > +	_mm_storeu_si128((__m128i *)dst, xmm0); }
> > +
> > +/**
> > + * Copy 32 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> > +static inline void
> > +rte_mov32(uint8_t *dst, const uint8_t *src) {
> > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); }
> > +
> > +/**
> > + * Copy 64 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> > +static inline void
> > +rte_mov64(uint8_t *dst, const uint8_t *src) {
> > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); }
> > +
> > +/**
> > + * Copy 128 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> > +static inline void
> > +rte_mov128(uint8_t *dst, const uint8_t *src) {
> > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> > +	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> > +	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> > +	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> > +	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); }
> > +
> > +/**
> > + * Copy 256 bytes from one location to another,
> > + * locations should not overlap.
> > + */
> > +static inline void
> > +rte_mov256(uint8_t *dst, const uint8_t *src) {
> > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> > +	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> > +	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> > +	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> > +	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> > +	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
> > +	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
> > +	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
> > +	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
> > +	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
> > +	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
> > +	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
> > +	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
> > +}
> > +
> > +/**
> > + * Macro for copying unaligned block from one location to another
> > +with constant load offset,
> > + * 47 bytes leftover maximum,
> > + * locations should not overlap.
> > + * Requirements:
> > + * - Store is aligned
> > + * - Load offset is <offset>, which must be immediate value within
> > +[1, 15]
> > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit
> > +forwards are available for loading
> > + * - <dst>, <src>, <len> must be variables
> > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined  */
> > +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)
> \
> > +({                                                                                                          \
> > +    int tmp;                                                                                                \
> > +    while (len >= 128 + 16 - offset) {                                                                      \
> > +        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> offset + 0 * 16));                  \
> > +        len -= 128;                                                                             \
> > +        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));     \
> > +        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));     \
> > +        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));     \
> > +        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));     \
> > +        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));     \
> > +        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));     \
> > +        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));     \
> > +        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));     \
> > +        src = (const uint8_t *)src + 128;                                                       \
> > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
> > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
> > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
> > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
> > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
> > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
> > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
> > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
> > +        dst = (uint8_t *)dst + 128;                                                             \
> > +    }                                                                                           \
> > +    tmp = len;                                                                                  \
> > +    len = ((len - 16 + offset) & 127) + 16 - offset;                                            \
> > +    tmp -= len;                                                                                 \
> > +    src = (const uint8_t *)src + tmp;                                                           \
> > +    dst = (uint8_t *)dst + tmp;                                                                 \
> > +    if (len >= 32 + 16 - offset) {                                                              \
> > +        while (len >= 32 + 16 - offset) {                                                       \
> > +            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \
> > +            len -= 32;                                                                          \
> > +            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \
> > +            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \
> > +            src = (const uint8_t *)src + 32;                                                    \
> > +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
> > +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
> > +            dst = (uint8_t *)dst + 32;                                                          \
> > +        }                                                                                       \
> > +        tmp = len;                                                                              \
> > +        len = ((len - 16 + offset) & 31) + 16 - offset;                                         \
> > +        tmp -= len;                                                                             \
> > +        src = (const uint8_t *)src + tmp;                                                       \
> > +        dst = (uint8_t *)dst + tmp;                                                             \
> > +    }                                                                                           \
> > +})
> > +
> > +/**
> > + * Macro for copying unaligned block from one location to another,
> > + * 47 bytes leftover maximum,
> > + * locations should not overlap.
> > + * Use switch here because the aligning instruction requires immediate value for shift count.
> > + * Requirements:
> > + * - Store is aligned
> > + * - Load offset is <offset>, which must be within [1, 15]
> > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> > + * - <dst>, <src>, <len> must be variables
> > + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
> > + */
> > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
> > +({                                                                    \
> > +    switch (offset) {                                                 \
> > +    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
> > +    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
> > +    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
> > +    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
> > +    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
> > +    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
> > +    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
> > +    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
> > +    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
> > +    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
> > +    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
> > +    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
> > +    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
> > +    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
> > +    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
> > +    default:;                                                         \
> > +    }                                                                 \
> > +})
> 
> We probably didn't understand each other properly.
> My thought was to do something like:
> 
> #define MOVEUNALIGNED_128(offset)	do { \
>  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
>  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
>  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
>  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
>  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
>  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
>  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
>  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
> } while (0)
> 
> 
> Then at MOVEUNALIGNED_LEFT47_IMM:
> ....
> xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \
> xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \
> src = (const uint8_t *)src + 128;
> 
> switch (offset) {
>     case 0x01: MOVEUNALIGNED_128(0x01); break;
>     ...
>     case 0x0f: MOVEUNALIGNED_128(0x0f); break;
> }
> ...
> dst = (uint8_t *)dst + 128;
> 
> And then in rte_memcpy() you don't need a switch for
> MOVEUNALIGNED_LEFT47_IMM itself.
> I thought it might help to generate smaller code for rte_memcpy() while
> keeping performance reasonably high.
> 
> Konstantin
> 
> 
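For reference, written out in full, the restructuring described above would look roughly like the
sketch below. This is only an illustration, not code from the patch set: the name
MOVEUNALIGNED_128_LOOP is invented here, MOVEUNALIGNED_128 is the helper sketched above, the
16..47-byte leftover handling of the original MOVEUNALIGNED_LEFT47_IMM is left out, and <dst>,
<src>, <len>, <offset> and __m128i <xmm0> ~ <xmm8> are assumed to be the same pre-declared
variables the patch already requires.

#define MOVEUNALIGNED_128_LOOP(dst, src, len, offset)                                       \
({                                                                                          \
    while (len >= 128 + 16 - offset) {                                                      \
        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));  \
        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));  \
        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));  \
        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));  \
        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));  \
        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));  \
        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));  \
        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));  \
        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));  \
        len -= 128;                                                                         \
        src = (const uint8_t *)src + 128;                                                   \
        /* PALIGNR needs an immediate shift count, hence one case per offset */             \
        switch (offset) {                                                                   \
        case 0x01: MOVEUNALIGNED_128(0x01); break;                                          \
        case 0x02: MOVEUNALIGNED_128(0x02); break;                                          \
        case 0x03: MOVEUNALIGNED_128(0x03); break;                                          \
        case 0x04: MOVEUNALIGNED_128(0x04); break;                                          \
        case 0x05: MOVEUNALIGNED_128(0x05); break;                                          \
        case 0x06: MOVEUNALIGNED_128(0x06); break;                                          \
        case 0x07: MOVEUNALIGNED_128(0x07); break;                                          \
        case 0x08: MOVEUNALIGNED_128(0x08); break;                                          \
        case 0x09: MOVEUNALIGNED_128(0x09); break;                                          \
        case 0x0A: MOVEUNALIGNED_128(0x0A); break;                                          \
        case 0x0B: MOVEUNALIGNED_128(0x0B); break;                                          \
        case 0x0C: MOVEUNALIGNED_128(0x0C); break;                                          \
        case 0x0D: MOVEUNALIGNED_128(0x0D); break;                                          \
        case 0x0E: MOVEUNALIGNED_128(0x0E); break;                                          \
        case 0x0F: MOVEUNALIGNED_128(0x0F); break;                                          \
        default: break;                                                                     \
        }                                                                                   \
        dst = (uint8_t *)dst + 128;                                                         \
    }                                                                                       \
})

The trade-off is that the 15-way dispatch now runs once per 128 bytes copied instead of once per
rte_memcpy() call, which is presumably why the variant benchmarked in the follow-up mail came out
slower despite being smaller.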
> > +
> > +static inline void *
> > +rte_memcpy(void *dst, const void *src, size_t n) {
> > +	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> > +	void *ret = dst;
> > +	int dstofss;
> > +	int srcofs;
> > +
> > +	/**
> > +	 * Copy less than 16 bytes
> > +	 */
> > +	if (n < 16) {
> > +		if (n & 0x01) {
> > +			*(uint8_t *)dst = *(const uint8_t *)src;
> > +			src = (const uint8_t *)src + 1;
> > +			dst = (uint8_t *)dst + 1;
> > +		}
> > +		if (n & 0x02) {
> > +			*(uint16_t *)dst = *(const uint16_t *)src;
> > +			src = (const uint16_t *)src + 1;
> > +			dst = (uint16_t *)dst + 1;
> > +		}
> > +		if (n & 0x04) {
> > +			*(uint32_t *)dst = *(const uint32_t *)src;
> > +			src = (const uint32_t *)src + 1;
> > +			dst = (uint32_t *)dst + 1;
> > +		}
> > +		if (n & 0x08) {
> > +			*(uint64_t *)dst = *(const uint64_t *)src;
> > +		}
> > +		return ret;
> >  	}
> >
> > -	/*
> > -	 * We split the remaining bytes (which will be less than 64) into
> > -	 * 16byte (2^4) chunks, using the same switch structure as above.
> > +	/**
> > +	 * Fast way when copy size doesn't exceed 512 bytes
> >  	 */
> > -	switch (3 - (n >> 4)) {
> > -	case 0x00:
> > -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > -		n -= 16;
> > -		dst = (uint8_t *)dst + 16;
> > -		src = (const uint8_t *)src + 16;      /* fallthrough */
> > -	case 0x01:
> > -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > -		n -= 16;
> > -		dst = (uint8_t *)dst + 16;
> > -		src = (const uint8_t *)src + 16;      /* fallthrough */
> > -	case 0x02:
> > +	if (n <= 32) {
> >  		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > -		n -= 16;
> > -		dst = (uint8_t *)dst + 16;
> > -		src = (const uint8_t *)src + 16;      /* fallthrough */
> > -	default:
> > -		;
> > +		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > +		return ret;
> >  	}
> > -
> > -	/* Copy any remaining bytes, without going beyond end of buffers */
> > -	if (n != 0) {
> > +	if (n <= 48) {
> > +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > +		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > +		return ret;
> > +	}
> > +	if (n <= 64) {
> > +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > +		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
> >  		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > +		return ret;
> > +	}
> > +	if (n <= 128) {
> > +		goto COPY_BLOCK_128_BACK15;
> >  	}
> > -	return ret;
> > +	if (n <= 512) {
> > +		if (n >= 256) {
> > +			n -= 256;
> > +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> > +			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
> > +			src = (const uint8_t *)src + 256;
> > +			dst = (uint8_t *)dst + 256;
> > +		}
> > +COPY_BLOCK_255_BACK15:
> > +		if (n >= 128) {
> > +			n -= 128;
> > +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> > +			src = (const uint8_t *)src + 128;
> > +			dst = (uint8_t *)dst + 128;
> > +		}
> > +COPY_BLOCK_128_BACK15:
> > +		if (n >= 64) {
> > +			n -= 64;
> > +			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > +			src = (const uint8_t *)src + 64;
> > +			dst = (uint8_t *)dst + 64;
> > +		}
> > +COPY_BLOCK_64_BACK15:
> > +		if (n >= 32) {
> > +			n -= 32;
> > +			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > +			src = (const uint8_t *)src + 32;
> > +			dst = (uint8_t *)dst + 32;
> > +		}
> > +		if (n > 16) {
> > +			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > +			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > +			return ret;
> > +		}
> > +		if (n > 0) {
> > +			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > +		}
> > +		return ret;
> > +	}
> > +
> > +	/**
> > +	 * Make store aligned when copy size exceeds 512 bytes,
> > +	 * and make sure the first 15 bytes are copied, because
> > +	 * unaligned copy functions require up to 15 bytes
> > +	 * backwards access.
> > +	 */
> > +	dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
> > +	n -= dstofss;
> > +	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > +	src = (const uint8_t *)src + dstofss;
> > +	dst = (uint8_t *)dst + dstofss;
> > +	srcofs = (int)((long long)(const void *)src & 0x0F);
> > +
> > +	/**
> > +	 * For aligned copy
> > +	 */
> > +	if (srcofs == 0) {
> > +		/**
> > +		 * Copy 256-byte blocks
> > +		 */
> > +		for (; n >= 256; n -= 256) {
> > +			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> > +			dst = (uint8_t *)dst + 256;
> > +			src = (const uint8_t *)src + 256;
> > +		}
> > +
> > +		/**
> > +		 * Copy whatever left
> > +		 */
> > +		goto COPY_BLOCK_255_BACK15;
> > +	}
> > +
> > +	/**
> > +	 * For copy with unaligned load
> > +	 */
> > +	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> > +
> > +	/**
> > +	 * Copy whatever left
> > +	 */
> > +	goto COPY_BLOCK_64_BACK15;
> >  }
> >
> > +#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
> > +
> >  #ifdef __cplusplus
> >  }
> >  #endif
> > --
> > 1.9.3
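
As an aside on the alignment prologue in the SSE rte_memcpy() above: the dstofss/srcofs arithmetic
is easier to follow with a small standalone sketch. The program below is not DPDK code; it only
mirrors the same calculation (using uintptr_t instead of the (long long) cast) and checks the two
properties the unaligned-load loop relies on.

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint8_t dst_buf[64], src_buf[64];
	uint8_t *dst = dst_buf + 5;          /* arbitrary misaligned destination */
	const uint8_t *src = src_buf + 11;   /* arbitrary misaligned source */

	/* same formula as the patch: always advance by 17..32 bytes */
	int dstofss = 16 - (int)((uintptr_t)dst & 0x0F) + 16;
	dst += dstofss;                      /* the patch copies these bytes with rte_mov32() */
	src += dstofss;
	int srcofs = (int)((uintptr_t)src & 0x0F);

	/* 1) every following store is 16-byte aligned */
	assert(((uintptr_t)dst & 0x0F) == 0);
	/* 2) more than 15 bytes are already behind us, so the (src - offset)
	 *    loads in MOVEUNALIGNED_LEFT47 stay inside the source buffer */
	assert(dstofss >= 17 && dstofss <= 32);

	printf("dstofss=%d srcofs=%d\n", dstofss, srcofs);
	return 0;
}

srcofs is the value MOVEUNALIGNED_LEFT47() then dispatches on; when it happens to be zero,
rte_memcpy() stays on the fully aligned rte_mov256() path instead.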

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-30  5:57     ` Wang, Zhihong
@ 2015-01-30 10:44       ` Ananyev, Konstantin
  0 siblings, 0 replies; 12+ messages in thread
From: Ananyev, Konstantin @ 2015-01-30 10:44 UTC (permalink / raw)
  To: Wang, Zhihong, dev

Hey Zhihong,

> -----Original Message-----
> From: Wang, Zhihong
> Sent: Friday, January 30, 2015 5:57 AM
> To: Ananyev, Konstantin; dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> Hey Konstantin,
> 
> This method does reduce code size but leads to a significant performance drop.
> I think we need to keep the original code.

Sure, no point to make it slower.
Thanks for trying it anyway.
Konstantin

> 
> 
> Thanks
> Zhihong (John)
> 
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, January 29, 2015 11:18 PM
> > To: Wang, Zhihong; dev@dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> > arch/x86/rte_memcpy.h for both SSE and AVX platforms
> >
> > Hi Zhihong,
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
> > > Sent: Thursday, January 29, 2015 2:39 AM
> > > To: dev@dpdk.org
> > > Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> > > arch/x86/rte_memcpy.h for both SSE and AVX platforms
> > >
> > > Main code changes:
> > >
> > > 1. Differentiate architectural features based on CPU flags
> > >
> > >     a. Implement separated move functions for SSE/AVX/AVX2 to make
> > > full utilization of cache bandwidth
> > >
> > >     b. Implement separated copy flow specifically optimized for target
> > > architecture
> > >
> > > 2. Rewrite the memcpy function "rte_memcpy"
> > >
> > >     a. Add store aligning
> > >
> > >     b. Add load aligning based on architectural features
> > >
> > >     c. Put block copy loop into inline move functions for better
> > > control of instruction order
> > >
> > >     d. Eliminate unnecessary MOVs
> > >
> > > 3. Rewrite the inline move functions
> > >
> > >     a. Add move functions for unaligned load cases
> > >
> > >     b. Change instruction order in copy loops for better pipeline
> > > utilization
> > >
> > >     c. Use intrinsics instead of assembly code
> > >
> > > 4. Remove slow glibc call for constant copies
> > >
> > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> > > ---
> > >  .../common/include/arch/x86/rte_memcpy.h           | 680
> > +++++++++++++++------
> > >  1 file changed, 509 insertions(+), 171 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > > b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > > index fb9eba8..7b2d382 100644
> > > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > > @@ -34,166 +34,189 @@
> > >  #ifndef _RTE_MEMCPY_X86_64_H_
> > >  #define _RTE_MEMCPY_X86_64_H_
> > >
> > > +/**
> > > + * @file
> > > + *
> > > + * Functions for SSE/AVX/AVX2 implementation of memcpy().
> > > + */
> > > +
> > > +#include <stdio.h>
> > >  #include <stdint.h>
> > >  #include <string.h>
> > > -#include <emmintrin.h>
> > > +#include <x86intrin.h>
> > >
> > >  #ifdef __cplusplus
> > >  extern "C" {
> > >  #endif
> > >
> > > -#include "generic/rte_memcpy.h"
> > > +/**
> > > + * Copy bytes from one location to another. The locations must not
> > overlap.
> > > + *
> > > + * @note This is implemented as a macro, so it's address should not
> > > +be taken
> > > + * and care is needed as parameter expressions may be evaluated
> > multiple times.
> > > + *
> > > + * @param dst
> > > + *   Pointer to the destination of the data.
> > > + * @param src
> > > + *   Pointer to the source data.
> > > + * @param n
> > > + *   Number of bytes to copy.
> > > + * @return
> > > + *   Pointer to the destination data.
> > > + */
> > > +static inline void *
> > > +rte_memcpy(void *dst, const void *src, size_t n)
> > > +__attribute__((always_inline));
> > >
> > > -#ifdef __INTEL_COMPILER
> > > -#pragma warning(disable:593) /* Stop unused variable warning (reg_a
> > > etc). */ -#endif
> > > +#ifdef RTE_MACHINE_CPUFLAG_AVX2
> > >
> > > +/**
> > > + * AVX2 implementation below
> > > + */
> > > +
> > > +/**
> > > + * Copy 16 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > >  static inline void
> > >  rte_mov16(uint8_t *dst, const uint8_t *src)  {
> > > -	__m128i reg_a;
> > > -	asm volatile (
> > > -		"movdqu (%[src]), %[reg_a]\n\t"
> > > -		"movdqu %[reg_a], (%[dst])\n\t"
> > > -		: [reg_a] "=x" (reg_a)
> > > -		: [src] "r" (src),
> > > -		  [dst] "r"(dst)
> > > -		: "memory"
> > > -	);
> > > +	__m128i xmm0;
> > > +
> > > +	xmm0 = _mm_loadu_si128((const __m128i *)src);
> > > +	_mm_storeu_si128((__m128i *)dst, xmm0);
> > >  }
> > >
> > > +/**
> > > + * Copy 32 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > >  static inline void
> > >  rte_mov32(uint8_t *dst, const uint8_t *src)  {
> > > -	__m128i reg_a, reg_b;
> > > -	asm volatile (
> > > -		"movdqu (%[src]), %[reg_a]\n\t"
> > > -		"movdqu 16(%[src]), %[reg_b]\n\t"
> > > -		"movdqu %[reg_a], (%[dst])\n\t"
> > > -		"movdqu %[reg_b], 16(%[dst])\n\t"
> > > -		: [reg_a] "=x" (reg_a),
> > > -		  [reg_b] "=x" (reg_b)
> > > -		: [src] "r" (src),
> > > -		  [dst] "r"(dst)
> > > -		: "memory"
> > > -	);
> > > -}
> > > +	__m256i ymm0;
> > >
> > > -static inline void
> > > -rte_mov48(uint8_t *dst, const uint8_t *src) -{
> > > -	__m128i reg_a, reg_b, reg_c;
> > > -	asm volatile (
> > > -		"movdqu (%[src]), %[reg_a]\n\t"
> > > -		"movdqu 16(%[src]), %[reg_b]\n\t"
> > > -		"movdqu 32(%[src]), %[reg_c]\n\t"
> > > -		"movdqu %[reg_a], (%[dst])\n\t"
> > > -		"movdqu %[reg_b], 16(%[dst])\n\t"
> > > -		"movdqu %[reg_c], 32(%[dst])\n\t"
> > > -		: [reg_a] "=x" (reg_a),
> > > -		  [reg_b] "=x" (reg_b),
> > > -		  [reg_c] "=x" (reg_c)
> > > -		: [src] "r" (src),
> > > -		  [dst] "r"(dst)
> > > -		: "memory"
> > > -	);
> > > +	ymm0 = _mm256_loadu_si256((const __m256i *)src);
> > > +	_mm256_storeu_si256((__m256i *)dst, ymm0);
> > >  }
> > >
> > > +/**
> > > + * Copy 64 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > >  static inline void
> > >  rte_mov64(uint8_t *dst, const uint8_t *src)  {
> > > -	__m128i reg_a, reg_b, reg_c, reg_d;
> > > -	asm volatile (
> > > -		"movdqu (%[src]), %[reg_a]\n\t"
> > > -		"movdqu 16(%[src]), %[reg_b]\n\t"
> > > -		"movdqu 32(%[src]), %[reg_c]\n\t"
> > > -		"movdqu 48(%[src]), %[reg_d]\n\t"
> > > -		"movdqu %[reg_a], (%[dst])\n\t"
> > > -		"movdqu %[reg_b], 16(%[dst])\n\t"
> > > -		"movdqu %[reg_c], 32(%[dst])\n\t"
> > > -		"movdqu %[reg_d], 48(%[dst])\n\t"
> > > -		: [reg_a] "=x" (reg_a),
> > > -		  [reg_b] "=x" (reg_b),
> > > -		  [reg_c] "=x" (reg_c),
> > > -		  [reg_d] "=x" (reg_d)
> > > -		: [src] "r" (src),
> > > -		  [dst] "r"(dst)
> > > -		: "memory"
> > > -	);
> > > +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > > +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > >  }
> > >
> > > +/**
> > > + * Copy 128 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > >  static inline void
> > >  rte_mov128(uint8_t *dst, const uint8_t *src)  {
> > > -	__m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h;
> > > -	asm volatile (
> > > -		"movdqu (%[src]), %[reg_a]\n\t"
> > > -		"movdqu 16(%[src]), %[reg_b]\n\t"
> > > -		"movdqu 32(%[src]), %[reg_c]\n\t"
> > > -		"movdqu 48(%[src]), %[reg_d]\n\t"
> > > -		"movdqu 64(%[src]), %[reg_e]\n\t"
> > > -		"movdqu 80(%[src]), %[reg_f]\n\t"
> > > -		"movdqu 96(%[src]), %[reg_g]\n\t"
> > > -		"movdqu 112(%[src]), %[reg_h]\n\t"
> > > -		"movdqu %[reg_a], (%[dst])\n\t"
> > > -		"movdqu %[reg_b], 16(%[dst])\n\t"
> > > -		"movdqu %[reg_c], 32(%[dst])\n\t"
> > > -		"movdqu %[reg_d], 48(%[dst])\n\t"
> > > -		"movdqu %[reg_e], 64(%[dst])\n\t"
> > > -		"movdqu %[reg_f], 80(%[dst])\n\t"
> > > -		"movdqu %[reg_g], 96(%[dst])\n\t"
> > > -		"movdqu %[reg_h], 112(%[dst])\n\t"
> > > -		: [reg_a] "=x" (reg_a),
> > > -		  [reg_b] "=x" (reg_b),
> > > -		  [reg_c] "=x" (reg_c),
> > > -		  [reg_d] "=x" (reg_d),
> > > -		  [reg_e] "=x" (reg_e),
> > > -		  [reg_f] "=x" (reg_f),
> > > -		  [reg_g] "=x" (reg_g),
> > > -		  [reg_h] "=x" (reg_h)
> > > -		: [src] "r" (src),
> > > -		  [dst] "r"(dst)
> > > -		: "memory"
> > > -	);
> > > +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > > +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > > +	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> > > +	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> > >  }
> > >
> > > -#ifdef __INTEL_COMPILER
> > > -#pragma warning(enable:593)
> > > -#endif
> > > -
> > > +/**
> > > + * Copy 256 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > >  static inline void
> > >  rte_mov256(uint8_t *dst, const uint8_t *src)  {
> > > -	rte_mov128(dst, src);
> > > -	rte_mov128(dst + 128, src + 128);
> > > +	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > > +	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > > +	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> > > +	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> > > +	rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
> > > +	rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
> > > +	rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
> > > +	rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
> > >  }
> > >
> > > -#define rte_memcpy(dst, src, n)              \
> > > -	({ (__builtin_constant_p(n)) ?       \
> > > -	memcpy((dst), (src), (n)) :          \
> > > -	rte_memcpy_func((dst), (src), (n)); })
> > > +/**
> > > + * Copy 64-byte blocks from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n) {
> > > +	__m256i ymm0, ymm1;
> > > +
> > > +	while (n >= 64) {
> > > +		ymm0 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 0 * 32));
> > > +		n -= 64;
> > > +		ymm1 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 1 * 32));
> > > +		src = (const uint8_t *)src + 64;
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32),
> > ymm0);
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32),
> > ymm1);
> > > +		dst = (uint8_t *)dst + 64;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * Copy 256-byte blocks from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n) {
> > > +	__m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
> > > +
> > > +	while (n >= 256) {
> > > +		ymm0 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 0 * 32));
> > > +		n -= 256;
> > > +		ymm1 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 1 * 32));
> > > +		ymm2 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 2 * 32));
> > > +		ymm3 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 3 * 32));
> > > +		ymm4 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 4 * 32));
> > > +		ymm5 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 5 * 32));
> > > +		ymm6 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 6 * 32));
> > > +		ymm7 = _mm256_loadu_si256((const __m256i *)((const
> > uint8_t *)src + 7 * 32));
> > > +		src = (const uint8_t *)src + 256;
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32),
> > ymm0);
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32),
> > ymm1);
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32),
> > ymm2);
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32),
> > ymm3);
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32),
> > ymm4);
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32),
> > ymm5);
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32),
> > ymm6);
> > > +		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32),
> > ymm7);
> > > +		dst = (uint8_t *)dst + 256;
> > > +	}
> > > +}
> > >
> > >  static inline void *
> > > -rte_memcpy_func(void *dst, const void *src, size_t n)
> > > +rte_memcpy(void *dst, const void *src, size_t n)
> > >  {
> > >  	void *ret = dst;
> > > +	int dstofss;
> > > +	int bits;
> > >
> > > -	/* We can't copy < 16 bytes using XMM registers so do it manually. */
> > > +	/**
> > > +	 * Copy less than 16 bytes
> > > +	 */
> > >  	if (n < 16) {
> > >  		if (n & 0x01) {
> > >  			*(uint8_t *)dst = *(const uint8_t *)src;
> > > -			dst = (uint8_t *)dst + 1;
> > >  			src = (const uint8_t *)src + 1;
> > > +			dst = (uint8_t *)dst + 1;
> > >  		}
> > >  		if (n & 0x02) {
> > >  			*(uint16_t *)dst = *(const uint16_t *)src;
> > > -			dst = (uint16_t *)dst + 1;
> > >  			src = (const uint16_t *)src + 1;
> > > +			dst = (uint16_t *)dst + 1;
> > >  		}
> > >  		if (n & 0x04) {
> > >  			*(uint32_t *)dst = *(const uint32_t *)src;
> > > -			dst = (uint32_t *)dst + 1;
> > >  			src = (const uint32_t *)src + 1;
> > > +			dst = (uint32_t *)dst + 1;
> > >  		}
> > >  		if (n & 0x08) {
> > >  			*(uint64_t *)dst = *(const uint64_t *)src; @@ -201,95
> > +224,410 @@
> > > rte_memcpy_func(void *dst, const void *src, size_t n)
> > >  		return ret;
> > >  	}
> > >
> > > -	/* Special fast cases for <= 128 bytes */
> > > +	/**
> > > +	 * Fast way when copy size doesn't exceed 512 bytes
> > > +	 */
> > >  	if (n <= 32) {
> > >  		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > >  		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 +
> > n);
> > >  		return ret;
> > >  	}
> > > -
> > >  	if (n <= 64) {
> > >  		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > >  		rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 +
> > n);
> > >  		return ret;
> > >  	}
> > > -
> > > -	if (n <= 128) {
> > > -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > -		rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 +
> > n);
> > > +	if (n <= 512) {
> > > +		if (n >= 256) {
> > > +			n -= 256;
> > > +			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> > > +			src = (const uint8_t *)src + 256;
> > > +			dst = (uint8_t *)dst + 256;
> > > +		}
> > > +		if (n >= 128) {
> > > +			n -= 128;
> > > +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> > > +			src = (const uint8_t *)src + 128;
> > > +			dst = (uint8_t *)dst + 128;
> > > +		}
> > > +		if (n >= 64) {
> > > +			n -= 64;
> > > +			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > +			src = (const uint8_t *)src + 64;
> > > +			dst = (uint8_t *)dst + 64;
> > > +		}
> > > +COPY_BLOCK_64_BACK31:
> > > +		if (n > 32) {
> > > +			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > +			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t
> > *)src - 32 + n);
> > > +			return ret;
> > > +		}
> > > +		if (n > 0) {
> > > +			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t
> > *)src - 32 + n);
> > > +		}
> > >  		return ret;
> > >  	}
> > >
> > > -	/*
> > > -	 * For large copies > 128 bytes. This combination of 256, 64 and 16
> > byte
> > > -	 * copies was found to be faster than doing 128 and 32 byte copies as
> > > -	 * well.
> > > +	/**
> > > +	 * Make store aligned when copy size exceeds 512 bytes
> > >  	 */
> > > -	for ( ; n >= 256; n -= 256) {
> > > -		rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> > > -		dst = (uint8_t *)dst + 256;
> > > -		src = (const uint8_t *)src + 256;
> > > +	dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
> > > +	n -= dstofss;
> > > +	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > +	src = (const uint8_t *)src + dstofss;
> > > +	dst = (uint8_t *)dst + dstofss;
> > > +
> > > +	/**
> > > +	 * Copy 256-byte blocks.
> > > +	 * Use copy block function for better instruction order control,
> > > +	 * which is important when load is unaligned.
> > > +	 */
> > > +	rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
> > > +	bits = n;
> > > +	n = n & 255;
> > > +	bits -= n;
> > > +	src = (const uint8_t *)src + bits;
> > > +	dst = (uint8_t *)dst + bits;
> > > +
> > > +	/**
> > > +	 * Copy 64-byte blocks.
> > > +	 * Use copy block function for better instruction order control,
> > > +	 * which is important when load is unaligned.
> > > +	 */
> > > +	if (n >= 64) {
> > > +		rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
> > > +		bits = n;
> > > +		n = n & 63;
> > > +		bits -= n;
> > > +		src = (const uint8_t *)src + bits;
> > > +		dst = (uint8_t *)dst + bits;
> > >  	}
> > >
> > > -	/*
> > > -	 * We split the remaining bytes (which will be less than 256) into
> > > -	 * 64byte (2^6) chunks.
> > > -	 * Using incrementing integers in the case labels of a switch
> > statement
> > > -	 * enourages the compiler to use a jump table. To get incrementing
> > > -	 * integers, we shift the 2 relevant bits to the LSB position to first
> > > -	 * get decrementing integers, and then subtract.
> > > +	/**
> > > +	 * Copy whatever left
> > >  	 */
> > > -	switch (3 - (n >> 6)) {
> > > -	case 0x00:
> > > -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > -		n -= 64;
> > > -		dst = (uint8_t *)dst + 64;
> > > -		src = (const uint8_t *)src + 64;      /* fallthrough */
> > > -	case 0x01:
> > > -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > -		n -= 64;
> > > -		dst = (uint8_t *)dst + 64;
> > > -		src = (const uint8_t *)src + 64;      /* fallthrough */
> > > -	case 0x02:
> > > -		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > -		n -= 64;
> > > -		dst = (uint8_t *)dst + 64;
> > > -		src = (const uint8_t *)src + 64;      /* fallthrough */
> > > -	default:
> > > -		;
> > > +	goto COPY_BLOCK_64_BACK31;
> > > +}
> > > +
> > > +#else /* RTE_MACHINE_CPUFLAG_AVX2 */
> > > +
> > > +/**
> > > + * SSE & AVX implementation below
> > > + */
> > > +
> > > +/**
> > > + * Copy 16 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov16(uint8_t *dst, const uint8_t *src) {
> > > +	__m128i xmm0;
> > > +
> > > +	xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
> > > +	_mm_storeu_si128((__m128i *)dst, xmm0); }
> > > +
> > > +/**
> > > + * Copy 32 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov32(uint8_t *dst, const uint8_t *src) {
> > > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); }
> > > +
> > > +/**
> > > + * Copy 64 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov64(uint8_t *dst, const uint8_t *src) {
> > > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > > +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > > +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); }
> > > +
> > > +/**
> > > + * Copy 128 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov128(uint8_t *dst, const uint8_t *src) {
> > > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > > +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > > +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> > > +	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> > > +	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> > > +	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> > > +	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); }
> > > +
> > > +/**
> > > + * Copy 256 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov256(uint8_t *dst, const uint8_t *src) {
> > > +	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > > +	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > > +	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > > +	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> > > +	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> > > +	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> > > +	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> > > +	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> > > +	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
> > > +	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
> > > +	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
> > > +	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
> > > +	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
> > > +	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
> > > +	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
> > > +	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
> > > +}
> > > +
> > > +/**
> > > + * Macro for copying unaligned block from one location to another
> > > +with constant load offset,
> > > + * 47 bytes leftover maximum,
> > > + * locations should not overlap.
> > > + * Requirements:
> > > + * - Store is aligned
> > > + * - Load offset is <offset>, which must be immediate value within
> > > +[1, 15]
> > > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit
> > > +forwards are available for loading
> > > + * - <dst>, <src>, <len> must be variables
> > > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined  */
> > > +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset)
> > \
> > > +({                                                                                                          \
> > > +    int tmp;                                                                                                \
> > > +    while (len >= 128 + 16 - offset) {                                                                      \
> > > +        xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 0 * 16));                  \
> > > +        len -= 128;                                                                                         \
> > > +        xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 1 * 16));                  \
> > > +        xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 2 * 16));                  \
> > > +        xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 3 * 16));                  \
> > > +        xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 4 * 16));                  \
> > > +        xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 5 * 16));                  \
> > > +        xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 6 * 16));                  \
> > > +        xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 7 * 16));                  \
> > > +        xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 8 * 16));                  \
> > > +        src = (const uint8_t *)src + 128;                                                                   \
> > > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16),
> > _mm_alignr_epi8(xmm1, xmm0, offset));        \
> > > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16),
> > _mm_alignr_epi8(xmm2, xmm1, offset));        \
> > > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16),
> > _mm_alignr_epi8(xmm3, xmm2, offset));        \
> > > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16),
> > _mm_alignr_epi8(xmm4, xmm3, offset));        \
> > > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16),
> > _mm_alignr_epi8(xmm5, xmm4, offset));        \
> > > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16),
> > _mm_alignr_epi8(xmm6, xmm5, offset));        \
> > > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16),
> > _mm_alignr_epi8(xmm7, xmm6, offset));        \
> > > +        _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16),
> > _mm_alignr_epi8(xmm8, xmm7, offset));        \
> > > +        dst = (uint8_t *)dst + 128;                                                                         \
> > > +    }                                                                                                       \
> > > +    tmp = len;                                                                                              \
> > > +    len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
> > > +    tmp -= len;                                                                                             \
> > > +    src = (const uint8_t *)src + tmp;                                                                       \
> > > +    dst = (uint8_t *)dst + tmp;                                                                             \
> > > +    if (len >= 32 + 16 - offset) {                                                                          \
> > > +        while (len >= 32 + 16 - offset) {                                                                   \
> > > +            xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 0 * 16));              \
> > > +            len -= 32;                                                                                      \
> > > +            xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 1 * 16));              \
> > > +            xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src -
> > offset + 2 * 16));              \
> > > +            src = (const uint8_t *)src + 32;                                                                \
> > > +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16),
> > _mm_alignr_epi8(xmm1, xmm0, offset));    \
> > > +            _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16),
> > _mm_alignr_epi8(xmm2, xmm1, offset));    \
> > > +            dst = (uint8_t *)dst + 32;                                                                      \
> > > +        }                                                                                                   \
> > > +        tmp = len;                                                                                          \
> > > +        len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
> > > +        tmp -= len;                                                                                         \
> > > +        src = (const uint8_t *)src + tmp;                                                                   \
> > > +        dst = (uint8_t *)dst + tmp;                                                                         \
> > > +    }                                                                                                       \
> > > +})
> > > +
> > > +/**
> > > + * Macro for copying unaligned block from one location to another,
> > > + * 47 bytes leftover maximum,
> > > + * locations should not overlap.
> > > + * Use switch here because the aligning instruction requires immediate
> > value for shift count.
> > > + * Requirements:
> > > + * - Store is aligned
> > > + * - Load offset is <offset>, which must be within [1, 15]
> > > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit
> > > +forwards are available for loading
> > > + * - <dst>, <src>, <len> must be variables
> > > + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM
> > must be
> > > +pre-defined  */
> > > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                   \
> > > +({                                                                    \
> > > +    switch (offset) {                                                 \
> > > +    case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break;    \
> > > +    case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break;    \
> > > +    case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break;    \
> > > +    case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break;    \
> > > +    case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break;    \
> > > +    case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break;    \
> > > +    case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break;    \
> > > +    case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break;    \
> > > +    case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break;    \
> > > +    case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break;    \
> > > +    case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break;    \
> > > +    case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break;    \
> > > +    case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break;    \
> > > +    case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break;    \
> > > +    case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break;    \
> > > +    default:;                                                         \
> > > +    }                                                                 \
> > > +})
> >
> > We probably didn't understand each other properly.
> > My thought was to do something like:
> >
> > #define MOVEUNALIGNED_128(offset)	do { \
> >  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
> >  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
> >  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
> >  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
> >  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
> >  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
> >  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
> >  _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
> > } while (0)
> >
> >
> > Then at MOVEUNALIGNED_LEFT47_IMM:
> > ....
> > xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \
> > xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \
> > src = (const uint8_t *)src + 128;
> >
> > switch (offset) {
> >     case 0x01: MOVEUNALIGNED_128(0x01); break;
> >     ...
> >     case 0x0f: MOVEUNALIGNED_128(0x0f); break;
> > }
> > ...
> > dst = (uint8_t *)dst + 128;
> >
> > And then in rte_memcpy() you don't need a switch for
> > MOVEUNALIGNED_LEFT47_IMM itself.
> > I thought it might help to generate smaller code for rte_memcpy() while
> > keeping performance reasonably high.
> >
> > Konstantin
> >
> >
> > > +
> > > +static inline void *
> > > +rte_memcpy(void *dst, const void *src, size_t n) {
> > > +	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7,
> > xmm8;
> > > +	void *ret = dst;
> > > +	int dstofss;
> > > +	int srcofs;
> > > +
> > > +	/**
> > > +	 * Copy less than 16 bytes
> > > +	 */
> > > +	if (n < 16) {
> > > +		if (n & 0x01) {
> > > +			*(uint8_t *)dst = *(const uint8_t *)src;
> > > +			src = (const uint8_t *)src + 1;
> > > +			dst = (uint8_t *)dst + 1;
> > > +		}
> > > +		if (n & 0x02) {
> > > +			*(uint16_t *)dst = *(const uint16_t *)src;
> > > +			src = (const uint16_t *)src + 1;
> > > +			dst = (uint16_t *)dst + 1;
> > > +		}
> > > +		if (n & 0x04) {
> > > +			*(uint32_t *)dst = *(const uint32_t *)src;
> > > +			src = (const uint32_t *)src + 1;
> > > +			dst = (uint32_t *)dst + 1;
> > > +		}
> > > +		if (n & 0x08) {
> > > +			*(uint64_t *)dst = *(const uint64_t *)src;
> > > +		}
> > > +		return ret;
> > >  	}
> > >
> > > -	/*
> > > -	 * We split the remaining bytes (which will be less than 64) into
> > > -	 * 16byte (2^4) chunks, using the same switch structure as above.
> > > +	/**
> > > +	 * Fast way when copy size doesn't exceed 512 bytes
> > >  	 */
> > > -	switch (3 - (n >> 4)) {
> > > -	case 0x00:
> > > -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > -		n -= 16;
> > > -		dst = (uint8_t *)dst + 16;
> > > -		src = (const uint8_t *)src + 16;      /* fallthrough */
> > > -	case 0x01:
> > > -		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > -		n -= 16;
> > > -		dst = (uint8_t *)dst + 16;
> > > -		src = (const uint8_t *)src + 16;      /* fallthrough */
> > > -	case 0x02:
> > > +	if (n <= 32) {
> > >  		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > -		n -= 16;
> > > -		dst = (uint8_t *)dst + 16;
> > > -		src = (const uint8_t *)src + 16;      /* fallthrough */
> > > -	default:
> > > -		;
> > > +		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 +
> > n);
> > > +		return ret;
> > >  	}
> > > -
> > > -	/* Copy any remaining bytes, without going beyond end of buffers
> > */
> > > -	if (n != 0) {
> > > +	if (n <= 48) {
> > > +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > +		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 +
> > n);
> > > +		return ret;
> > > +	}
> > > +	if (n <= 64) {
> > > +		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > +		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
> > >  		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 +
> > n);
> > > +		return ret;
> > > +	}
> > > +	if (n <= 128) {
> > > +		goto COPY_BLOCK_128_BACK15;
> > >  	}
> > > -	return ret;
> > > +	if (n <= 512) {
> > > +		if (n >= 256) {
> > > +			n -= 256;
> > > +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> > > +			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src
> > + 128);
> > > +			src = (const uint8_t *)src + 256;
> > > +			dst = (uint8_t *)dst + 256;
> > > +		}
> > > +COPY_BLOCK_255_BACK15:
> > > +		if (n >= 128) {
> > > +			n -= 128;
> > > +			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> > > +			src = (const uint8_t *)src + 128;
> > > +			dst = (uint8_t *)dst + 128;
> > > +		}
> > > +COPY_BLOCK_128_BACK15:
> > > +		if (n >= 64) {
> > > +			n -= 64;
> > > +			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > +			src = (const uint8_t *)src + 64;
> > > +			dst = (uint8_t *)dst + 64;
> > > +		}
> > > +COPY_BLOCK_64_BACK15:
> > > +		if (n >= 32) {
> > > +			n -= 32;
> > > +			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > +			src = (const uint8_t *)src + 32;
> > > +			dst = (uint8_t *)dst + 32;
> > > +		}
> > > +		if (n > 16) {
> > > +			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > +			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t
> > *)src - 16 + n);
> > > +			return ret;
> > > +		}
> > > +		if (n > 0) {
> > > +			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t
> > *)src - 16 + n);
> > > +		}
> > > +		return ret;
> > > +	}
> > > +
> > > +	/**
> > > +	 * Make store aligned when copy size exceeds 512 bytes,
> > > +	 * and make sure the first 15 bytes are copied, because
> > > +	 * unaligned copy functions require up to 15 bytes
> > > +	 * backwards access.
> > > +	 */
> > > +	dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
> > > +	n -= dstofss;
> > > +	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > +	src = (const uint8_t *)src + dstofss;
> > > +	dst = (uint8_t *)dst + dstofss;
> > > +	srcofs = (int)((long long)(const void *)src & 0x0F);
> > > +
> > > +	/**
> > > +	 * For aligned copy
> > > +	 */
> > > +	if (srcofs == 0) {
> > > +		/**
> > > +		 * Copy 256-byte blocks
> > > +		 */
> > > +		for (; n >= 256; n -= 256) {
> > > +			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> > > +			dst = (uint8_t *)dst + 256;
> > > +			src = (const uint8_t *)src + 256;
> > > +		}
> > > +
> > > +		/**
> > > +		 * Copy whatever left
> > > +		 */
> > > +		goto COPY_BLOCK_255_BACK15;
> > > +	}
> > > +
> > > +	/**
> > > +	 * For copy with unaligned load
> > > +	 */
> > > +	MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> > > +
> > > +	/**
> > > +	 * Copy whatever left
> > > +	 */
> > > +	goto COPY_BLOCK_64_BACK15;
> > >  }
> > >
> > > +#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
> > > +
> > >  #ifdef __cplusplus
> > >  }
> > >  #endif
> > > --
> > > 1.9.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
  2015-01-29  2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
                   ` (4 preceding siblings ...)
  2015-01-29  6:16 ` [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Fu, JingguoX
@ 2015-02-10  3:06 ` Liang, Cunming
  2015-02-16 15:57 ` De Lara Guarch, Pablo
  6 siblings, 0 replies; 12+ messages in thread
From: Liang, Cunming @ 2015-02-10  3:06 UTC (permalink / raw)
  To: Wang, Zhihong, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
> Sent: Thursday, January 29, 2015 10:39 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
> 
> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test points.
> 
> Optimization techniques are summarized below:
> 
> 1. Utilize full cache bandwidth
> 
> 2. Enforce aligned stores
> 
> 3. Apply load address alignment based on architecture features
> 
> 4. Make load/store address available as early as possible
> 
> 5. General optimization techniques like inlining, branch reducing, prefetch
> pattern access
> 
> --------------
> Changes in v2:
> 
> 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast build
> 
> 2. Modified macro definition for better code readability & safety
> 
> Zhihong Wang (4):
>   app/test: Disabled VTA for memcpy test in app/test/Makefile
>   app/test: Removed unnecessary test cases in app/test/test_memcpy.c
>   app/test: Extended test coverage in app/test/test_memcpy_perf.c
>   lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
>     and AVX platforms
> 
>  app/test/Makefile                                  |   6 +
>  app/test/test_memcpy.c                             |  52 +-
>  app/test/test_memcpy_perf.c                        | 220 ++++---
>  .../common/include/arch/x86/rte_memcpy.h           | 680 +++++++++++++++-----
> -
>  4 files changed, 654 insertions(+), 304 deletions(-)
> 
> --
> 1.9.3 

Acked-by:  Cunming Liang <cunming.liang@intel.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
  2015-01-29  2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
                   ` (5 preceding siblings ...)
  2015-02-10  3:06 ` Liang, Cunming
@ 2015-02-16 15:57 ` De Lara Guarch, Pablo
  2015-02-25 10:46   ` Thomas Monjalon
  6 siblings, 1 reply; 12+ messages in thread
From: De Lara Guarch, Pablo @ 2015-02-16 15:57 UTC (permalink / raw)
  To: Wang, Zhihong, dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
> Sent: Thursday, January 29, 2015 2:39 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
> 
> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test
> points.
> 
> Optimization techniques are summarized below:
> 
> 1. Utilize full cache bandwidth
> 
> 2. Enforce aligned stores
> 
> 3. Apply load address alignment based on architecture features
> 
> 4. Make load/store address available as early as possible
> 
> 5. General optimization techniques like inlining, branch reducing, prefetch
> pattern access
> 
> --------------
> Changes in v2:
> 
> 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast
> build
> 
> 2. Modified macro definition for better code readability & safety
> 
> Zhihong Wang (4):
>   app/test: Disabled VTA for memcpy test in app/test/Makefile
>   app/test: Removed unnecessary test cases in app/test/test_memcpy.c
>   app/test: Extended test coverage in app/test/test_memcpy_perf.c
>   lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
>     and AVX platforms
> 
>  app/test/Makefile                                  |   6 +
>  app/test/test_memcpy.c                             |  52 +-
>  app/test/test_memcpy_perf.c                        | 220 ++++---
>  .../common/include/arch/x86/rte_memcpy.h           | 680
> +++++++++++++++------
>  4 files changed, 654 insertions(+), 304 deletions(-)
> 
> --
> 1.9.3

Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
  2015-02-16 15:57 ` De Lara Guarch, Pablo
@ 2015-02-25 10:46   ` Thomas Monjalon
  0 siblings, 0 replies; 12+ messages in thread
From: Thomas Monjalon @ 2015-02-25 10:46 UTC (permalink / raw)
  To: Wang, Zhihong, konstantin.ananyev; +Cc: dev

> > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> > It also extends memcpy test coverage with unaligned cases and more test
> > points.
> > 
> > Optimization techniques are summarized below:
> > 
> > 1. Utilize full cache bandwidth
> > 
> > 2. Enforce aligned stores
> > 
> > 3. Apply load address alignment based on architecture features
> > 
> > 4. Make load/store address available as early as possible
> > 
> > 5. General optimization techniques like inlining, branch reducing, prefetch
> > pattern access
> > 
> > --------------
> > Changes in v2:
> > 
> > 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast
> > build
> > 
> > 2. Modified macro definition for better code readability & safety
> > 
> > Zhihong Wang (4):
> >   app/test: Disabled VTA for memcpy test in app/test/Makefile
> >   app/test: Removed unnecessary test cases in app/test/test_memcpy.c
> >   app/test: Extended test coverage in app/test/test_memcpy_perf.c
> >   lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> >     and AVX platforms
> 
> Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>

Applied, thanks for the great work!

Note: we are still looking for a maintainer of x86 EAL.

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2015-02-25 10:47 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-29  2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang
2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c Zhihong Wang
2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c Zhihong Wang
2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
2015-01-29 15:17   ` Ananyev, Konstantin
2015-01-30  5:57     ` Wang, Zhihong
2015-01-30 10:44       ` Ananyev, Konstantin
2015-01-29  6:16 ` [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Fu, JingguoX
2015-02-10  3:06 ` Liang, Cunming
2015-02-16 15:57 ` De Lara Guarch, Pablo
2015-02-25 10:46   ` Thomas Monjalon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).