DPDK patches and discussions
* [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
@ 2015-01-19  1:53 zhihong.wang
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile zhihong.wang
                   ` (6 more replies)
  0 siblings, 7 replies; 48+ messages in thread
From: zhihong.wang @ 2015-01-19  1:53 UTC (permalink / raw)
  To: dev

This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
It also extends memcpy test coverage with unaligned cases and more test points.

Optimization techniques are summarized below:

1. Utilize full cache bandwidth

2. Enforce aligned stores

3. Apply load address alignment based on architecture features

4. Make load/store address available as early as possible

5. General optimization techniques such as inlining, branch reduction, and prefetch pattern access
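
As a minimal sketch of technique 2, enforcing aligned stores (the names here are illustrative, not the patch's actual code): copy one unaligned vector to cover the head, advance the destination to the next alignment boundary, then run the bulk loop so that every store hits an aligned address:

    #include <stdint.h>
    #include <string.h>

    /* copy32() stands in for any 32-byte vector move (e.g. rte_mov32) */
    static void copy32(uint8_t *dst, const uint8_t *src)
    {
            memcpy(dst, src, 32);
    }

    /* Assumes n > 64 so the head and tail copies stay inside the buffers */
    static void copy_aligned_stores(uint8_t *dst, const uint8_t *src, size_t n)
    {
            size_t head = 32 - ((uintptr_t)dst & 31); /* 1..32 bytes */

            copy32(dst, src);          /* unaligned head; overlaps the loop below */
            dst += head; src += head; n -= head;
            while (n >= 32) {          /* every store here is 32-byte aligned */
                    copy32(dst, src);
                    dst += 32; src += 32; n -= 32;
            }
            if (n > 0)                 /* unaligned tail: last 32 bytes, backwards */
                    copy32(dst + n - 32, src + n - 32);
    }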

Zhihong Wang (4):
  Disabled VTA for memcpy test in app/test/Makefile
  Removed unnecessary test cases in test_memcpy.c
  Extended test coverage in test_memcpy_perf.c
  Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
    platforms

 app/test/Makefile                                  |   6 +
 app/test/test_memcpy.c                             |  52 +-
 app/test/test_memcpy_perf.c                        | 238 +++++---
 .../common/include/arch/x86/rte_memcpy.h           | 664 +++++++++++++++------
 4 files changed, 656 insertions(+), 304 deletions(-)

-- 
1.9.3

* [dpdk-dev] [PATCH 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile
  2015-01-19  1:53 [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization zhihong.wang
@ 2015-01-19  1:53 ` zhihong.wang
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 2/4] app/test: Removed unnecessary test cases in test_memcpy.c zhihong.wang
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 48+ messages in thread
From: zhihong.wang @ 2015-01-19  1:53 UTC (permalink / raw)
  To: dev

VTA (variable tracking assignments) is for debugging only; it increases compile time and binary size, especially when there are a lot of inlines.
So disable it for the memcpy tests, which contain a lot of inline calls.
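
For reference, VTA is GCC's variable tracking assignments, which GCC enables by default when compiling with both optimization and -g. A rough way to see the per-file cost (include paths omitted; exact numbers will vary):

    time gcc -O3 -g -c test_memcpy.c -o /dev/null
    time gcc -O3 -g -fno-var-tracking-assignments -c test_memcpy.c -o /dev/null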

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 app/test/Makefile | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/app/test/Makefile b/app/test/Makefile
index 4311f96..94dbadf 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -143,6 +143,12 @@ CFLAGS_test_kni.o += -Wno-deprecated-declarations
 endif
 CFLAGS += -D_GNU_SOURCE
 
+# Disable VTA for memcpy test
+ifeq ($(CC), gcc)
+CFLAGS_test_memcpy.o += -fno-var-tracking-assignments
+CFLAGS_test_memcpy_perf.o += -fno-var-tracking-assignments
+endif
+
 # this application needs libraries first
 DEPDIRS-y += lib
 
-- 
1.9.3

* [dpdk-dev] [PATCH 2/4] app/test: Removed unnecessary test cases in test_memcpy.c
  2015-01-19  1:53 [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization zhihong.wang
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile zhihong.wang
@ 2015-01-19  1:53 ` zhihong.wang
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 3/4] app/test: Extended test coverage in test_memcpy_perf.c zhihong.wang
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 48+ messages in thread
From: zhihong.wang @ 2015-01-19  1:53 UTC (permalink / raw)
  To: dev

Removed unnecessary test cases for the base move functions, since "func_test" already covers them all.

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 app/test/test_memcpy.c | 52 +-------------------------------------------------
 1 file changed, 1 insertion(+), 51 deletions(-)

diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
index 56b8e1e..b2bb4e0 100644
--- a/app/test/test_memcpy.c
+++ b/app/test/test_memcpy.c
@@ -78,56 +78,9 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#define ALIGNMENT_UNIT          16
+#define ALIGNMENT_UNIT          32
 
 
-
-/* Structure with base memcpy func pointer, and number of bytes it copies */
-struct base_memcpy_func {
-	void (*func)(uint8_t *dst, const uint8_t *src);
-	unsigned size;
-};
-
-/* To create base_memcpy_func structure entries */
-#define BASE_FUNC(n) {rte_mov##n, n}
-
-/* Max number of bytes that can be copies with a "base" memcpy functions */
-#define MAX_BASE_FUNC_SIZE 256
-
-/*
- * Test the "base" memcpy functions, that a copy fixed number of bytes.
- */
-static int
-base_func_test(void)
-{
-	const struct base_memcpy_func base_memcpy_funcs[6] = {
-		BASE_FUNC(16),
-		BASE_FUNC(32),
-		BASE_FUNC(48),
-		BASE_FUNC(64),
-		BASE_FUNC(128),
-		BASE_FUNC(256),
-	};
-	unsigned i, j;
-	unsigned num_funcs = sizeof(base_memcpy_funcs) / sizeof(base_memcpy_funcs[0]);
-	uint8_t dst[MAX_BASE_FUNC_SIZE];
-	uint8_t src[MAX_BASE_FUNC_SIZE];
-
-	for (i = 0; i < num_funcs; i++) {
-		unsigned size = base_memcpy_funcs[i].size;
-		for (j = 0; j < size; j++) {
-			dst[j] = 0;
-			src[j] = (uint8_t) rte_rand();
-		}
-		base_memcpy_funcs[i].func(dst, src);
-		for (j = 0; j < size; j++)
-			if (dst[j] != src[j])
-				return -1;
-	}
-
-	return 0;
-}
-
 /*
  * Create two buffers, and initialise one with random values. These are copied
  * to the second buffer and then compared to see if the copy was successful.
@@ -218,9 +171,6 @@ test_memcpy(void)
 	ret = func_test();
 	if (ret != 0)
 		return -1;
-	ret = base_func_test();
-	if (ret != 0)
-		return -1;
 	return 0;
 }
 
-- 
1.9.3

* [dpdk-dev] [PATCH 3/4] app/test: Extended test coverage in test_memcpy_perf.c
  2015-01-19  1:53 [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization zhihong.wang
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile zhihong.wang
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 2/4] app/test: Removed unnecessary test cases in test_memcpy.c zhihong.wang
@ 2015-01-19  1:53 ` zhihong.wang
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms zhihong.wang
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 48+ messages in thread
From: zhihong.wang @ 2015-01-19  1:53 UTC (permalink / raw)
  To: dev

Main code changes:

1. Added more typical data points for a thorough performance test

2. Added unaligned test cases, since unaligned copies are common in DPDK usage
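
As a hypothetical illustration of why unaligned copies are common in packet processing (this is not code from the patch): even when a frame buffer itself is cache-line aligned, the payload begins 14 bytes in, right after the Ethernet header, so the copy source is never 16-byte aligned:

    #include <stdint.h>
    #include <string.h>

    #define ETHER_HDR_LEN 14 /* Ethernet header length, no VLAN tag */

    /* frame is typically aligned; frame + 14 is not */
    static void copy_payload(uint8_t *dst, const uint8_t *frame, size_t plen)
    {
            memcpy(dst, frame + ETHER_HDR_LEN, plen); /* unaligned source */
    }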

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 app/test/test_memcpy_perf.c | 238 +++++++++++++++++++++++++++++---------------
 1 file changed, 156 insertions(+), 82 deletions(-)

diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 7809610..4875af1 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -54,9 +54,10 @@
 /* List of buffer sizes to test */
 #if TEST_VALUE_RANGE == 0
 static size_t buf_sizes[] = {
-	0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255,
-	256, 257, 320, 384, 511, 512, 513, 1023, 1024, 1025, 1518, 1522, 1600,
-	2048, 3072, 4096, 5120, 6144, 7168, 8192
+	1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
+	129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
+	449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
+	2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
 };
 /* MUST be as large as largest packet size above */
 #define SMALL_BUFFER_SIZE       8192
@@ -78,7 +79,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
 #define TEST_BATCH_SIZE         100
 
 /* Data is aligned on this many bytes (power of 2) */
-#define ALIGNMENT_UNIT          16
+#define ALIGNMENT_UNIT          32
 
 /*
  * Pointers used in performance tests. The two large buffers are for uncached
@@ -94,19 +95,19 @@ init_buffers(void)
 {
 	unsigned i;
 
-	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE, ALIGNMENT_UNIT);
+	large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (large_buf_read == NULL)
 		goto error_large_buf_read;
 
-	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE, ALIGNMENT_UNIT);
+	large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (large_buf_write == NULL)
 		goto error_large_buf_write;
 
-	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE, ALIGNMENT_UNIT);
+	small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (small_buf_read == NULL)
 		goto error_small_buf_read;
 
-	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE, ALIGNMENT_UNIT);
+	small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
 	if (small_buf_write == NULL)
 		goto error_small_buf_write;
 
@@ -140,25 +141,25 @@ free_buffers(void)
 
 /*
  * Get a random offset into large array, with enough space needed to perform
- * max copy size. Offset is aligned.
+ * max copy size. Offset is aligned; uoffset is added on top of it to set the misalignment.
  */
 static inline size_t
-get_rand_offset(void)
+get_rand_offset(size_t uoffset)
 {
-	return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
-	                ~(ALIGNMENT_UNIT - 1));
+	return (((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
+			~(ALIGNMENT_UNIT - 1)) + uoffset);
 }
 
 /* Fill in source and destination addresses. */
 static inline void
-fill_addr_arrays(size_t *dst_addr, int is_dst_cached,
-		size_t *src_addr, int is_src_cached)
+fill_addr_arrays(size_t *dst_addr, int is_dst_cached, size_t dst_uoffset,
+				 size_t *src_addr, int is_src_cached, size_t src_uoffset)
 {
 	unsigned int i;
 
 	for (i = 0; i < TEST_BATCH_SIZE; i++) {
-		dst_addr[i] = (is_dst_cached) ? 0 : get_rand_offset();
-		src_addr[i] = (is_src_cached) ? 0 : get_rand_offset();
+		dst_addr[i] = (is_dst_cached) ? dst_uoffset : get_rand_offset(dst_uoffset);
+		src_addr[i] = (is_src_cached) ? src_uoffset : get_rand_offset(src_uoffset);
 	}
 }
 
@@ -169,16 +170,17 @@ fill_addr_arrays(size_t *dst_addr, int is_dst_cached,
  */
 static void
 do_uncached_write(uint8_t *dst, int is_dst_cached,
-		const uint8_t *src, int is_src_cached, size_t size)
+				  const uint8_t *src, int is_src_cached, size_t size)
 {
 	unsigned i, j;
 	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
 
 	for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
-		fill_addr_arrays(dst_addrs, is_dst_cached,
-			 src_addrs, is_src_cached);
-		for (j = 0; j < TEST_BATCH_SIZE; j++)
+		fill_addr_arrays(dst_addrs, is_dst_cached, 0,
+						 src_addrs, is_src_cached, 0);
+		for (j = 0; j < TEST_BATCH_SIZE; j++) {
 			rte_memcpy(dst+dst_addrs[j], src+src_addrs[j], size);
+		}
 	}
 }
 
@@ -186,52 +188,129 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
  * Run a single memcpy performance test. This is a macro to ensure that if
  * the "size" parameter is a constant it won't be converted to a variable.
  */
-#define SINGLE_PERF_TEST(dst, is_dst_cached, src, is_src_cached, size) do {   \
-	unsigned int iter, t;                                                 \
-	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];        \
-	uint64_t start_time, total_time = 0;                                  \
-	uint64_t total_time2 = 0;                                             \
-	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {  \
-		fill_addr_arrays(dst_addrs, is_dst_cached,                    \
-		                 src_addrs, is_src_cached);                   \
-		start_time = rte_rdtsc();                                     \
-		for (t = 0; t < TEST_BATCH_SIZE; t++)                         \
-			rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \
-		total_time += rte_rdtsc() - start_time;                       \
-	}                                                                     \
-	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {  \
-		fill_addr_arrays(dst_addrs, is_dst_cached,                    \
-		                 src_addrs, is_src_cached);                   \
-		start_time = rte_rdtsc();                                     \
-		for (t = 0; t < TEST_BATCH_SIZE; t++)                         \
-			memcpy(dst+dst_addrs[t], src+src_addrs[t], size);     \
-		total_time2 += rte_rdtsc() - start_time;                      \
-	}                                                                     \
-	printf("%8.0f -",  (double)total_time /TEST_ITERATIONS);              \
-	printf("%5.0f",  (double)total_time2 / TEST_ITERATIONS);              \
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset,                   \
+						 src, is_src_cached, src_uoffset, size)             \
+do {                                                                        \
+	unsigned int iter, t;                                                   \
+	size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];          \
+	uint64_t start_time, total_time = 0;                                    \
+	uint64_t total_time2 = 0;                                               \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
+						 src_addrs, is_src_cached, src_uoffset);            \
+		start_time = rte_rdtsc();                                           \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
+			rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size);           \
+		total_time += rte_rdtsc() - start_time;                             \
+	}                                                                       \
+	for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) {    \
+		fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset,             \
+						 src_addrs, is_src_cached, src_uoffset);            \
+		start_time = rte_rdtsc();                                           \
+		for (t = 0; t < TEST_BATCH_SIZE; t++)                               \
+			memcpy(dst+dst_addrs[t], src+src_addrs[t], size);               \
+		total_time2 += rte_rdtsc() - start_time;                            \
+	}                                                                       \
+	printf("%8.0f -",  (double)total_time /TEST_ITERATIONS);                \
+	printf("%5.0f",  (double)total_time2 / TEST_ITERATIONS);                \
 } while (0)
 
-/* Run memcpy() tests for each cached/uncached permutation. */
-#define ALL_PERF_TESTS_FOR_SIZE(n) do {                             \
-	if (__builtin_constant_p(n))                                \
-		printf("\nC%6u", (unsigned)n);                      \
-	else                                                        \
-		printf("\n%7u", (unsigned)n);                       \
-	SINGLE_PERF_TEST(small_buf_write, 1, small_buf_read, 1, n); \
-	SINGLE_PERF_TEST(large_buf_write, 0, small_buf_read, 1, n); \
-	SINGLE_PERF_TEST(small_buf_write, 1, large_buf_read, 0, n); \
-	SINGLE_PERF_TEST(large_buf_write, 0, large_buf_read, 0, n); \
+/* Run aligned memcpy tests for each cached/uncached permutation */
+#define ALL_PERF_TESTS_FOR_SIZE(n)                                       \
+do {                                                                     \
+	if (__builtin_constant_p(n))                                         \
+		printf("\nC%6u", (unsigned)n);                                   \
+	else                                                                 \
+		printf("\n%7u", (unsigned)n);                                    \
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n);    \
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n);    \
+	SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n);    \
+	SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n);    \
 } while (0)
 
-/*
- * Run performance tests for a number of different sizes and cached/uncached
- * permutations.
- */
+/* Run unaligned memcpy tests for each cached/uncached permutation */
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n)                             \
+do {                                                                     \
+	if (__builtin_constant_p(n))                                         \
+		printf("\nC%6u", (unsigned)n);                                   \
+	else                                                                 \
+		printf("\n%7u", (unsigned)n);                                    \
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n);    \
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n);    \
+	SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n);    \
+	SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n);    \
+} while (0)
+
+/* Run memcpy tests for constant length */
+#define ALL_PERF_TEST_FOR_CONSTANT                                       \
+{                                                                        \
+	TEST_CONSTANT(1U); TEST_CONSTANT(2U); TEST_CONSTANT(3U);             \
+	TEST_CONSTANT(4U); TEST_CONSTANT(5U); TEST_CONSTANT(6U);             \
+	TEST_CONSTANT(7U); TEST_CONSTANT(8U); TEST_CONSTANT(9U);             \
+	TEST_CONSTANT(12U); TEST_CONSTANT(15U); TEST_CONSTANT(16U);          \
+	TEST_CONSTANT(17U); TEST_CONSTANT(31U); TEST_CONSTANT(32U);          \
+	TEST_CONSTANT(33U); TEST_CONSTANT(63U); TEST_CONSTANT(64U);          \
+	TEST_CONSTANT(65U); TEST_CONSTANT(127U); TEST_CONSTANT(128U);        \
+	TEST_CONSTANT(129U); TEST_CONSTANT(191U); TEST_CONSTANT(192U);       \
+	TEST_CONSTANT(193U); TEST_CONSTANT(255U); TEST_CONSTANT(256U);       \
+	TEST_CONSTANT(257U); TEST_CONSTANT(319U); TEST_CONSTANT(320U);       \
+	TEST_CONSTANT(321U); TEST_CONSTANT(383U); TEST_CONSTANT(384U);       \
+	TEST_CONSTANT(385U); TEST_CONSTANT(447U); TEST_CONSTANT(448U);       \
+	TEST_CONSTANT(449U); TEST_CONSTANT(511U); TEST_CONSTANT(512U);       \
+	TEST_CONSTANT(513U); TEST_CONSTANT(767U); TEST_CONSTANT(768U);       \
+	TEST_CONSTANT(769U); TEST_CONSTANT(1023U); TEST_CONSTANT(1024U);     \
+	TEST_CONSTANT(1025U); TEST_CONSTANT(1518U); TEST_CONSTANT(1522U);    \
+	TEST_CONSTANT(1536U); TEST_CONSTANT(1600U); TEST_CONSTANT(2048U);    \
+	TEST_CONSTANT(2560U); TEST_CONSTANT(3072U); TEST_CONSTANT(3584U);    \
+	TEST_CONSTANT(4096U); TEST_CONSTANT(4608U); TEST_CONSTANT(5120U);    \
+	TEST_CONSTANT(5632U); TEST_CONSTANT(6144U); TEST_CONSTANT(6656U);    \
+	TEST_CONSTANT(7168U); TEST_CONSTANT(7680U); TEST_CONSTANT(8192U);    \
+}
+
+/* Run all memcpy tests for aligned constant cases */
+static inline void
+perf_test_constant_aligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memcpy tests for unaligned constant cases */
+static inline void
+perf_test_constant_unaligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE_UNALIGNED
+	ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memcpy tests for aligned variable cases */
+static inline void
+perf_test_variable_aligned(void)
+{
+	unsigned n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned i;
+	for (i = 0; i < n; i++) {
+		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
+	}
+}
+
+/* Run all memcpy tests for unaligned variable cases */
+static inline void
+perf_test_variable_unaligned(void)
+{
+	unsigned n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+	unsigned i;
+	for (i = 0; i < n; i++) {
+		ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
+	}
+}
+
+/* Run all memcpy tests */
 static int
 perf_test(void)
 {
-	const unsigned num_buf_sizes = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
-	unsigned i;
 	int ret;
 
 	ret = init_buffers();
@@ -239,7 +318,8 @@ perf_test(void)
 		return ret;
 
 #if TEST_VALUE_RANGE != 0
-	/* Setup buf_sizes array, if required */
+	/* Set up buf_sizes array, if required */
+	unsigned i;
 	for (i = 0; i < TEST_VALUE_RANGE; i++)
 		buf_sizes[i] = i;
 #endif
@@ -248,28 +328,23 @@ perf_test(void)
 	do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
 
 	printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
-	       "======= ============== ============== ============== ==============\n"
-	       "   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem\n"
-	       "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
-	       "------- -------------- -------------- -------------- --------------");
-
-	/* Do tests where size is a variable */
-	for (i = 0; i < num_buf_sizes; i++) {
-		ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
-	}
+		   "======= ============== ============== ============== ==============\n"
+		   "   Size Cache to cache   Cache to mem   Mem to cache     Mem to mem\n"
+		   "(bytes)        (ticks)        (ticks)        (ticks)        (ticks)\n"
+		   "------- -------------- -------------- -------------- --------------");
+
+	printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT);
+	/* Do aligned tests where size is a variable */
+	perf_test_variable_aligned();
 	printf("\n------- -------------- -------------- -------------- --------------");
-	/* Do tests where size is a compile-time constant */
-	ALL_PERF_TESTS_FOR_SIZE(63U);
-	ALL_PERF_TESTS_FOR_SIZE(64U);
-	ALL_PERF_TESTS_FOR_SIZE(65U);
-	ALL_PERF_TESTS_FOR_SIZE(255U);
-	ALL_PERF_TESTS_FOR_SIZE(256U);
-	ALL_PERF_TESTS_FOR_SIZE(257U);
-	ALL_PERF_TESTS_FOR_SIZE(1023U);
-	ALL_PERF_TESTS_FOR_SIZE(1024U);
-	ALL_PERF_TESTS_FOR_SIZE(1025U);
-	ALL_PERF_TESTS_FOR_SIZE(1518U);
-
+	/* Do aligned tests where size is a compile-time constant */
+	perf_test_constant_aligned();
+	printf("\n=========================== Unaligned =============================");
+	/* Do unaligned tests where size is a variable */
+	perf_test_variable_unaligned();
+	printf("\n------- -------------- -------------- -------------- --------------");
+	/* Do unaligned tests where size is a compile-time constant */
+	perf_test_constant_unaligned();
 	printf("\n======= ============== ============== ============== ==============\n\n");
 
 	free_buffers();
@@ -277,7 +352,6 @@ perf_test(void)
 	return 0;
 }
 
-
 static int
 test_memcpy_perf(void)
 {
-- 
1.9.3

* [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-19  1:53 [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization zhihong.wang
                   ` (2 preceding siblings ...)
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 3/4] app/test: Extended test coverage in test_memcpy_perf.c zhihong.wang
@ 2015-01-19  1:53 ` zhihong.wang
  2015-01-20 17:15   ` Stephen Hemminger
  2015-01-26 14:43   ` Wodkowski, PawelX
  2015-01-19 13:02 ` [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization Neil Horman
                   ` (2 subsequent siblings)
  6 siblings, 2 replies; 48+ messages in thread
From: zhihong.wang @ 2015-01-19  1:53 UTC (permalink / raw)
  To: dev

Main code changes:

1. Differentiate architectural features based on CPU flags

    a. Implement separate move functions for SSE/AVX/AVX2 to make full use of cache bandwidth

    b. Implement separated copy flow specifically optimized for target architecture

2. Rewrite the memcpy function "rte_memcpy"

    a. Add store alignment

    b. Add load alignment based on architectural features

    c. Put block copy loop into inline move functions for better control of instruction order

    d. Eliminate unnecessary MOVs

3. Rewrite the inline move functions

    a. Add move functions for unaligned load cases

    b. Change instruction order in copy loops for better pipeline utilization

    c. Use intrinsics instead of assembly code

4. Remove slow glibc call for constant copies
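
To illustrate 2b and 3a above: when the source is misaligned, the copy loop uses PALIGNR to stitch two aligned 16-byte loads into the 16 bytes at an arbitrary offset. A stripped-down sketch with a fixed offset of 5 (the names and the constant are illustrative; the real code is the MOVEUNALIGNED_LEFT47 macro in the diff below, which needs a switch because the PALIGNR shift count must be an immediate):

    #include <stdint.h>
    #include <tmmintrin.h> /* SSSE3: _mm_alignr_epi8 (PALIGNR) */

    /*
     * Return the 16 bytes at p + 5 using only aligned loads.
     * Requires p to be 16-byte aligned and p[0..31] readable.
     * Build with -mssse3 or higher.
     */
    static __m128i load16_off5(const uint8_t *p)
    {
            __m128i lo = _mm_load_si128((const __m128i *)p);        /* p[0..15]  */
            __m128i hi = _mm_load_si128((const __m128i *)(p + 16)); /* p[16..31] */
            /* Concatenate hi:lo, shift right by 5 bytes -> p[5..20] */
            return _mm_alignr_epi8(hi, lo, 5);
    }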

Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
 .../common/include/arch/x86/rte_memcpy.h           | 664 +++++++++++++++------
 1 file changed, 493 insertions(+), 171 deletions(-)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index fb9eba8..69a5c6f 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -34,166 +34,189 @@
 #ifndef _RTE_MEMCPY_X86_64_H_
 #define _RTE_MEMCPY_X86_64_H_
 
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ */
+
+#include <stdio.h>
 #include <stdint.h>
 #include <string.h>
-#include <emmintrin.h>
+#include <x86intrin.h>
 
 #ifdef __cplusplus
 extern "C" {
 #endif
 
-#include "generic/rte_memcpy.h"
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so its address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ *   Pointer to the destination of the data.
+ * @param src
+ *   Pointer to the source data.
+ * @param n
+ *   Number of bytes to copy.
+ * @return
+ *   Pointer to the destination data.
+ */
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
 
-#ifdef __INTEL_COMPILER
-#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */
-#endif
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
 
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov16(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		: [reg_a] "=x" (reg_a)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
 }
 
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov32(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a, reg_b;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
-}
+	__m256i ymm0;
 
-static inline void
-rte_mov48(uint8_t *dst, const uint8_t *src)
-{
-	__m128i reg_a, reg_b, reg_c;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu 32(%[src]), %[reg_c]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		"movdqu %[reg_c], 32(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b),
-		  [reg_c] "=x" (reg_c)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	ymm0 = _mm256_loadu_si256((const __m256i *)src);
+	_mm256_storeu_si256((__m256i *)dst, ymm0);
 }
 
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov64(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a, reg_b, reg_c, reg_d;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu 32(%[src]), %[reg_c]\n\t"
-		"movdqu 48(%[src]), %[reg_d]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		"movdqu %[reg_c], 32(%[dst])\n\t"
-		"movdqu %[reg_d], 48(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b),
-		  [reg_c] "=x" (reg_c),
-		  [reg_d] "=x" (reg_d)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
 }
 
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov128(uint8_t *dst, const uint8_t *src)
 {
-	__m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h;
-	asm volatile (
-		"movdqu (%[src]), %[reg_a]\n\t"
-		"movdqu 16(%[src]), %[reg_b]\n\t"
-		"movdqu 32(%[src]), %[reg_c]\n\t"
-		"movdqu 48(%[src]), %[reg_d]\n\t"
-		"movdqu 64(%[src]), %[reg_e]\n\t"
-		"movdqu 80(%[src]), %[reg_f]\n\t"
-		"movdqu 96(%[src]), %[reg_g]\n\t"
-		"movdqu 112(%[src]), %[reg_h]\n\t"
-		"movdqu %[reg_a], (%[dst])\n\t"
-		"movdqu %[reg_b], 16(%[dst])\n\t"
-		"movdqu %[reg_c], 32(%[dst])\n\t"
-		"movdqu %[reg_d], 48(%[dst])\n\t"
-		"movdqu %[reg_e], 64(%[dst])\n\t"
-		"movdqu %[reg_f], 80(%[dst])\n\t"
-		"movdqu %[reg_g], 96(%[dst])\n\t"
-		"movdqu %[reg_h], 112(%[dst])\n\t"
-		: [reg_a] "=x" (reg_a),
-		  [reg_b] "=x" (reg_b),
-		  [reg_c] "=x" (reg_c),
-		  [reg_d] "=x" (reg_d),
-		  [reg_e] "=x" (reg_e),
-		  [reg_f] "=x" (reg_f),
-		  [reg_g] "=x" (reg_g),
-		  [reg_h] "=x" (reg_h)
-		: [src] "r" (src),
-		  [dst] "r"(dst)
-		: "memory"
-	);
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
 }
 
-#ifdef __INTEL_COMPILER
-#pragma warning(enable:593)
-#endif
-
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
 static inline void
 rte_mov256(uint8_t *dst, const uint8_t *src)
 {
-	rte_mov128(dst, src);
-	rte_mov128(dst + 128, src + 128);
+	rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+	rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+	rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+	rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+	rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
+	rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
+	rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
+	rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
 }
 
-#define rte_memcpy(dst, src, n)              \
-	({ (__builtin_constant_p(n)) ?       \
-	memcpy((dst), (src), (n)) :          \
-	rte_memcpy_func((dst), (src), (n)); })
+/**
+ * Copy 64-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1;
+
+	while (n >= 64) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
+		n -= 64;
+		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
+		src = (const uint8_t *)src + 64;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		dst = (uint8_t *)dst + 64;
+	}
+}
+
+/**
+ * Copy 256-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+	__m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
+
+	while (n >= 256) {
+		ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
+		n -= 256;
+		ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
+		ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
+		ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
+		ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32));
+		ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32));
+		ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32));
+		ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32));
+		src = (const uint8_t *)src + 256;
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6);
+		_mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7);
+		dst = (uint8_t *)dst + 256;
+	}
+}
 
 static inline void *
-rte_memcpy_func(void *dst, const void *src, size_t n)
+rte_memcpy(void *dst, const void *src, size_t n)
 {
 	void *ret = dst;
+	int dstofss;
+	int bits;
 
-	/* We can't copy < 16 bytes using XMM registers so do it manually. */
+	/**
+	 * Copy less than 16 bytes
+	 */
 	if (n < 16) {
 		if (n & 0x01) {
 			*(uint8_t *)dst = *(const uint8_t *)src;
-			dst = (uint8_t *)dst + 1;
 			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
 		}
 		if (n & 0x02) {
 			*(uint16_t *)dst = *(const uint16_t *)src;
-			dst = (uint16_t *)dst + 1;
 			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
 		}
 		if (n & 0x04) {
 			*(uint32_t *)dst = *(const uint32_t *)src;
-			dst = (uint32_t *)dst + 1;
 			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
 		}
 		if (n & 0x08) {
 			*(uint64_t *)dst = *(const uint64_t *)src;
@@ -201,95 +224,394 @@ rte_memcpy_func(void *dst, const void *src, size_t n)
 		return ret;
 	}
 
-	/* Special fast cases for <= 128 bytes */
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
+	 */
 	if (n <= 32) {
 		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
 		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
 		return ret;
 	}
-
 	if (n <= 64) {
 		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
 		rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
 		return ret;
 	}
-
-	if (n <= 128) {
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK31:
+		if (n > 32) {
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+		}
 		return ret;
 	}
 
-	/*
-	 * For large copies > 128 bytes. This combination of 256, 64 and 16 byte
-	 * copies was found to be faster than doing 128 and 32 byte copies as
-	 * well.
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes
 	 */
-	for ( ; n >= 256; n -= 256) {
-		rte_mov256((uint8_t *)dst, (const uint8_t *)src);
-		dst = (uint8_t *)dst + 256;
-		src = (const uint8_t *)src + 256;
+	dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
+	n -= dstofss;
+	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+	src = (const uint8_t *)src + dstofss;
+	dst = (uint8_t *)dst + dstofss;
+
+	/**
+	 * Copy 256-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
+	bits = n;
+	n = n & 255;
+	bits -= n;
+	src = (const uint8_t *)src + bits;
+	dst = (uint8_t *)dst + bits;
+
+	/**
+	 * Copy 64-byte blocks.
+	 * Use copy block function for better instruction order control,
+	 * which is important when load is unaligned.
+	 */
+	if (n >= 64) {
+		rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
+		bits = n;
+		n = n & 63;
+		bits -= n;
+		src = (const uint8_t *)src + bits;
+		dst = (uint8_t *)dst + bits;
 	}
 
-	/*
-	 * We split the remaining bytes (which will be less than 256) into
-	 * 64byte (2^6) chunks.
-	 * Using incrementing integers in the case labels of a switch statement
-	 * enourages the compiler to use a jump table. To get incrementing
-	 * integers, we shift the 2 relevant bits to the LSB position to first
-	 * get decrementing integers, and then subtract.
+	/**
+	 * Copy whatever is left
 	 */
-	switch (3 - (n >> 6)) {
-	case 0x00:
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		n -= 64;
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;      /* fallthrough */
-	case 0x01:
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		n -= 64;
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;      /* fallthrough */
-	case 0x02:
-		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
-		n -= 64;
-		dst = (uint8_t *)dst + 64;
-		src = (const uint8_t *)src + 64;      /* fallthrough */
-	default:
-		;
+	goto COPY_BLOCK_64_BACK31;
+}
+
+#else /* RTE_MACHINE_CPUFLAG_AVX2 */
+
+/**
+ * SSE & AVX implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+	__m128i xmm0;
+
+	xmm0 = _mm_loadu_si128((const __m128i *)src);
+	_mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+	rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+	rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+	rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+	rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+	rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+	rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+	rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+	rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+	rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
+	rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
+	rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
+	rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
+	rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
+	rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
+	rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
+	rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
+}
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be an immediate value within [1, 15]
+ * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                                                         \
+{                                                                                                           \
+	int tmp;                                                                                                \
+	while (len >= 128 + 16 - offset) {                                                                      \
+		xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));                  \
+		len -= 128;                                                                                         \
+		xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));                  \
+		xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));                  \
+		xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16));                  \
+		xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16));                  \
+		xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16));                  \
+		xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16));                  \
+		xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16));                  \
+		xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16));                  \
+		src = (const uint8_t *)src + 128;                                                                   \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));        \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));        \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset));        \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset));        \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset));        \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset));        \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset));        \
+		_mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset));        \
+		dst = (uint8_t *)dst + 128;                                                                         \
+	}                                                                                                       \
+	tmp = len;                                                                                              \
+	len = ((len - 16 + offset) & 127) + 16 - offset;                                                        \
+	tmp -= len;                                                                                             \
+	src = (const uint8_t *)src + tmp;                                                                       \
+	dst = (uint8_t *)dst + tmp;                                                                             \
+	if (len >= 32 + 16 - offset) {                                                                          \
+		while (len >= 32 + 16 - offset) {                                                                   \
+			xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16));              \
+			len -= 32;                                                                                      \
+			xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16));              \
+			xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16));              \
+			src = (const uint8_t *)src + 32;                                                                \
+			_mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset));    \
+			_mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset));    \
+			dst = (uint8_t *)dst + 32;                                                                      \
+		}                                                                                                   \
+		tmp = len;                                                                                          \
+		len = ((len - 16 + offset) & 31) + 16 - offset;                                                     \
+		tmp -= len;                                                                                         \
+		src = (const uint8_t *)src + tmp;                                                                   \
+		dst = (uint8_t *)dst + tmp;                                                                         \
+	}                                                                                                       \
+}
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+	void *ret = dst;
+	int dstofss;
+	int srcofs;
+
+	/**
+	 * Copy less than 16 bytes
+	 */
+	if (n < 16) {
+		if (n & 0x01) {
+			*(uint8_t *)dst = *(const uint8_t *)src;
+			src = (const uint8_t *)src + 1;
+			dst = (uint8_t *)dst + 1;
+		}
+		if (n & 0x02) {
+			*(uint16_t *)dst = *(const uint16_t *)src;
+			src = (const uint16_t *)src + 1;
+			dst = (uint16_t *)dst + 1;
+		}
+		if (n & 0x04) {
+			*(uint32_t *)dst = *(const uint32_t *)src;
+			src = (const uint32_t *)src + 1;
+			dst = (uint32_t *)dst + 1;
+		}
+		if (n & 0x08) {
+			*(uint64_t *)dst = *(const uint64_t *)src;
+		}
+		return ret;
 	}
 
-	/*
-	 * We split the remaining bytes (which will be less than 64) into
-	 * 16byte (2^4) chunks, using the same switch structure as above.
+	/**
+	 * Fast way when copy size doesn't exceed 512 bytes
 	 */
-	switch (3 - (n >> 4)) {
-	case 0x00:
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		n -= 16;
-		dst = (uint8_t *)dst + 16;
-		src = (const uint8_t *)src + 16;      /* fallthrough */
-	case 0x01:
-		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		n -= 16;
-		dst = (uint8_t *)dst + 16;
-		src = (const uint8_t *)src + 16;      /* fallthrough */
-	case 0x02:
+	if (n <= 32) {
 		rte_mov16((uint8_t *)dst, (const uint8_t *)src);
-		n -= 16;
-		dst = (uint8_t *)dst + 16;
-		src = (const uint8_t *)src + 16;      /* fallthrough */
-	default:
-		;
+		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		return ret;
 	}
-
-	/* Copy any remaining bytes, without going beyond end of buffers */
-	if (n != 0) {
+	if (n <= 48) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		return ret;
+	}
+	if (n <= 64) {
+		rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+		rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
 		rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		return ret;
 	}
-	return ret;
+	if (n <= 128) {
+		goto COPY_BLOCK_128_BACK15;
+	}
+	if (n <= 512) {
+		if (n >= 256) {
+			n -= 256;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
+			src = (const uint8_t *)src + 256;
+			dst = (uint8_t *)dst + 256;
+		}
+COPY_BLOCK_255_BACK15:
+		if (n >= 128) {
+			n -= 128;
+			rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 128;
+			dst = (uint8_t *)dst + 128;
+		}
+COPY_BLOCK_128_BACK15:
+		if (n >= 64) {
+			n -= 64;
+			rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 64;
+			dst = (uint8_t *)dst + 64;
+		}
+COPY_BLOCK_64_BACK15:
+		if (n >= 32) {
+			n -= 32;
+			rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+			src = (const uint8_t *)src + 32;
+			dst = (uint8_t *)dst + 32;
+		}
+		if (n > 16) {
+			rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+			return ret;
+		}
+		if (n > 0) {
+			rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+		}
+		return ret;
+	}
+
+	/**
+	 * Make store aligned when copy size exceeds 512 bytes,
+	 * and make sure the first 15 bytes are copied, because
+	 * unaligned copy functions require up to 15 bytes
+	 * backwards access.
+	 */
+	dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
+	n -= dstofss;
+	rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+	src = (const uint8_t *)src + dstofss;
+	dst = (uint8_t *)dst + dstofss;
+	srcofs = (int)((long long)(const void *)src & 0x0F);
+
+	/**
+	 * For aligned copy
+	 */
+	if (srcofs == 0) {
+		/**
+		 * Copy 256-byte blocks
+		 */
+		for (; n >= 256; n -= 256) {
+			rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+			dst = (uint8_t *)dst + 256;
+			src = (const uint8_t *)src + 256;
+		}
+
+		/**
+		 * Copy whatever is left
+		 */
+		goto COPY_BLOCK_255_BACK15;
+	}
+
+	/**
+	 * For copy with unaligned load, use PALIGNR to force load alignment.
+	 * Use switch here because PALIGNR requires an immediate value for the shift count.
+	 */
+	switch (srcofs) {
+	case 0x01: MOVEUNALIGNED_LEFT47(dst, src, n, 0x01); break;
+	case 0x02: MOVEUNALIGNED_LEFT47(dst, src, n, 0x02); break;
+	case 0x03: MOVEUNALIGNED_LEFT47(dst, src, n, 0x03); break;
+	case 0x04: MOVEUNALIGNED_LEFT47(dst, src, n, 0x04); break;
+	case 0x05: MOVEUNALIGNED_LEFT47(dst, src, n, 0x05); break;
+	case 0x06: MOVEUNALIGNED_LEFT47(dst, src, n, 0x06); break;
+	case 0x07: MOVEUNALIGNED_LEFT47(dst, src, n, 0x07); break;
+	case 0x08: MOVEUNALIGNED_LEFT47(dst, src, n, 0x08); break;
+	case 0x09: MOVEUNALIGNED_LEFT47(dst, src, n, 0x09); break;
+	case 0x0A: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0A); break;
+	case 0x0B: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0B); break;
+	case 0x0C: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0C); break;
+	case 0x0D: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0D); break;
+	case 0x0E: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0E); break;
+	case 0x0F: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0F); break;
+	default:;
+	}
+
+	/**
+	 * Copy whatever is left
+	 */
+	goto COPY_BLOCK_64_BACK15;
 }
 
+#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
+
 #ifdef __cplusplus
 }
 #endif
-- 
1.9.3

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-19  1:53 [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization zhihong.wang
                   ` (3 preceding siblings ...)
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms zhihong.wang
@ 2015-01-19 13:02 ` Neil Horman
  2015-01-20  3:01   ` Wang, Zhihong
  2015-01-25 14:50 ` Luke Gorrie
  2015-01-29  3:42 ` [dpdk-dev] " Fu, JingguoX
  6 siblings, 1 reply; 48+ messages in thread
From: Neil Horman @ 2015-01-19 13:02 UTC (permalink / raw)
  To: zhihong.wang; +Cc: dev

On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com wrote:
> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test points.
> 
> Optimization techniques are summarized below:
> 
> 1. Utilize full cache bandwidth
> 
> 2. Enforce aligned stores
> 
> 3. Apply load address alignment based on architecture features
> 
> 4. Make load/store address available as early as possible
> 
> 5. General optimization techniques like inlining, branch reducing, prefetch pattern access
> 
> Zhihong Wang (4):
>   Disabled VTA for memcpy test in app/test/Makefile
>   Removed unnecessary test cases in test_memcpy.c
>   Extended test coverage in test_memcpy_perf.c
>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
>     platforms
> 
>  app/test/Makefile                                  |   6 +
>  app/test/test_memcpy.c                             |  52 +-
>  app/test/test_memcpy_perf.c                        | 238 +++++---
>  .../common/include/arch/x86/rte_memcpy.h           | 664 +++++++++++++++------
>  4 files changed, 656 insertions(+), 304 deletions(-)
> 
> -- 
> 1.9.3
> 
> 
Are you able to compile this with gcc 4.9.2?  The compilation of
test_memcpy_perf is taking forever for me.  It appears hung.
Neil

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-19 13:02 ` [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization Neil Horman
@ 2015-01-20  3:01   ` Wang, Zhihong
  2015-01-20 15:11     ` Neil Horman
  0 siblings, 1 reply; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-20  3:01 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev



> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Monday, January 19, 2015 9:02 PM
> To: Wang, Zhihong
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com wrote:
> > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> > It also extends memcpy test coverage with unaligned cases and more test
> points.
> >
> > Optimization techniques are summarized below:
> >
> > 1. Utilize full cache bandwidth
> >
> > 2. Enforce aligned stores
> >
> > 3. Apply load address alignment based on architecture features
> >
> > 4. Make load/store address available as early as possible
> >
> > 5. General optimization techniques like inlining, branch reducing,
> > prefetch pattern access
> >
> > Zhihong Wang (4):
> >   Disabled VTA for memcpy test in app/test/Makefile
> >   Removed unnecessary test cases in test_memcpy.c
> >   Extended test coverage in test_memcpy_perf.c
> >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> >     platforms
> >
> >  app/test/Makefile                                  |   6 +
> >  app/test/test_memcpy.c                             |  52 +-
> >  app/test/test_memcpy_perf.c                        | 238 +++++---
> >  .../common/include/arch/x86/rte_memcpy.h           | 664
> +++++++++++++++------
> >  4 files changed, 656 insertions(+), 304 deletions(-)
> >
> > --
> > 1.9.3
> >
> >
> Are you able to compile this with gcc 4.9.2?  The compilation of
> test_memcpy_perf is taking forever for me.  It appears hung.
> Neil


Neil,

Thanks for reporting this!
It should compile, but it will take quite some time if the CPU doesn't support AVX2. The reasons are:
1. The SSE & AVX memcpy implementation is more complicated than the AVX2 version, so the compiler takes more time to compile and optimize it
2. The new test_memcpy_perf.c contains 126 constant memcpy calls for better test case coverage, which is quite a lot
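
(For scale: each of these constant sizes expands into 4 SINGLE_PERF_TEST invocations, and every invocation inlines the full rte_memcpy body at its call site, so 126 constants mean roughly 504 expanded copies of rte_memcpy for the compiler to optimize in this single file.)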

I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
1. The whole compile process takes 9'41" with the original test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls)
2. It takes only 2'41" after reducing the number of constant memcpy calls to 12 + 12 = 24

I'll reduce the number of constant memcpy calls in the next version of the patch.

Zhihong (John)

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-20  3:01   ` Wang, Zhihong
@ 2015-01-20 15:11     ` Neil Horman
  2015-01-20 16:14       ` Bruce Richardson
  0 siblings, 1 reply; 48+ messages in thread
From: Neil Horman @ 2015-01-20 15:11 UTC (permalink / raw)
  To: Wang, Zhihong; +Cc: dev

On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> 
> 
> > -----Original Message-----
> > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > Sent: Monday, January 19, 2015 9:02 PM
> > To: Wang, Zhihong
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > 
> > On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com wrote:
> > > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> > > It also extends memcpy test coverage with unaligned cases and more test
> > points.
> > >
> > > Optimization techniques are summarized below:
> > >
> > > 1. Utilize full cache bandwidth
> > >
> > > 2. Enforce aligned stores
> > >
> > > 3. Apply load address alignment based on architecture features
> > >
> > > 4. Make load/store address available as early as possible
> > >
> > > 5. General optimization techniques like inlining, branch reducing,
> > > prefetch pattern access
> > >
> > > Zhihong Wang (4):
> > >   Disabled VTA for memcpy test in app/test/Makefile
> > >   Removed unnecessary test cases in test_memcpy.c
> > >   Extended test coverage in test_memcpy_perf.c
> > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > >     platforms
> > >
> > >  app/test/Makefile                                  |   6 +
> > >  app/test/test_memcpy.c                             |  52 +-
> > >  app/test/test_memcpy_perf.c                        | 238 +++++---
> > >  .../common/include/arch/x86/rte_memcpy.h           | 664
> > +++++++++++++++------
> > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > >
> > > --
> > > 1.9.3
> > >
> > >
> > Are you able to compile this with gcc 4.9.2?  The compilation of
> > test_memcpy_perf is taking forever for me.  It appears hung.
> > Neil
> 
> 
> Neil,
> 
> Thanks for reporting this!
> It should compile but will take quite some time if the CPU doesn't support AVX2, the reason is that:
> 1. The SSE & AVX memcpy implementation is more complicated than AVX2 version thus the compiler takes more time to compile and optimize
> 2. The new test_memcpy_perf.c contains 126 constants memcpy calls for better test case coverage, that's quite a lot
> 
> I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> 1. The whole compile process takes 9'41" with the original test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls)
> 2. It takes only 2'41" after I reduce the constant memcpy call number to 12 + 12 = 24
> 
> I'll reduce memcpy call in the next version of patch.
> 
Ok, thank you.  I'm all for optimization, but I think a compile that takes
almost 10 minutes for a single file is going to generate some raised eyebrows
when end users start tinkering with it.

Neil

> Zhihong (John)
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-20 15:11     ` Neil Horman
@ 2015-01-20 16:14       ` Bruce Richardson
  2015-01-21  3:44         ` Wang, Zhihong
  0 siblings, 1 reply; 48+ messages in thread
From: Bruce Richardson @ 2015-01-20 16:14 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev

On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> > 
> > 
> > > -----Original Message-----
> > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > Sent: Monday, January 19, 2015 9:02 PM
> > > To: Wang, Zhihong
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > 
> > > On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com wrote:
> > > > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> > > > It also extends memcpy test coverage with unaligned cases and more test
> > > points.
> > > >
> > > > Optimization techniques are summarized below:
> > > >
> > > > 1. Utilize full cache bandwidth
> > > >
> > > > 2. Enforce aligned stores
> > > >
> > > > 3. Apply load address alignment based on architecture features
> > > >
> > > > 4. Make load/store address available as early as possible
> > > >
> > > > 5. General optimization techniques like inlining, branch reducing,
> > > > prefetch pattern access
> > > >
> > > > Zhihong Wang (4):
> > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > >   Removed unnecessary test cases in test_memcpy.c
> > > >   Extended test coverage in test_memcpy_perf.c
> > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > > >     platforms
> > > >
> > > >  app/test/Makefile                                  |   6 +
> > > >  app/test/test_memcpy.c                             |  52 +-
> > > >  app/test/test_memcpy_perf.c                        | 238 +++++---
> > > >  .../common/include/arch/x86/rte_memcpy.h           | 664
> > > +++++++++++++++------
> > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > >
> > > > --
> > > > 1.9.3
> > > >
> > > >
> > > Are you able to compile this with gcc 4.9.2?  The compilation of
> > > test_memcpy_perf is taking forever for me.  It appears hung.
> > > Neil
> > 
> > 
> > Neil,
> > 
> > Thanks for reporting this!
> > It should compile but will take quite some time if the CPU doesn't support AVX2, the reason is that:
> > 1. The SSE & AVX memcpy implementation is more complicated than AVX2 version thus the compiler takes more time to compile and optimize
> > 2. The new test_memcpy_perf.c contains 126 constants memcpy calls for better test case coverage, that's quite a lot
> > 
> > I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> > 1. The whole compile process takes 9'41" with the original test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls)
> > 2. It takes only 2'41" after I reduce the constant memcpy call number to 12 + 12 = 24
> > 
> > I'll reduce memcpy call in the next version of patch.
> > 
> ok, thank you.  I'm all for optimzation, but I think a compile that takes almost
> 10 minutes for a single file is going to generate some raised eyebrows when end
> users start tinkering with it
> 
> Neil
> 
> > Zhihong (John)
> > 
Even two minutes is a very long time to compile, IMHO. The whole of DPDK doesn't
take that long to compile right now, and that's with a couple of huge header files
with routing tables in it. Any chance you could cut compile time down to a few
seconds while still having reasonable tests?
Also, when there is AVX2 present on the system, what is the compile time like
for that code?

	/Bruce

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms zhihong.wang
@ 2015-01-20 17:15   ` Stephen Hemminger
  2015-01-20 19:16     ` Neil Horman
  2015-01-25 20:02     ` Jim Thompson
  2015-01-26 14:43   ` Wodkowski, PawelX
  1 sibling, 2 replies; 48+ messages in thread
From: Stephen Hemminger @ 2015-01-20 17:15 UTC (permalink / raw)
  To: zhihong.wang; +Cc: dev

On Mon, 19 Jan 2015 09:53:34 +0800
zhihong.wang@intel.com wrote:

> Main code changes:
> 
> 1. Differentiate architectural features based on CPU flags
> 
>     a. Implement separated move functions for SSE/AVX/AVX2 to make full utilization of cache bandwidth
> 
>     b. Implement separated copy flow specifically optimized for target architecture
> 
> 2. Rewrite the memcpy function "rte_memcpy"
> 
>     a. Add store aligning
> 
>     b. Add load aligning based on architectural features
> 
>     c. Put block copy loop into inline move functions for better control of instruction order
> 
>     d. Eliminate unnecessary MOVs
> 
> 3. Rewrite the inline move functions
> 
>     a. Add move functions for unaligned load cases
> 
>     b. Change instruction order in copy loops for better pipeline utilization
> 
>     c. Use intrinsics instead of assembly code
> 
> 4. Remove slow glibc call for constant copies
> 
> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>

Dumb question: why not fix glibc memcpy instead?
What is special about rte_memcpy?

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-20 17:15   ` Stephen Hemminger
@ 2015-01-20 19:16     ` Neil Horman
  2015-01-21  3:18       ` Wang, Zhihong
  2015-01-25 20:02     ` Jim Thompson
  1 sibling, 1 reply; 48+ messages in thread
From: Neil Horman @ 2015-01-20 19:16 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> On Mon, 19 Jan 2015 09:53:34 +0800
> zhihong.wang@intel.com wrote:
> 
> > Main code changes:
> > 
> > 1. Differentiate architectural features based on CPU flags
> > 
> >     a. Implement separated move functions for SSE/AVX/AVX2 to make full utilization of cache bandwidth
> > 
> >     b. Implement separated copy flow specifically optimized for target architecture
> > 
> > 2. Rewrite the memcpy function "rte_memcpy"
> > 
> >     a. Add store aligning
> > 
> >     b. Add load aligning based on architectural features
> > 
> >     c. Put block copy loop into inline move functions for better control of instruction order
> > 
> >     d. Eliminate unnecessary MOVs
> > 
> > 3. Rewrite the inline move functions
> > 
> >     a. Add move functions for unaligned load cases
> > 
> >     b. Change instruction order in copy loops for better pipeline utilization
> > 
> >     c. Use intrinsics instead of assembly code
> > 
> > 4. Remove slow glibc call for constant copies
> > 
> > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> 
> Dumb question: why not fix glibc memcpy instead?
> What is special about rte_memcpy?
> 
> 
Fair point.  Though, does glibc implement optimized memcpys per arch?  Or do
they just rely on the __builtin's from gcc to get optimized variants?

Neil

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-20 19:16     ` Neil Horman
@ 2015-01-21  3:18       ` Wang, Zhihong
  0 siblings, 0 replies; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-21  3:18 UTC (permalink / raw)
  To: Neil Horman, Stephen Hemminger; +Cc: dev



> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, January 21, 2015 3:16 AM
> To: Stephen Hemminger
> Cc: Wang, Zhihong; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> On Tue, Jan 20, 2015 at 09:15:38AM -0800, Stephen Hemminger wrote:
> > On Mon, 19 Jan 2015 09:53:34 +0800
> > zhihong.wang@intel.com wrote:
> >
> > > Main code changes:
> > >
> > > 1. Differentiate architectural features based on CPU flags
> > >
> > >     a. Implement separated move functions for SSE/AVX/AVX2 to make
> > > full utilization of cache bandwidth
> > >
> > >     b. Implement separated copy flow specifically optimized for
> > > target architecture
> > >
> > > 2. Rewrite the memcpy function "rte_memcpy"
> > >
> > >     a. Add store aligning
> > >
> > >     b. Add load aligning based on architectural features
> > >
> > >     c. Put block copy loop into inline move functions for better
> > > control of instruction order
> > >
> > >     d. Eliminate unnecessary MOVs
> > >
> > > 3. Rewrite the inline move functions
> > >
> > >     a. Add move functions for unaligned load cases
> > >
> > >     b. Change instruction order in copy loops for better pipeline
> > > utilization
> > >
> > >     c. Use intrinsics instead of assembly code
> > >
> > > 4. Remove slow glibc call for constant copies
> > >
> > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> >
> > Dumb question: why not fix glibc memcpy instead?
> > What is special about rte_memcpy?
> >
> >
> Fair point.  Though, does glibc implement optimized memcpys per arch?  Or
> do they just rely on the __builtin's from gcc to get optimized variants?
> 
> Neil

Neil, Stephen,

Glibc has per-arch implementations, but they target the general-purpose case, while rte_memcpy is tuned for small-size, in-cache memcpy, which is the DPDK case. This leads to different trade-offs and optimization techniques.
Also, glibc's updates from version to version are based on general judgments. Roughly speaking, glibc 2.18 targets Ivy Bridge and 2.20 targets Haswell, though that's not fully accurate. But we need one implementation that works well on both Sandy Bridge and Haswell.

For instance, glibc 2.18 has a load-aligning optimization for unaligned memcpy but doesn't support 256-bit moves, while glibc 2.20 adds support for 256-bit moves but removes the load-aligning optimization. This hurts unaligned memcpy performance a lot on architectures like Ivy Bridge. Glibc's reasoning is that load aligning doesn't help when src/dst isn't in cache, which may be the general case, but it isn't the DPDK case.
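
To make "load aligning" concrete, here is a simplified sketch of the idea (illustrative only, not the actual glibc or patch code): copy a short head so the source pointer becomes 16-byte aligned, then run the main loop with aligned loads.

    #include <emmintrin.h>
    #include <stdint.h>
    #include <string.h>

    static inline void
    copy_load_aligned(uint8_t *dst, const uint8_t *src, size_t n)
    {
        /* head: bring src up to 16-byte alignment */
        size_t head = (16 - ((uintptr_t)src & 15)) & 15;

        if (head > n)
            head = n;
        memcpy(dst, src, head);            /* unaligned head */
        dst += head; src += head; n -= head;

        /* body: aligned 16-byte loads, unaligned stores */
        while (n >= 16) {
            __m128i x = _mm_load_si128((const __m128i *)src);
            _mm_storeu_si128((__m128i *)dst, x);
            src += 16; dst += 16; n -= 16;
        }
        memcpy(dst, src, n);               /* unaligned tail */
    }

When src/dst stay in cache, as in the DPDK case, the aligned loads in the body pay off; when the data is cold, the extra head handling doesn't help, which is glibc's rationale for dropping it.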

Zhihong (John)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-20 16:14       ` Bruce Richardson
@ 2015-01-21  3:44         ` Wang, Zhihong
  2015-01-21 11:40           ` Bruce Richardson
                             ` (2 more replies)
  0 siblings, 3 replies; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-21  3:44 UTC (permalink / raw)
  To: Richardson, Bruce, Neil Horman; +Cc: dev



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Wednesday, January 21, 2015 12:15 AM
> To: Neil Horman
> Cc: Wang, Zhihong; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > Sent: Monday, January 19, 2015 9:02 PM
> > > > To: Wang, Zhihong
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com
> wrote:
> > > > > This patch set optimizes memcpy for DPDK for both SSE and AVX
> platforms.
> > > > > It also extends memcpy test coverage with unaligned cases and
> > > > > more test
> > > > points.
> > > > >
> > > > > Optimization techniques are summarized below:
> > > > >
> > > > > 1. Utilize full cache bandwidth
> > > > >
> > > > > 2. Enforce aligned stores
> > > > >
> > > > > 3. Apply load address alignment based on architecture features
> > > > >
> > > > > 4. Make load/store address available as early as possible
> > > > >
> > > > > 5. General optimization techniques like inlining, branch
> > > > > reducing, prefetch pattern access
> > > > >
> > > > > Zhihong Wang (4):
> > > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > > >   Removed unnecessary test cases in test_memcpy.c
> > > > >   Extended test coverage in test_memcpy_perf.c
> > > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > > > >     platforms
> > > > >
> > > > >  app/test/Makefile                                  |   6 +
> > > > >  app/test/test_memcpy.c                             |  52 +-
> > > > >  app/test/test_memcpy_perf.c                        | 238 +++++---
> > > > >  .../common/include/arch/x86/rte_memcpy.h           | 664
> > > > +++++++++++++++------
> > > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > >
> > > > > --
> > > > > 1.9.3
> > > > >
> > > > >
> > > > Are you able to compile this with gcc 4.9.2?  The compilation of
> > > > test_memcpy_perf is taking forever for me.  It appears hung.
> > > > Neil
> > >
> > >
> > > Neil,
> > >
> > > Thanks for reporting this!
> > > It should compile but will take quite some time if the CPU doesn't support
> AVX2, the reason is that:
> > > 1. The SSE & AVX memcpy implementation is more complicated than
> AVX2
> > > version thus the compiler takes more time to compile and optimize 2.
> > > The new test_memcpy_perf.c contains 126 constants memcpy calls for
> > > better test case coverage, that's quite a lot
> > >
> > > I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> > > 1. The whole compile process takes 9'41" with the original
> > > test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
> > > only 2'41" after I reduce the constant memcpy call number to 12 + 12
> > > = 24
> > >
> > > I'll reduce memcpy call in the next version of patch.
> > >
> > ok, thank you.  I'm all for optimzation, but I think a compile that
> > takes almost
> > 10 minutes for a single file is going to generate some raised eyebrows
> > when end users start tinkering with it
> >
> > Neil
> >
> > > Zhihong (John)
> > >
> Even two minutes is a very long time to compile, IMHO. The whole of DPDK
> doesn't take that long to compile right now, and that's with a couple of huge
> header files with routing tables in it. Any chance you could cut compile time
> down to a few seconds while still having reasonable tests?
> Also, when there is AVX2 present on the system, what is the compile time
> like for that code?
> 
> 	/Bruce

Neil, Bruce,

Some data first.

Sandy Bridge without AVX2:
1. original w/ 10 constant memcpy: 2'25" 
2. patch w/ 12 constant memcpy: 2'41" 
3. patch w/ 63 constant memcpy: 9'41" 

Haswell with AVX2:
1. original w/ 10 constant memcpy: 1'57" 
2. patch w/ 12 constant memcpy: 1'56" 
3. patch w/ 63 constant memcpy: 3'16" 

Also, to address Bruce's question: we have to reduce the test cases to cut down compile time, because we use:
1. intrinsics instead of assembly, for better flexibility and so the compiler can apply more optimization
2. a complex function body, for better performance
3. inlining
All of this increases compile time.
But I think it'd be okay to do that as long as we can select a fair set of test points.

It'd be great if you could give some suggestions, say, 12 points.
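
For reference, the inline move helpers are roughly of this shape (an illustrative sketch, not the exact patch code). Because they are inlined at every constant-size call site, the compiler re-schedules and re-optimizes them each time, which is where the extra compile time goes:

    #include <immintrin.h>
    #include <stdint.h>

    /* Illustrative sketch of an intrinsics-based inline move helper. */
    static inline void
    rte_mov32_sketch(uint8_t *dst, const uint8_t *src)
    {
        __m256i ymm0;

        ymm0 = _mm256_loadu_si256((const __m256i *)src);
        _mm256_storeu_si256((__m256i *)dst, ymm0);
    }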

Zhihong (John)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21  3:44         ` Wang, Zhihong
@ 2015-01-21 11:40           ` Bruce Richardson
  2015-01-21 12:02           ` Ananyev, Konstantin
  2015-01-21 12:36           ` Marc Sune
  2 siblings, 0 replies; 48+ messages in thread
From: Bruce Richardson @ 2015-01-21 11:40 UTC (permalink / raw)
  To: Wang, Zhihong; +Cc: dev

On Wed, Jan 21, 2015 at 03:44:23AM +0000, Wang, Zhihong wrote:
 
> Neil, Bruce,
> 
> Some data first.
> 
> Sandy Bridge without AVX2:
> 1. original w/ 10 constant memcpy: 2'25" 
> 2. patch w/ 12 constant memcpy: 2'41" 
> 3. patch w/ 63 constant memcpy: 9'41" 
> 
> Haswell with AVX2:
> 1. original w/ 10 constant memcpy: 1'57" 
> 2. patch w/ 12 constant memcpy: 1'56" 
> 3. patch w/ 63 constant memcpy: 3'16" 
> 
> Also, to address Bruce's question, we have to reduce test case to cut down compile time. Because we use:
> 1. intrinsics instead of assembly for better flexibility and can utilize more compiler optimization 
> 2. complex function body for better performance 
> 3. inlining 
> This increases compile time.
> But I think it'd be okay to do that as long as we can select a fair set of test points.
> 
> It'd be great if you could give some suggestion, say, 12 points.
> 
> Zhihong (John)
> 
Hi Zhihong,

Just for comparison, I've done a clean DPDK compile on my SNB system this morning.
Using parallel make (which is pretty normal I suspect), I get the following
numbers:
 real    0m52.549s
 user    0m36.034s
 sys     0m10.014s

So total compile time is 52 seconds.

Running a make uninstall and then make install again with "-j 1" gives the
following numbers:

 real    0m32.751s
 user    0m16.041s
 sys     0m7.946s

Obviously, caching effects are being completely ignored by this unscientific
study (rerunning the first test again gives a 13-second time), but the upshot
is that the compile time for DPDK right now is well under a minute in the normal
case. Adding a new file that, in the best case, takes two minutes to compile
is going to increase our compile time many times over.

Regards,
/Bruce

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21  3:44         ` Wang, Zhihong
  2015-01-21 11:40           ` Bruce Richardson
@ 2015-01-21 12:02           ` Ananyev, Konstantin
  2015-01-21 12:38             ` Neil Horman
  2015-01-21 12:36           ` Marc Sune
  2 siblings, 1 reply; 48+ messages in thread
From: Ananyev, Konstantin @ 2015-01-21 12:02 UTC (permalink / raw)
  To: Wang, Zhihong, Richardson, Bruce, Neil Horman; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Wednesday, January 21, 2015 3:44 AM
> To: Richardson, Bruce; Neil Horman
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -----Original Message-----
> > From: Richardson, Bruce
> > Sent: Wednesday, January 21, 2015 12:15 AM
> > To: Neil Horman
> > Cc: Wang, Zhihong; dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> > On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > Sent: Monday, January 19, 2015 9:02 PM
> > > > > To: Wang, Zhihong
> > > > > Cc: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >
> > > > > On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com
> > wrote:
> > > > > > This patch set optimizes memcpy for DPDK for both SSE and AVX
> > platforms.
> > > > > > It also extends memcpy test coverage with unaligned cases and
> > > > > > more test
> > > > > points.
> > > > > >
> > > > > > Optimization techniques are summarized below:
> > > > > >
> > > > > > 1. Utilize full cache bandwidth
> > > > > >
> > > > > > 2. Enforce aligned stores
> > > > > >
> > > > > > 3. Apply load address alignment based on architecture features
> > > > > >
> > > > > > 4. Make load/store address available as early as possible
> > > > > >
> > > > > > 5. General optimization techniques like inlining, branch
> > > > > > reducing, prefetch pattern access
> > > > > >
> > > > > > Zhihong Wang (4):
> > > > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > > > >   Removed unnecessary test cases in test_memcpy.c
> > > > > >   Extended test coverage in test_memcpy_perf.c
> > > > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > > > > >     platforms
> > > > > >
> > > > > >  app/test/Makefile                                  |   6 +
> > > > > >  app/test/test_memcpy.c                             |  52 +-
> > > > > >  app/test/test_memcpy_perf.c                        | 238 +++++---
> > > > > >  .../common/include/arch/x86/rte_memcpy.h           | 664
> > > > > +++++++++++++++------
> > > > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > > >
> > > > > > --
> > > > > > 1.9.3
> > > > > >
> > > > > >
> > > > > Are you able to compile this with gcc 4.9.2?  The compilation of
> > > > > test_memcpy_perf is taking forever for me.  It appears hung.
> > > > > Neil
> > > >
> > > >
> > > > Neil,
> > > >
> > > > Thanks for reporting this!
> > > > It should compile but will take quite some time if the CPU doesn't support
> > AVX2, the reason is that:
> > > > 1. The SSE & AVX memcpy implementation is more complicated than
> > AVX2
> > > > version thus the compiler takes more time to compile and optimize 2.
> > > > The new test_memcpy_perf.c contains 126 constants memcpy calls for
> > > > better test case coverage, that's quite a lot
> > > >
> > > > I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> > > > 1. The whole compile process takes 9'41" with the original
> > > > test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
> > > > only 2'41" after I reduce the constant memcpy call number to 12 + 12
> > > > = 24
> > > >
> > > > I'll reduce memcpy call in the next version of patch.
> > > >
> > > ok, thank you.  I'm all for optimzation, but I think a compile that
> > > takes almost
> > > 10 minutes for a single file is going to generate some raised eyebrows
> > > when end users start tinkering with it
> > >
> > > Neil
> > >
> > > > Zhihong (John)
> > > >
> > Even two minutes is a very long time to compile, IMHO. The whole of DPDK
> > doesn't take that long to compile right now, and that's with a couple of huge
> > header files with routing tables in it. Any chance you could cut compile time
> > down to a few seconds while still having reasonable tests?
> > Also, when there is AVX2 present on the system, what is the compile time
> > like for that code?
> >
> > 	/Bruce
> 
> Neil, Bruce,
> 
> Some data first.
> 
> Sandy Bridge without AVX2:
> 1. original w/ 10 constant memcpy: 2'25"
> 2. patch w/ 12 constant memcpy: 2'41"
> 3. patch w/ 63 constant memcpy: 9'41"
> 
> Haswell with AVX2:
> 1. original w/ 10 constant memcpy: 1'57"
> 2. patch w/ 12 constant memcpy: 1'56"
> 3. patch w/ 63 constant memcpy: 3'16"
> 
> Also, to address Bruce's question, we have to reduce test case to cut down compile time. Because we use:
> 1. intrinsics instead of assembly for better flexibility and can utilize more compiler optimization
> 2. complex function body for better performance
> 3. inlining
> This increases compile time.

We use intrinsics and inlining in many other places too.
Why did it suddenly become a problem here?
Konstantin

> But I think it'd be okay to do that as long as we can select a fair set of test points.
> 
> It'd be great if you could give some suggestion, say, 12 points.
> 
> Zhihong (John)
> 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21  3:44         ` Wang, Zhihong
  2015-01-21 11:40           ` Bruce Richardson
  2015-01-21 12:02           ` Ananyev, Konstantin
@ 2015-01-21 12:36           ` Marc Sune
  2015-01-21 13:02             ` Bruce Richardson
  2 siblings, 1 reply; 48+ messages in thread
From: Marc Sune @ 2015-01-21 12:36 UTC (permalink / raw)
  To: dev


On 21/01/15 04:44, Wang, Zhihong wrote:
>
>> -----Original Message-----
>> From: Richardson, Bruce
>> Sent: Wednesday, January 21, 2015 12:15 AM
>> To: Neil Horman
>> Cc: Wang, Zhihong; dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>
>> On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
>>> On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
>>>>
>>>>> -----Original Message-----
>>>>> From: Neil Horman [mailto:nhorman@tuxdriver.com]
>>>>> Sent: Monday, January 19, 2015 9:02 PM
>>>>> To: Wang, Zhihong
>>>>> Cc: dev@dpdk.org
>>>>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>>>>
>>>>> On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com
>> wrote:
>>>>>> This patch set optimizes memcpy for DPDK for both SSE and AVX
>> platforms.
>>>>>> It also extends memcpy test coverage with unaligned cases and
>>>>>> more test
>>>>> points.
>>>>>> Optimization techniques are summarized below:
>>>>>>
>>>>>> 1. Utilize full cache bandwidth
>>>>>>
>>>>>> 2. Enforce aligned stores
>>>>>>
>>>>>> 3. Apply load address alignment based on architecture features
>>>>>>
>>>>>> 4. Make load/store address available as early as possible
>>>>>>
>>>>>> 5. General optimization techniques like inlining, branch
>>>>>> reducing, prefetch pattern access
>>>>>>
>>>>>> Zhihong Wang (4):
>>>>>>    Disabled VTA for memcpy test in app/test/Makefile
>>>>>>    Removed unnecessary test cases in test_memcpy.c
>>>>>>    Extended test coverage in test_memcpy_perf.c
>>>>>>    Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
>>>>>>      platforms
>>>>>>
>>>>>>   app/test/Makefile                                  |   6 +
>>>>>>   app/test/test_memcpy.c                             |  52 +-
>>>>>>   app/test/test_memcpy_perf.c                        | 238 +++++---
>>>>>>   .../common/include/arch/x86/rte_memcpy.h           | 664
>>>>> +++++++++++++++------
>>>>>>   4 files changed, 656 insertions(+), 304 deletions(-)
>>>>>>
>>>>>> --
>>>>>> 1.9.3
>>>>>>
>>>>>>
>>>>> Are you able to compile this with gcc 4.9.2?  The compilation of
>>>>> test_memcpy_perf is taking forever for me.  It appears hung.
>>>>> Neil
>>>>
>>>> Neil,
>>>>
>>>> Thanks for reporting this!
>>>> It should compile but will take quite some time if the CPU doesn't support
>> AVX2, the reason is that:
>>>> 1. The SSE & AVX memcpy implementation is more complicated than
>> AVX2
>>>> version thus the compiler takes more time to compile and optimize 2.
>>>> The new test_memcpy_perf.c contains 126 constants memcpy calls for
>>>> better test case coverage, that's quite a lot
>>>>
>>>> I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
>>>> 1. The whole compile process takes 9'41" with the original
>>>> test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
>>>> only 2'41" after I reduce the constant memcpy call number to 12 + 12
>>>> = 24
>>>>
>>>> I'll reduce memcpy call in the next version of patch.
>>>>
>>> ok, thank you.  I'm all for optimzation, but I think a compile that
>>> takes almost
>>> 10 minutes for a single file is going to generate some raised eyebrows
>>> when end users start tinkering with it
>>>
>>> Neil
>>>
>>>> Zhihong (John)
>>>>
>> Even two minutes is a very long time to compile, IMHO. The whole of DPDK
>> doesn't take that long to compile right now, and that's with a couple of huge
>> header files with routing tables in it. Any chance you could cut compile time
>> down to a few seconds while still having reasonable tests?
>> Also, when there is AVX2 present on the system, what is the compile time
>> like for that code?
>>
>> 	/Bruce
> Neil, Bruce,
>
> Some data first.
>
> Sandy Bridge without AVX2:
> 1. original w/ 10 constant memcpy: 2'25"
> 2. patch w/ 12 constant memcpy: 2'41"
> 3. patch w/ 63 constant memcpy: 9'41"
>
> Haswell with AVX2:
> 1. original w/ 10 constant memcpy: 1'57"
> 2. patch w/ 12 constant memcpy: 1'56"
> 3. patch w/ 63 constant memcpy: 3'16"
>
> Also, to address Bruce's question, we have to reduce test case to cut down compile time. Because we use:
> 1. intrinsics instead of assembly for better flexibility and can utilize more compiler optimization
> 2. complex function body for better performance
> 3. inlining
> This increases compile time.
> But I think it'd be okay to do that as long as we can select a fair set of test points.
>
> It'd be great if you could give some suggestion, say, 12 points.
>
> Zhihong (John)
>
>

While I agree that in the general case these long compilation times are
painful for the users, having a factor of 2-8x in memcpy operations is
quite an improvement, especially in DPDK applications which (unfortunately)
need to rely heavily on them -- e.g. IP fragmentation and reassembly.

Why not have fast compilation by default, and a tunable config flag
to enable a highly optimized version of rte_memcpy (e.g.
RTE_EAL_OPT_MEMCPY)?
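
Something like the following, for instance (RTE_EAL_OPT_MEMCPY and rte_memcpy_optimized are just example names, not existing options):

    /* Hypothetical sketch: select the memcpy flavor at build time. */
    #ifdef RTE_EAL_OPT_MEMCPY
    /* highly optimized, slow-to-compile version */
    #define rte_memcpy(dst, src, n) rte_memcpy_optimized((dst), (src), (n))
    #else
    /* plain fallback, fast to compile */
    #define rte_memcpy(dst, src, n) memcpy((dst), (src), (n))
    #endif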

Marc

>

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 12:02           ` Ananyev, Konstantin
@ 2015-01-21 12:38             ` Neil Horman
  2015-01-23  3:26               ` Wang, Zhihong
  0 siblings, 1 reply; 48+ messages in thread
From: Neil Horman @ 2015-01-21 12:38 UTC (permalink / raw)
  To: Ananyev, Konstantin; +Cc: dev

On Wed, Jan 21, 2015 at 12:02:57PM +0000, Ananyev, Konstantin wrote:
> 
> 
> > -----Original Message-----
> > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wang, Zhihong
> > Sent: Wednesday, January 21, 2015 3:44 AM
> > To: Richardson, Bruce; Neil Horman
> > Cc: dev@dpdk.org
> > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > 
> > 
> > 
> > > -----Original Message-----
> > > From: Richardson, Bruce
> > > Sent: Wednesday, January 21, 2015 12:15 AM
> > > To: Neil Horman
> > > Cc: Wang, Zhihong; dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > > On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > Sent: Monday, January 19, 2015 9:02 PM
> > > > > > To: Wang, Zhihong
> > > > > > Cc: dev@dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > >
> > > > > > On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com
> > > wrote:
> > > > > > > This patch set optimizes memcpy for DPDK for both SSE and AVX
> > > platforms.
> > > > > > > It also extends memcpy test coverage with unaligned cases and
> > > > > > > more test
> > > > > > points.
> > > > > > >
> > > > > > > Optimization techniques are summarized below:
> > > > > > >
> > > > > > > 1. Utilize full cache bandwidth
> > > > > > >
> > > > > > > 2. Enforce aligned stores
> > > > > > >
> > > > > > > 3. Apply load address alignment based on architecture features
> > > > > > >
> > > > > > > 4. Make load/store address available as early as possible
> > > > > > >
> > > > > > > 5. General optimization techniques like inlining, branch
> > > > > > > reducing, prefetch pattern access
> > > > > > >
> > > > > > > Zhihong Wang (4):
> > > > > > >   Disabled VTA for memcpy test in app/test/Makefile
> > > > > > >   Removed unnecessary test cases in test_memcpy.c
> > > > > > >   Extended test coverage in test_memcpy_perf.c
> > > > > > >   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> > > > > > >     platforms
> > > > > > >
> > > > > > >  app/test/Makefile                                  |   6 +
> > > > > > >  app/test/test_memcpy.c                             |  52 +-
> > > > > > >  app/test/test_memcpy_perf.c                        | 238 +++++---
> > > > > > >  .../common/include/arch/x86/rte_memcpy.h           | 664
> > > > > > +++++++++++++++------
> > > > > > >  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > > > >
> > > > > > > --
> > > > > > > 1.9.3
> > > > > > >
> > > > > > >
> > > > > > Are you able to compile this with gcc 4.9.2?  The compilation of
> > > > > > test_memcpy_perf is taking forever for me.  It appears hung.
> > > > > > Neil
> > > > >
> > > > >
> > > > > Neil,
> > > > >
> > > > > Thanks for reporting this!
> > > > > It should compile but will take quite some time if the CPU doesn't support
> > > AVX2, the reason is that:
> > > > > 1. The SSE & AVX memcpy implementation is more complicated than
> > > AVX2
> > > > > version thus the compiler takes more time to compile and optimize 2.
> > > > > The new test_memcpy_perf.c contains 126 constants memcpy calls for
> > > > > better test case coverage, that's quite a lot
> > > > >
> > > > > I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> > > > > 1. The whole compile process takes 9'41" with the original
> > > > > test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
> > > > > only 2'41" after I reduce the constant memcpy call number to 12 + 12
> > > > > = 24
> > > > >
> > > > > I'll reduce memcpy call in the next version of patch.
> > > > >
> > > > ok, thank you.  I'm all for optimzation, but I think a compile that
> > > > takes almost
> > > > 10 minutes for a single file is going to generate some raised eyebrows
> > > > when end users start tinkering with it
> > > >
> > > > Neil
> > > >
> > > > > Zhihong (John)
> > > > >
> > > Even two minutes is a very long time to compile, IMHO. The whole of DPDK
> > > doesn't take that long to compile right now, and that's with a couple of huge
> > > header files with routing tables in it. Any chance you could cut compile time
> > > down to a few seconds while still having reasonable tests?
> > > Also, when there is AVX2 present on the system, what is the compile time
> > > like for that code?
> > >
> > > 	/Bruce
> > 
> > Neil, Bruce,
> > 
> > Some data first.
> > 
> > Sandy Bridge without AVX2:
> > 1. original w/ 10 constant memcpy: 2'25"
> > 2. patch w/ 12 constant memcpy: 2'41"
> > 3. patch w/ 63 constant memcpy: 9'41"
> > 
> > Haswell with AVX2:
> > 1. original w/ 10 constant memcpy: 1'57"
> > 2. patch w/ 12 constant memcpy: 1'56"
> > 3. patch w/ 63 constant memcpy: 3'16"
> > 
> > Also, to address Bruce's question, we have to reduce test case to cut down compile time. Because we use:
> > 1. intrinsics instead of assembly for better flexibility and can utilize more compiler optimization
> > 2. complex function body for better performance
> > 3. inlining
> > This increases compile time.
> 
> We use instrincts and inlining in many other places too.
> Why it suddenly became a problem here?
I agree, something just doesn't feel right here.  Not sure what it is yet, but I
don't see how a memcpy function can be so complex as to take almost 10 minutes
to compile.  It's almost like we're recursively including something here and it's
driving gcc into a huge loop.
Neil

> Konstantin
> 
> > But I think it'd be okay to do that as long as we can select a fair set of test points.
> > 
> > It'd be great if you could give some suggestion, say, 12 points.
> > 
> > Zhihong (John)
> > 
> > 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 12:36           ` Marc Sune
@ 2015-01-21 13:02             ` Bruce Richardson
  2015-01-21 13:21               ` Marc Sune
  0 siblings, 1 reply; 48+ messages in thread
From: Bruce Richardson @ 2015-01-21 13:02 UTC (permalink / raw)
  To: Marc Sune; +Cc: dev

On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> 
> On 21/01/15 04:44, Wang, Zhihong wrote:
> >
> >>-----Original Message-----
> >>From: Richardson, Bruce
> >>Sent: Wednesday, January 21, 2015 12:15 AM
> >>To: Neil Horman
> >>Cc: Wang, Zhihong; dev@dpdk.org
> >>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>
> >>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >>>On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> >>>>
> >>>>>-----Original Message-----
> >>>>>From: Neil Horman [mailto:nhorman@tuxdriver.com]
> >>>>>Sent: Monday, January 19, 2015 9:02 PM
> >>>>>To: Wang, Zhihong
> >>>>>Cc: dev@dpdk.org
> >>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>>>>
> >>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com
> >>wrote:
> >>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
> >>platforms.
> >>>>>>It also extends memcpy test coverage with unaligned cases and
> >>>>>>more test
> >>>>>points.
> >>>>>>Optimization techniques are summarized below:
> >>>>>>
> >>>>>>1. Utilize full cache bandwidth
> >>>>>>
> >>>>>>2. Enforce aligned stores
> >>>>>>
> >>>>>>3. Apply load address alignment based on architecture features
> >>>>>>
> >>>>>>4. Make load/store address available as early as possible
> >>>>>>
> >>>>>>5. General optimization techniques like inlining, branch
> >>>>>>reducing, prefetch pattern access
> >>>>>>
> >>>>>>Zhihong Wang (4):
> >>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> >>>>>>   Removed unnecessary test cases in test_memcpy.c
> >>>>>>   Extended test coverage in test_memcpy_perf.c
> >>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> >>>>>>     platforms
> >>>>>>
> >>>>>>  app/test/Makefile                                  |   6 +
> >>>>>>  app/test/test_memcpy.c                             |  52 +-
> >>>>>>  app/test/test_memcpy_perf.c                        | 238 +++++---
> >>>>>>  .../common/include/arch/x86/rte_memcpy.h           | 664
> >>>>>+++++++++++++++------
> >>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> >>>>>>
> >>>>>>--
> >>>>>>1.9.3
> >>>>>>
> >>>>>>
> >>>>>Are you able to compile this with gcc 4.9.2?  The compilation of
> >>>>>test_memcpy_perf is taking forever for me.  It appears hung.
> >>>>>Neil
> >>>>
> >>>>Neil,
> >>>>
> >>>>Thanks for reporting this!
> >>>>It should compile but will take quite some time if the CPU doesn't support
> >>AVX2, the reason is that:
> >>>>1. The SSE & AVX memcpy implementation is more complicated than
> >>AVX2
> >>>>version thus the compiler takes more time to compile and optimize 2.
> >>>>The new test_memcpy_perf.c contains 126 constants memcpy calls for
> >>>>better test case coverage, that's quite a lot
> >>>>
> >>>>I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> >>>>1. The whole compile process takes 9'41" with the original
> >>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
> >>>>only 2'41" after I reduce the constant memcpy call number to 12 + 12
> >>>>= 24
> >>>>
> >>>>I'll reduce memcpy call in the next version of patch.
> >>>>
> >>>ok, thank you.  I'm all for optimzation, but I think a compile that
> >>>takes almost
> >>>10 minutes for a single file is going to generate some raised eyebrows
> >>>when end users start tinkering with it
> >>>
> >>>Neil
> >>>
> >>>>Zhihong (John)
> >>>>
> >>Even two minutes is a very long time to compile, IMHO. The whole of DPDK
> >>doesn't take that long to compile right now, and that's with a couple of huge
> >>header files with routing tables in it. Any chance you could cut compile time
> >>down to a few seconds while still having reasonable tests?
> >>Also, when there is AVX2 present on the system, what is the compile time
> >>like for that code?
> >>
> >>	/Bruce
> >Neil, Bruce,
> >
> >Some data first.
> >
> >Sandy Bridge without AVX2:
> >1. original w/ 10 constant memcpy: 2'25"
> >2. patch w/ 12 constant memcpy: 2'41"
> >3. patch w/ 63 constant memcpy: 9'41"
> >
> >Haswell with AVX2:
> >1. original w/ 10 constant memcpy: 1'57"
> >2. patch w/ 12 constant memcpy: 1'56"
> >3. patch w/ 63 constant memcpy: 3'16"
> >
> >Also, to address Bruce's question, we have to reduce test case to cut down compile time. Because we use:
> >1. intrinsics instead of assembly for better flexibility and can utilize more compiler optimization
> >2. complex function body for better performance
> >3. inlining
> >This increases compile time.
> >But I think it'd be okay to do that as long as we can select a fair set of test points.
> >
> >It'd be great if you could give some suggestion, say, 12 points.
> >
> >Zhihong (John)
> >
> >
> 
> While I agree in the general case these long compilation times is painful
> for the users, having a factor of 2-8x in memcpy operations is quite an
> improvement, specially in DPDK applications which need to deal
> (unfortunately) heavily on them -- e.g. IP fragmentation and reassembly.
> 
> Why not having a fast compilation by default, and a tunable config flag to
> enable a highly optimized version of rte_memcpy (e.g. RTE_EAL_OPT_MEMCPY)?
> 
> Marc
>
Out of interest, are these 2-8x improvements something you have benchmarked
in these app scenarios? [i.e. not just in micro-benchmarks].

/Bruce

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 13:02             ` Bruce Richardson
@ 2015-01-21 13:21               ` Marc Sune
  2015-01-21 13:26                 ` Bruce Richardson
  0 siblings, 1 reply; 48+ messages in thread
From: Marc Sune @ 2015-01-21 13:21 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev


On 21/01/15 14:02, Bruce Richardson wrote:
> On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
>> On 21/01/15 04:44, Wang, Zhihong wrote:
>>>> -----Original Message-----
>>>> From: Richardson, Bruce
>>>> Sent: Wednesday, January 21, 2015 12:15 AM
>>>> To: Neil Horman
>>>> Cc: Wang, Zhihong; dev@dpdk.org
>>>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>>>
>>>> On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
>>>>> On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
>>>>>>> -----Original Message-----
>>>>>>> From: Neil Horman [mailto:nhorman@tuxdriver.com]
>>>>>>> Sent: Monday, January 19, 2015 9:02 PM
>>>>>>> To: Wang, Zhihong
>>>>>>> Cc: dev@dpdk.org
>>>>>>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>>>>>>>
>>>>>>> On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com
>>>> wrote:
>>>>>>>> This patch set optimizes memcpy for DPDK for both SSE and AVX
>>>> platforms.
>>>>>>>> It also extends memcpy test coverage with unaligned cases and
>>>>>>>> more test
>>>>>>> points.
>>>>>>>> Optimization techniques are summarized below:
>>>>>>>>
>>>>>>>> 1. Utilize full cache bandwidth
>>>>>>>>
>>>>>>>> 2. Enforce aligned stores
>>>>>>>>
>>>>>>>> 3. Apply load address alignment based on architecture features
>>>>>>>>
>>>>>>>> 4. Make load/store address available as early as possible
>>>>>>>>
>>>>>>>> 5. General optimization techniques like inlining, branch
>>>>>>>> reducing, prefetch pattern access
>>>>>>>>
>>>>>>>> Zhihong Wang (4):
>>>>>>>>    Disabled VTA for memcpy test in app/test/Makefile
>>>>>>>>    Removed unnecessary test cases in test_memcpy.c
>>>>>>>>    Extended test coverage in test_memcpy_perf.c
>>>>>>>>    Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
>>>>>>>>      platforms
>>>>>>>>
>>>>>>>>   app/test/Makefile                                  |   6 +
>>>>>>>>   app/test/test_memcpy.c                             |  52 +-
>>>>>>>>   app/test/test_memcpy_perf.c                        | 238 +++++---
>>>>>>>>   .../common/include/arch/x86/rte_memcpy.h           | 664
>>>>>>> +++++++++++++++------
>>>>>>>>   4 files changed, 656 insertions(+), 304 deletions(-)
>>>>>>>>
>>>>>>>> --
>>>>>>>> 1.9.3
>>>>>>>>
>>>>>>>>
>>>>>>> Are you able to compile this with gcc 4.9.2?  The compilation of
>>>>>>> test_memcpy_perf is taking forever for me.  It appears hung.
>>>>>>> Neil
>>>>>> Neil,
>>>>>>
>>>>>> Thanks for reporting this!
>>>>>> It should compile but will take quite some time if the CPU doesn't support
>>>> AVX2, the reason is that:
>>>>>> 1. The SSE & AVX memcpy implementation is more complicated than
>>>> AVX2
>>>>>> version thus the compiler takes more time to compile and optimize 2.
>>>>>> The new test_memcpy_perf.c contains 126 constants memcpy calls for
>>>>>> better test case coverage, that's quite a lot
>>>>>>
>>>>>> I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
>>>>>> 1. The whole compile process takes 9'41" with the original
>>>>>> test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
>>>>>> only 2'41" after I reduce the constant memcpy call number to 12 + 12
>>>>>> = 24
>>>>>>
>>>>>> I'll reduce memcpy call in the next version of patch.
>>>>>>
>>>>> ok, thank you.  I'm all for optimzation, but I think a compile that
>>>>> takes almost
>>>>> 10 minutes for a single file is going to generate some raised eyebrows
>>>>> when end users start tinkering with it
>>>>>
>>>>> Neil
>>>>>
>>>>>> Zhihong (John)
>>>>>>
>>>> Even two minutes is a very long time to compile, IMHO. The whole of DPDK
>>>> doesn't take that long to compile right now, and that's with a couple of huge
>>>> header files with routing tables in it. Any chance you could cut compile time
>>>> down to a few seconds while still having reasonable tests?
>>>> Also, when there is AVX2 present on the system, what is the compile time
>>>> like for that code?
>>>>
>>>> 	/Bruce
>>> Neil, Bruce,
>>>
>>> Some data first.
>>>
>>> Sandy Bridge without AVX2:
>>> 1. original w/ 10 constant memcpy: 2'25"
>>> 2. patch w/ 12 constant memcpy: 2'41"
>>> 3. patch w/ 63 constant memcpy: 9'41"
>>>
>>> Haswell with AVX2:
>>> 1. original w/ 10 constant memcpy: 1'57"
>>> 2. patch w/ 12 constant memcpy: 1'56"
>>> 3. patch w/ 63 constant memcpy: 3'16"
>>>
>>> Also, to address Bruce's question, we have to reduce test case to cut down compile time. Because we use:
>>> 1. intrinsics instead of assembly for better flexibility and can utilize more compiler optimization
>>> 2. complex function body for better performance
>>> 3. inlining
>>> This increases compile time.
>>> But I think it'd be okay to do that as long as we can select a fair set of test points.
>>>
>>> It'd be great if you could give some suggestion, say, 12 points.
>>>
>>> Zhihong (John)
>>>
>>>
>> While I agree in the general case these long compilation times is painful
>> for the users, having a factor of 2-8x in memcpy operations is quite an
>> improvement, specially in DPDK applications which need to deal
>> (unfortunately) heavily on them -- e.g. IP fragmentation and reassembly.
>>
>> Why not having a fast compilation by default, and a tunable config flag to
>> enable a highly optimized version of rte_memcpy (e.g. RTE_EAL_OPT_MEMCPY)?
>>
>> Marc
>>
> Out of interest, are these 2-8x improvements something you have benchmarked
> in these app scenarios? [i.e. not just in micro-benchmarks].

How much that micro-speedup will end up affecting the performance of the 
entire application is something I cannot say, so I agree that we should 
probably have some additional benchmarks before deciding that it pays off 
to maintain two versions of rte_memcpy.

There are, however, a bunch of possible DPDK applications that could 
potentially benefit -- IP fragmentation, tunneling and specialized DPI 
applications, among others -- since they involve a reasonable number of 
memcpys per packet. My point was, *if* it proves beneficial enough, 
why not have it as an option?

Marc

>
> /Bruce

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 13:21               ` Marc Sune
@ 2015-01-21 13:26                 ` Bruce Richardson
  2015-01-21 19:49                   ` Stephen Hemminger
  2015-01-23  6:52                   ` Wang, Zhihong
  0 siblings, 2 replies; 48+ messages in thread
From: Bruce Richardson @ 2015-01-21 13:26 UTC (permalink / raw)
  To: Marc Sune; +Cc: dev

On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> 
> On 21/01/15 14:02, Bruce Richardson wrote:
> >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> >>On 21/01/15 04:44, Wang, Zhihong wrote:
> >>>>-----Original Message-----
> >>>>From: Richardson, Bruce
> >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> >>>>To: Neil Horman
> >>>>Cc: Wang, Zhihong; dev@dpdk.org
> >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>>>
> >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> >>>>>>>-----Original Message-----
> >>>>>>>From: Neil Horman [mailto:nhorman@tuxdriver.com]
> >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> >>>>>>>To: Wang, Zhihong
> >>>>>>>Cc: dev@dpdk.org
> >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>>>>>>
> >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800, zhihong.wang@intel.com
> >>>>wrote:
> >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and AVX
> >>>>platforms.
> >>>>>>>>It also extends memcpy test coverage with unaligned cases and
> >>>>>>>>more test
> >>>>>>>points.
> >>>>>>>>Optimization techniques are summarized below:
> >>>>>>>>
> >>>>>>>>1. Utilize full cache bandwidth
> >>>>>>>>
> >>>>>>>>2. Enforce aligned stores
> >>>>>>>>
> >>>>>>>>3. Apply load address alignment based on architecture features
> >>>>>>>>
> >>>>>>>>4. Make load/store address available as early as possible
> >>>>>>>>
> >>>>>>>>5. General optimization techniques like inlining, branch
> >>>>>>>>reducing, prefetch pattern access
> >>>>>>>>
> >>>>>>>>Zhihong Wang (4):
> >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
> >>>>>>>>     platforms
> >>>>>>>>
> >>>>>>>>  app/test/Makefile                                  |   6 +
> >>>>>>>>  app/test/test_memcpy.c                             |  52 +-
> >>>>>>>>  app/test/test_memcpy_perf.c                        | 238 +++++---
> >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h           | 664
> >>>>>>>+++++++++++++++------
> >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> >>>>>>>>
> >>>>>>>>--
> >>>>>>>>1.9.3
> >>>>>>>>
> >>>>>>>>
> >>>>>>>Are you able to compile this with gcc 4.9.2?  The compilation of
> >>>>>>>test_memcpy_perf is taking forever for me.  It appears hung.
> >>>>>>>Neil
> >>>>>>Neil,
> >>>>>>
> >>>>>>Thanks for reporting this!
> >>>>>>It should compile but will take quite some time if the CPU doesn't support
> >>>>AVX2, the reason is that:
> >>>>>>1. The SSE & AVX memcpy implementation is more complicated than
> >>>>AVX2
> >>>>>>version thus the compiler takes more time to compile and optimize 2.
> >>>>>>The new test_memcpy_perf.c contains 126 constants memcpy calls for
> >>>>>>better test case coverage, that's quite a lot
> >>>>>>
> >>>>>>I've just tested this patch on an Ivy Bridge machine with GCC 4.9.2:
> >>>>>>1. The whole compile process takes 9'41" with the original
> >>>>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2. It takes
> >>>>>>only 2'41" after I reduce the constant memcpy call number to 12 + 12
> >>>>>>= 24
> >>>>>>
> >>>>>>I'll reduce memcpy call in the next version of patch.
> >>>>>>
> >>>>>ok, thank you.  I'm all for optimzation, but I think a compile that
> >>>>>takes almost
> >>>>>10 minutes for a single file is going to generate some raised eyebrows
> >>>>>when end users start tinkering with it
> >>>>>
> >>>>>Neil
> >>>>>
> >>>>>>Zhihong (John)
> >>>>>>
> >>>>Even two minutes is a very long time to compile, IMHO. The whole of DPDK
> >>>>doesn't take that long to compile right now, and that's with a couple of huge
> >>>>header files with routing tables in it. Any chance you could cut compile time
> >>>>down to a few seconds while still having reasonable tests?
> >>>>Also, when there is AVX2 present on the system, what is the compile time
> >>>>like for that code?
> >>>>
> >>>>	/Bruce
> >>>Neil, Bruce,
> >>>
> >>>Some data first.
> >>>
> >>>Sandy Bridge without AVX2:
> >>>1. original w/ 10 constant memcpy: 2'25"
> >>>2. patch w/ 12 constant memcpy: 2'41"
> >>>3. patch w/ 63 constant memcpy: 9'41"
> >>>
> >>>Haswell with AVX2:
> >>>1. original w/ 10 constant memcpy: 1'57"
> >>>2. patch w/ 12 constant memcpy: 1'56"
> >>>3. patch w/ 63 constant memcpy: 3'16"
> >>>
> >>>Also, to address Bruce's question, we have to reduce test case to cut down compile time. Because we use:
> >>>1. intrinsics instead of assembly for better flexibility and can utilize more compiler optimization
> >>>2. complex function body for better performance
> >>>3. inlining
> >>>This increases compile time.
> >>>But I think it'd be okay to do that as long as we can select a fair set of test points.
> >>>
> >>>It'd be great if you could give some suggestion, say, 12 points.
> >>>
> >>>Zhihong (John)
> >>>
> >>>
> >>While I agree in the general case these long compilation times is painful
> >>for the users, having a factor of 2-8x in memcpy operations is quite an
> >>improvement, specially in DPDK applications which need to deal
> >>(unfortunately) heavily on them -- e.g. IP fragmentation and reassembly.
> >>
> >>Why not having a fast compilation by default, and a tunable config flag to
> >>enable a highly optimized version of rte_memcpy (e.g. RTE_EAL_OPT_MEMCPY)?
> >>
> >>Marc
> >>
> >Out of interest, are these 2-8x improvements something you have benchmarked
> >in these app scenarios? [i.e. not just in micro-benchmarks].
> 
> How much that micro-speedup will end up affecting the performance of the
> entire application is something I cannot say, so I agree that we should
> probably have some additional benchmarks before deciding that pays off
> maintaining 2 versions of rte_memcpy.
> 
> There are however a bunch of possible DPDK applications that could
> potentially benefit; IP fragmentation, tunneling and specialized DPI
> applications, among others, since they involve a reasonable amount of
> memcpys per pkt. My point was, *if* it proves that is enough beneficial, why
> not having it optionally?
> 
> Marc

I agree, if it provides the speedups then we need to have it in - and quite possibly
on by default, even.

/Bruce

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 13:26                 ` Bruce Richardson
@ 2015-01-21 19:49                   ` Stephen Hemminger
  2015-01-21 20:54                     ` Neil Horman
  2015-01-23  6:52                   ` Wang, Zhihong
  1 sibling, 1 reply; 48+ messages in thread
From: Stephen Hemminger @ 2015-01-21 19:49 UTC (permalink / raw)
  To: Bruce Richardson; +Cc: dev

On Wed, 21 Jan 2015 13:26:20 +0000
Bruce Richardson <bruce.richardson@intel.com> wrote:

> [...]
> 
> I agree, if it provides the speedups then we need to have it in - and quite possibly
> on by default, even.
> 
> /Bruce

One issue I have is that, as a vendor, we need to ship one binary, not different
distributions for each Intel chip variant. There is some support for per-chip
multi-versioned functions, but only in the latest GCC, which isn't in Debian
stable. And multi-versioned functions are going to be more expensive than
inlining. In some cases, I have seen fancy instructions that look good but have
nasty side effects like CPU stalls and/or increased power consumption that
turns off turbo boost.


Distros in general have the same problem with special-case optimizations.
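
The GCC facility being referred to (function multi-versioning) can also be
reached from plain C via an ifunc resolver, which picks an implementation once
at dynamic-link time. A minimal sketch, assuming a GCC/glibc toolchain with
ifunc and __builtin_cpu_supports() support; the memcpy_sse/memcpy_avx2/
pkt_memcpy names are illustrative, not DPDK code:

#include <stddef.h>

void *memcpy_sse(void *dst, const void *src, size_t n);
void *memcpy_avx2(void *dst, const void *src, size_t n);

typedef void *(*memcpy_fn_t)(void *, const void *, size_t);

/* Runs once when the binary is loaded and returns the variant
 * matching the CPU it actually landed on. */
static memcpy_fn_t
resolve_memcpy(void)
{
	__builtin_cpu_init();
	return __builtin_cpu_supports("avx2") ? memcpy_avx2 : memcpy_sse;
}

void *pkt_memcpy(void *dst, const void *src, size_t n)
	__attribute__((ifunc("resolve_memcpy")));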

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 19:49                   ` Stephen Hemminger
@ 2015-01-21 20:54                     ` Neil Horman
  2015-01-21 21:25                       ` Jim Thompson
  2015-01-22 18:21                       ` EDMISON, Kelvin (Kelvin)
  0 siblings, 2 replies; 48+ messages in thread
From: Neil Horman @ 2015-01-21 20:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev

On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> [...]
> One issue I have is that, as a vendor, we need to ship one binary, not different
> distributions for each Intel chip variant. There is some support for per-chip
> multi-versioned functions, but only in the latest GCC, which isn't in Debian
> stable. And multi-versioned functions are going to be more expensive than
> inlining. In some cases, I have seen fancy instructions that look good but have
> nasty side effects like CPU stalls and/or increased power consumption that
> turns off turbo boost.
> 
> Distros in general have the same problem with special-case optimizations.
> 
What we really need is to do something like borrowing the alternatives mechanism
from the kernel so that we can dynamically replace instructions at run time
based on CPU flags.  That way we could make the choice at run time, and wouldn't
have to do a lot of special-case jumping about.
Neil
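
A userspace stand-in for the kernel's alternatives idea, without actual
instruction patching, is to select the routine once at init and call through a
function pointer afterwards. A minimal sketch, assuming GCC's
__builtin_cpu_supports(); the *_impl functions are placeholders, not existing
DPDK symbols:

#include <stddef.h>

void *rte_memcpy_sse_impl(void *dst, const void *src, size_t n);
void *rte_memcpy_avx2_impl(void *dst, const void *src, size_t n);

/* Set once at startup; every later call is an indirect call, which
 * is exactly the cost-versus-inlining trade-off noted above. */
static void *(*rte_memcpy_ptr)(void *, const void *, size_t);

void rte_memcpy_select(void)
{
	if (__builtin_cpu_supports("avx2"))
		rte_memcpy_ptr = rte_memcpy_avx2_impl;
	else
		rte_memcpy_ptr = rte_memcpy_sse_impl;
}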

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 20:54                     ` Neil Horman
@ 2015-01-21 21:25                       ` Jim Thompson
  2015-01-22  0:53                         ` Stephen Hemminger
  2015-01-22  9:06                         ` Luke Gorrie
  2015-01-22 18:21                       ` EDMISON, Kelvin (Kelvin)
  1 sibling, 2 replies; 48+ messages in thread
From: Jim Thompson @ 2015-01-21 21:25 UTC (permalink / raw)
  To: Neil Horman; +Cc: dev


I’m not as concerned with compile times given the potential performance boost.

A long time ago (mid-80s) I was at Convex, and wanted to do a vector bcopy(), because it would make the I/O system (mostly disk then (*)) go faster.
The architect explained to me that the vector registers were for applications, not the kernel (as well as re-explaining the expense of vector context
switches, should the kernel be using the vector unit(s) when some application also wanted to use them).

The same is true today of AVX/AVX2, SSE, and even the AES-NI instructions.  Normally we don’t use these in kernel code (which is traditionally where
the networking stack has lived).

The differences with DPDK are that a) entire cores (including the AVX/SSE units and even the AES-NI (FPU) units) are dedicated to DPDK, and b) DPDK is a library,
and the resulting networking applications are exactly that, applications.  The “operating system” is now a control plane.

Jim

(* Back then it was commonly thought that TCP would never be able to fill a 10Gbps Ethernet.)

> On Jan 21, 2015, at 2:54 PM, Neil Horman <nhorman@tuxdriver.com> wrote:
> 
> On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
>> [...]
> What we really need is to do something like borrowing the alternatives mechanism
> from the kernel so that we can dynamically replace instructions at run time
> based on CPU flags.  That way we could make the choice at run time, and wouldn't
> have to do a lot of special-case jumping about.
> Neil
> 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 21:25                       ` Jim Thompson
@ 2015-01-22  0:53                         ` Stephen Hemminger
  2015-01-22  9:06                         ` Luke Gorrie
  1 sibling, 0 replies; 48+ messages in thread
From: Stephen Hemminger @ 2015-01-22  0:53 UTC (permalink / raw)
  To: Jim Thompson; +Cc: dev

On Wed, 21 Jan 2015 15:25:40 -0600
Jim Thompson <jim@netgate.com> wrote:

> I’m not as concerned with compile times given the potential performance boost.

Compile time matters. Right now a full build of a large project is fast,
like 2 minutes or less.


Is this only the test applications (which can be disabled from the build),
or the library itself trying to do some tests? Since the build and target
environments will be different on a real product, the whole scheme seems flawed.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 21:25                       ` Jim Thompson
  2015-01-22  0:53                         ` Stephen Hemminger
@ 2015-01-22  9:06                         ` Luke Gorrie
  2015-01-22 13:29                           ` Jay Rolette
  1 sibling, 1 reply; 48+ messages in thread
From: Luke Gorrie @ 2015-01-22  9:06 UTC (permalink / raw)
  To: Jim Thompson; +Cc: dev

Howdy!

This memcpy discussion is absolutely fascinating. Glad to be a fly on the
wall!

On 21 January 2015 at 22:25, Jim Thompson <jim@netgate.com> wrote:

>
> The differences with DPDK are that a) entire cores (including the AVX/SSE
> units and even AES-NI (FPU) are dedicated to DPDK, and b) DPDK is a library,
> and the resulting networking applications are exactly that, applications.
> The "operating system” is now a control plane.
>
>
Here is another thought: when is it time to start thinking of packet copy
as a cheap unit-time operation?

Packets are shrinking exponentially when measured in:

- Cache lines
- Cache load/store operations needed to copy
- Number of vector move instructions needed to copy

because those units are all based on exponentially growing quantities,
while the byte size of packets stays the same for many applications.

So when is it time to stop caring?

(Are we already there, even, under certain conditions? How about a Haswell CPU,
data already exclusively in our L1 cache, start and end both known to be
cache-line-aligned?)

Cheers,
-Luke (eagerly awaiting arrival of Haswell server...)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-22  9:06                         ` Luke Gorrie
@ 2015-01-22 13:29                           ` Jay Rolette
  2015-01-22 18:27                             ` Luke Gorrie
  0 siblings, 1 reply; 48+ messages in thread
From: Jay Rolette @ 2015-01-22 13:29 UTC (permalink / raw)
  To: Luke Gorrie; +Cc: dev

On Thu, Jan 22, 2015 at 3:06 AM, Luke Gorrie <luke@snabb.co> wrote:

> Here is another thought: when is it time to start thinking of packet copy
> as a cheap unit-time operation?
>

Pretty much never short of changes to memory architecture, IMO. Frankly,
there are never enough cycles for deep packet inspection applications that
need to run at/near line-rate. Don't waste any doing something you can
avoid in the first place.

Microseconds matter. Scaling up to 100GbE, nanoseconds matter.

Jay

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 20:54                     ` Neil Horman
  2015-01-21 21:25                       ` Jim Thompson
@ 2015-01-22 18:21                       ` EDMISON, Kelvin (Kelvin)
  2015-01-27  8:22                         ` Wang, Zhihong
  1 sibling, 1 reply; 48+ messages in thread
From: EDMISON, Kelvin (Kelvin) @ 2015-01-22 18:21 UTC (permalink / raw)
  To: dev



On 2015-01-21, 3:54 PM, "Neil Horman" <nhorman@tuxdriver.com> wrote:

>[...]
>
>What we really need is to do something like borrowing the alternatives
>mechanism from the kernel so that we can dynamically replace instructions
>at run time based on CPU flags. That way we could make the choice at run
>time, and wouldn't have to do a lot of special-case jumping about.
>Neil

+1.  

I think it should be an anti-requirement that the build machine be the
exact same chip as the deployment platform.

I like the CPU-flag inspection approach.  It would help in the case where
DPDK is in a VM and an odd set of CPU flags has been exposed.

If that approach doesn't work, though, then perhaps DPDK memcpy could go
through a benchmark pass at app startup time and select the most performant
option out of a set, like mdraid's raid6 implementation does.  To give an
example, this is what my systems print out at boot time re: raid6
algorithm selection:
raid6: sse2x1    3171 MB/s
raid6: sse2x2    3925 MB/s
raid6: sse2x4    4523 MB/s
raid6: using algorithm sse2x4 (4523 MB/s)
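
A DPDK analogue of that raid6-style selection could time each candidate once
during initialization and keep the winner. A rough sketch: rte_rdtsc() is
DPDK's TSC read, while the candidate table and its contents are hypothetical:

#include <stdint.h>
#include <stddef.h>
#include <rte_cycles.h>	/* rte_rdtsc() */

typedef void *(*memcpy_fn)(void *, const void *, size_t);

/* Hypothetical candidate table, filled in elsewhere. */
extern memcpy_fn memcpy_candidates[];
extern int memcpy_n_candidates;

static memcpy_fn
pick_fastest(void *dst, const void *src, size_t len)
{
	memcpy_fn best = memcpy_candidates[0];
	uint64_t best_cycles = UINT64_MAX;
	int i, r;

	for (i = 0; i < memcpy_n_candidates; i++) {
		uint64_t start = rte_rdtsc();

		/* run each candidate enough times for a stable count */
		for (r = 0; r < 1000; r++)
			memcpy_candidates[i](dst, src, len);

		uint64_t elapsed = rte_rdtsc() - start;
		if (elapsed < best_cycles) {
			best_cycles = elapsed;
			best = memcpy_candidates[i];
		}
	}
	return best;
}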

Regards,
   Kelvin

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-22 13:29                           ` Jay Rolette
@ 2015-01-22 18:27                             ` Luke Gorrie
  2015-01-22 19:36                               ` Jay Rolette
  0 siblings, 1 reply; 48+ messages in thread
From: Luke Gorrie @ 2015-01-22 18:27 UTC (permalink / raw)
  To: Jay Rolette; +Cc: dev

On 22 January 2015 at 14:29, Jay Rolette <rolette@infiniteio.com> wrote:

> Microseconds matter. Scaling up to 100GbE, nanoseconds matter.
>

True. Is there a cut-off point though? Does one nanosecond matter?

AVX512 will fit a 64-byte packet in one register and move that to or from
memory with one instruction. L1/L2 cache bandwidth per server is growing on
a double-exponential curve (both bandwidth per core and cores per CPU). I
wonder if moving data around in cache will soon be too cheap for us to
justify worrying about.
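
For the 64-byte case, the copy really is a single vector load plus a single
vector store on AVX-512 hardware. A sketch using the AVX-512F intrinsics
(assumes a compiler and CPU with AVX-512F support):

#include <immintrin.h>

/* One 64-byte packet: one 512-bit load, one 512-bit store. */
static inline void
copy_64B(void *dst, const void *src)
{
	_mm512_storeu_si512(dst, _mm512_loadu_si512(src));
}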

I suppose that 1500-byte-wide registers are still a ways off though ;-)

Cheers!
-Luke (begging your indulgence for wandering off on a tangent)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-22 18:27                             ` Luke Gorrie
@ 2015-01-22 19:36                               ` Jay Rolette
  0 siblings, 0 replies; 48+ messages in thread
From: Jay Rolette @ 2015-01-22 19:36 UTC (permalink / raw)
  To: Luke Gorrie; +Cc: dev

On Thu, Jan 22, 2015 at 12:27 PM, Luke Gorrie <luke@snabb.co> wrote:

> On 22 January 2015 at 14:29, Jay Rolette <rolette@infiniteio.com> wrote:
>
>> Microseconds matter. Scaling up to 100GbE, nanoseconds matter.
>>
>
> True. Is there a cut-off point though?
>

There are always engineering trade-offs that have to be made. If I'm
optimizing something today, I'm certainly not starting at something that
takes 1ns for an app that is doing L4-7 processing. It's all about
profiling and figuring out where the bottlenecks are.

For past networking products I've built, there was a lot of traffic that
the software didn't have to do much to. Minimal L2/L3 checks, then forward
the packet. It didn't even have to parse the headers because that was
offloaded to an FPGA. The only way to make those packets faster was to turn
them around in the FPGA and not send them to the CPU at all. That change
improved small packet performance by ~30%. That was on high-end network
processors that are significantly faster than Intel processors for packet
handling.

It seems to be a strange thing when you realize that just getting the
packets into the CPU is expensive, never mind what you do with them after
that.

> Does one nanosecond matter?
>

You just have to be careful when talking about things like a nanosecond.
It sounds really small, but the IPG for a 10G link is only 9.6ns. It's all
relative.
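
(For reference, 9.6ns is just the 12-byte minimum inter-packet gap at
10Gbit/s: 12 bytes * 8 bits / 10Gbit/s = 9.6ns. A minimum-size 64-byte frame
plus 8-byte preamble and the IPG comes to 84 bytes, or 67.2ns per packet,
i.e. roughly 14.88 Mpps.)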

> AVX512 will fit a 64-byte packet in one register and move that to or from
> memory with one instruction. L1/L2 cache bandwidth per server is growing on
> a double-exponential curve (both bandwidth per core and cores per CPU). I
> wonder if moving data around in cache will soon be too cheap for us to
> justify worrying about.
>

Adding cores helps with aggregate performance, but doesn't really help with
latency on a single packet. That said, I'll take advantage of anything I
can from the hardware to either let me scale up how much traffic I can
handle or the amount of features I can add at the same performance level!

Jay

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 12:38             ` Neil Horman
@ 2015-01-23  3:26               ` Wang, Zhihong
  0 siblings, 0 replies; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-23  3:26 UTC (permalink / raw)
  To: Neil Horman, Ananyev, Konstantin; +Cc: dev



> -----Original Message-----
> From: Neil Horman [mailto:nhorman@tuxdriver.com]
> Sent: Wednesday, January 21, 2015 8:38 PM
> To: Ananyev, Konstantin
> Cc: Wang, Zhihong; Richardson, Bruce; dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> On Wed, Jan 21, 2015 at 12:02:57PM +0000, Ananyev, Konstantin wrote:
> > [...]
> >
> > We use intrinsics and inlining in many other places too.
> > Why did it suddenly become a problem here?
> I agree, something just doesn't feel right here.  Not sure what it is yet, but I
> don't see how a memcpy function can be so complex as to take almost 10
> minutes to compile.  It's almost like we're recursively including something
> here and it's driving gcc into a huge loop.
> Neil

Konstantin, Bruce, Neil,

The reason why it took so long is simply that there are too many function calls.
Just keep in mind that there are (63 + 63) * 4 = 504 memcpy calls (inlined) with constant length; gcc will unroll the memcpy function body and generate instructions directly for each of them based on the immediate value.

I wrote a small program separately which includes rte_memcpy.h and a "main" function that calls rte_memcpy 120 * 4 = 480 times with constant lengths; it took 11 minutes to compile.
Also, the compile time doesn't grow linearly, because 1 group (120 calls) took less than 1 minute to compile.

So I think to reduce the compile time, we need to reduce the constant-length test cases; for example, the original file in dpdk 1.8.0 has only (10 + 10) * 4 calls.
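
To see the effect in isolation, a toy source file along the following lines
reproduces it: every call site has a distinct compile-time-constant size, so
the compiler specializes and unrolls each one. The DO4() macro is purely
illustrative; it mimics the "4 calls per size" structure described above:

#include <rte_memcpy.h>

static char dst[2048], src[2048];

/* Each constant n yields 4 call sites (different src/dst alignments),
 * and each one gets its own fully unrolled code path. */
#define DO4(n) do { \
	rte_memcpy(dst, src, (n)); \
	rte_memcpy(dst + 1, src, (n)); \
	rte_memcpy(dst, src + 1, (n)); \
	rte_memcpy(dst + 1, src + 1, (n)); \
} while (0)

int main(void)
{
	DO4(15); DO4(16); DO4(63); DO4(64);
	DO4(255); DO4(256); DO4(1023); DO4(1024);
	/* extend toward 120 sizes and compile time balloons */
	return 0;
}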

To Konstantin,
Intrinsics are not the problem. What I meant is that if we write assembly, gcc doesn't have to optimize it, but if we use intrinsics, gcc treats it like a piece of C code and optimizes it, which can increase compile time.

To Bruce,
The compile times I reported earlier in this thread were measured like this: make clean ; rm -rf x86_64-native-linuxapp-gcc ; then make with -j 1.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-21 13:26                 ` Bruce Richardson
  2015-01-21 19:49                   ` Stephen Hemminger
@ 2015-01-23  6:52                   ` Wang, Zhihong
  2015-01-26 18:29                     ` Ananyev, Konstantin
  1 sibling, 1 reply; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-23  6:52 UTC (permalink / raw)
  To: Richardson, Bruce, Marc Sune; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce Richardson
> Sent: Wednesday, January 21, 2015 9:26 PM
> To: Marc Sune
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> [...]
> 
> I agree, if it provides the speedups then we need to have it in - and quite
> possibly on by default, even.
> 
> /Bruce

Since we're clear now that the long compile time is mainly caused by the many inlined, constant-length function calls, I think it's okay not to add the optional config flag.
Would you agree?

Zhihong (John)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-19  1:53 [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization zhihong.wang
                   ` (4 preceding siblings ...)
  2015-01-19 13:02 ` [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization Neil Horman
@ 2015-01-25 14:50 ` Luke Gorrie
  2015-01-26  1:30   ` Wang, Zhihong
  2015-01-29  3:42 ` [dpdk-dev] " Fu, JingguoX
  6 siblings, 1 reply; 48+ messages in thread
From: Luke Gorrie @ 2015-01-25 14:50 UTC (permalink / raw)
  To: zhihong.wang; +Cc: dev, snabb-devel

Hi John,

On 19 January 2015 at 02:53, <zhihong.wang@intel.com> wrote:

> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test
> points.
>

I am really interested in this work you are doing on memory copies
optimized for packet data. I would like to understand it in more depth. I
have a lot of questions and ideas but let me try to keep it simple for now
:-)

How do you benchmark? Where does the "factor of 2-8" cited elsewhere in the
thread come from? How can I reproduce? What results are you seeing compared
with libc?

I did a quick benchmark this weekend based on cachebench
<http://icl.cs.utk.edu/projects/llcbench/cachebench.html>. This seems like
a fairly weak benchmark (always L1 cache, always same alignment, always
predictable branches). Do you think this is relevant? How does this compare
with your results?

I compared:
  rte_memcpy (the new optimized one compiled with gcc-4.9 and -march=native
and -O3)
  memcpy from glibc 2.19 (ubuntu 14.04)
  memcpy from glibc 2.20 (arch linux)

on hardware:
  E5-2620v3 (Haswell)
  E5-2650 (Sandy Bridge)

running cachebench like this:

./cachebench -p -e1 -x1 -m14


rte_memcpy.h on Haswell:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.01            89191.88        1.00
384             0.01            96505.43        0.92
512             0.01            96509.19        1.00
768             0.01            91475.72        1.06
1024            0.01            96293.82        0.95
1536            0.01            96521.66        1.00
2048            0.01            96522.87        1.00
3072            0.01            96525.53        1.00
4096            0.01            96522.79        1.00
6144            0.01            96507.71        1.00
8192            0.01            94584.41        1.02
12288           0.01            95062.80        0.99
16384           0.01            80493.46        1.18


libc 2.20 on Haswell:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.01            65978.64        1.00
384             0.01            100249.01       0.66
512             0.01            123476.55       0.81
768             0.01            144699.86       0.85
1024            0.01            159459.88       0.91
1536            0.01            168001.92       0.95
2048            0.01            80738.31        2.08
3072            0.01            80270.02        1.01
4096            0.01            84239.84        0.95
6144            0.01            90600.13        0.93
8192            0.01            89767.94        1.01
12288           0.01            92085.98        0.97
16384           0.01            92719.95        0.99


libc 2.19 on Haswell:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            59871.69        1.00
384             0.01            68545.94        0.87
512             0.01            72674.23        0.94
768             0.01            79257.47        0.92
1024            0.01            79740.43        0.99
1536            0.01            85483.67        0.93
2048            0.01            87703.68        0.97
3072            0.01            86685.71        1.01
4096            0.01            87147.84        0.99
6144            0.01            68622.96        1.27
8192            0.01            70591.25        0.97
12288           0.01            72621.28        0.97
16384           0.01            67713.63        1.07


rte_memcpy on Sandy Bridge:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            62158.19        1.00
384             0.01            73256.41        0.85
512             0.01            82032.16        0.89
768             0.01            73919.92        1.11
1024            0.01            75937.51        0.97
1536            0.01            78280.20        0.97
2048            0.01            79562.54        0.98
3072            0.01            80800.93        0.98
4096            0.01            81453.71        0.99
6144            0.01            81915.84        0.99
8192            0.01            82427.98        0.99
12288           0.01            82789.82        1.00
16384           0.01            67519.66        1.23



libc 2.20 on Sandy Bridge:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            48651.20        1.00
384             0.02            57653.91        0.84
512             0.01            67909.77        0.85
768             0.01            71177.75        0.95
1024            0.01            72519.48        0.98
1536            0.01            76686.24        0.95
2048            0.19            4975.55         15.41
3072            0.19            5091.97         0.98
4096            0.19            5152.38         0.99
6144            0.18            5211.26         0.99
8192            0.18            5245.27         0.99
12288           0.18            5276.50         0.99
16384           0.18            5209.80         1.01



libc 2.19 on Sandy Bridge:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.02            44970.51        1.00
384             0.02            51922.46        0.87
512             0.02            57230.56        0.91
768             0.02            63438.96        0.90
1024            0.01            67506.58        0.94
1536            0.01            72579.25        0.93
2048            0.01            75722.25        0.96
3072            0.01            71039.19        1.07
4096            0.01            73946.17        0.96
6144            0.02            40969.79        1.80
8192            0.02            41396.05        0.99
12288           0.02            41830.01        0.99
16384           0.02            42032.40        1.00


Last question: Why is rte_memcpy inline? (Would making it a library
function give you smaller code, comparable performance, and fast compiles?)

Cheers!
-Luke

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-20 17:15   ` Stephen Hemminger
  2015-01-20 19:16     ` Neil Horman
@ 2015-01-25 20:02     ` Jim Thompson
  1 sibling, 0 replies; 48+ messages in thread
From: Jim Thompson @ 2015-01-25 20:02 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: dev





> On Jan 20, 2015, at 11:15 AM, Stephen Hemminger <stephen@networkplumber.org> wrote:
> 
> On Mon, 19 Jan 2015 09:53:34 +0800
> zhihong.wang@intel.com wrote:
> 
>> Main code changes:
>> 
>> 1. Differentiate architectural features based on CPU flags
>> 
>>    a. Implement separated move functions for SSE/AVX/AVX2 to make full utilization of cache bandwidth
>> 
>>    b. Implement separated copy flow specifically optimized for target architecture
>> 
>> 2. Rewrite the memcpy function "rte_memcpy"
>> 
>>    a. Add store aligning
>> 
>>    b. Add load aligning based on architectural features
>> 
>>    c. Put block copy loop into inline move functions for better control of instruction order
>> 
>>    d. Eliminate unnecessary MOVs
>> 
>> 3. Rewrite the inline move functions
>> 
>>    a. Add move functions for unaligned load cases
>> 
>>    b. Change instruction order in copy loops for better pipeline utilization
>> 
>>    c. Use intrinsics instead of assembly code
>> 
>> 4. Remove slow glibc call for constant copies
>> 
>> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> 
> Dumb question: why not fix glibc memcpy instead?
> What is special about rte_memcpy?

In addition to the other points, FreeBSD doesn't use glibc on the target platform (though it is used on, say, MIPS), and FreeBSD is a supported DPDK platform.

So glibc isn't a solution. 

Jim

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-25 14:50 ` Luke Gorrie
@ 2015-01-26  1:30   ` Wang, Zhihong
  2015-01-26  8:03     ` Luke Gorrie
  0 siblings, 1 reply; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-26  1:30 UTC (permalink / raw)
  To: Luke Gorrie; +Cc: dev, snabb-devel

Hi Luke,

I’m very glad that you’re interested in this work. ☺

I never published any performance data, and haven’t run cachebench.
We mainly use test_memcpy_perf.c in DPDK to do the test, because it's the environment that DPDK runs in. You can also find the performance comparison with glibc there.
It can be launched in <target>/app/test: memcpy_perf_autotest.

Finally, inlining can bring benefits in practice, constant-value unrolling for example, and for DPDK we need every possible optimization.


Thanks
John


From: lukego@gmail.com [mailto:lukego@gmail.com] On Behalf Of Luke Gorrie
Sent: Sunday, January 25, 2015 10:50 PM
To: Wang, Zhihong
Cc: dev@dpdk.org; snabb-devel@googlegroups.com
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

[Luke's original message, quoted in full below the reply, snipped; see above for the full text.]



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-26  1:30   ` Wang, Zhihong
@ 2015-01-26  8:03     ` Luke Gorrie
  2015-01-27  7:19       ` Wang, Zhihong
  0 siblings, 1 reply; 48+ messages in thread
From: Luke Gorrie @ 2015-01-26  8:03 UTC (permalink / raw)
  To: Wang, Zhihong; +Cc: dev, snabb-devel

On 26 January 2015 at 02:30, Wang, Zhihong <zhihong.wang@intel.com> wrote:

>  Hi Luke,
>
>
>
> I’m very glad that you’re interested in this work. J
>

Great :).

  I never published any performance data, and haven’t run cachebench.
>
> We use test_memcpy_perf.c in DPDK to do the test mainly, because it’s the
> environment that DPDK runs. You can also find the performance comparison
> there with glibc.
>
> It can be launched in <target>/app/test: memcpy_perf_autotest.
>

Could you give me a command-line example to run this please? (Sorry if this
should be obvious.)


>   Finally, inline can bring benefits based on practice, constant value
> unrolling for example, and for DPDK we need all possible optimization.
>

Do we need to think about code size and potential instruction cache
thrashing?

For me one call to rte_memcpy compiles to 3520 instructions
<https://gist.github.com/lukego/8b17a07246d999331b04> in 20KB of object
code. That's more than half the size of the Haswell instruction cache
(32KB) per call.

glibc 2.20's memcpy_avx_unaligned
<https://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/x86_64/multiarch/memcpy-avx-unaligned.S;h=9f033f54568c3e5b6d9de9b3ba75f5be41070b92;hb=HEAD>
is only 909 bytes shared/total and also seems to have basically excellent
performance on Haswell.

So I am concerned about the code size of rte_memcpy, especially when
inlined, and meta-concerned about the nonlinear impact of nested inlined
functions on both compile time and object code size.


There is another issue that I am concerned about:

The Intel Optimization Guide suggests that rep movs is very efficient
starting in Ivy Bridge. In practice though it seems to be much slower than
using vector instructions, even though it is faster than it used to be in
Sandy Bridge. Is that true?

This could have a substantial impact on off-the-shelf memcpy. glibc 2.20's
memcpy uses movs for sizes >= 2048 and that is where performance takes a
dive for me (in microbenchmarks). GCC will also emit inline string move
instructions for certain constant-size memcpy calls at certain optimization
levels.
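
For instance, a minimal illustration (what gcc actually emits depends on
version, flags, and -march; an inline rep movs sequence is one possibility,
not a guarantee):

#include <string.h>

void copy4k(char *dst, const char *src)
{
        /* Constant-size call: gcc may expand this inline, possibly as a
         * string-move (rep movs) sequence, instead of calling libc memcpy. */
        memcpy(dst, src, 4096);
}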


So I feel like I haven't yet found the right memcpy for me, and we haven't
even started to look at the interesting parts like cache-coherence
behaviour when sharing data between cores (vhost) and whether streaming
load/store can be used to defend the state of cache lines between cores.


Do I make any sense? What do I miss?


Cheers,
-Luke

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-19  1:53 ` [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms zhihong.wang
  2015-01-20 17:15   ` Stephen Hemminger
@ 2015-01-26 14:43   ` Wodkowski, PawelX
  2015-01-27  5:12     ` Wang, Zhihong
  1 sibling, 1 reply; 48+ messages in thread
From: Wodkowski, PawelX @ 2015-01-26 14:43 UTC (permalink / raw)
  To: Wang, Zhihong, dev

Hi,

I must say: great work.

I have some small comments:

> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards
> are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)
> \
> +{                                                                                                           \
...
> +}

Why not do { ... } while(0) or ({ ... })? The bare compound statement could
have unpredictable side effects.
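
To illustrate the concern, a minimal sketch with a hypothetical COPY16
macro (not the actual patch code):

#include <emmintrin.h>

/* __m128i xmm0 is assumed pre-defined, mirroring the macro's requirements. */

/* Bare compound-statement form, as in the patch: */
#define COPY16_BARE(dst, src)                                   \
{                                                               \
        xmm0 = _mm_loadu_si128((const __m128i *)(src));         \
        _mm_storeu_si128((__m128i *)(dst), xmm0);               \
}

/* This breaks when used as a single statement before "else":
 *     if (n >= 16)
 *         COPY16_BARE(dst, src);   <- the ";" after "}" ends the if,
 *     else                         <- leaving this else dangling
 * The do/while(0) form expands to exactly one statement: */
#define COPY16_SAFE(dst, src)                                   \
do {                                                            \
        xmm0 = _mm_loadu_si128((const __m128i *)(src));         \
        _mm_storeu_si128((__m128i *)(dst), xmm0);               \
} while (0)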

Second:
Why did you completely substitute
#define rte_memcpy(dst, src, n)              \
	({ (__builtin_constant_p(n)) ?       \
	memcpy((dst), (src), (n)) :          \
	rte_memcpy_func((dst), (src), (n)); })

with an inline rte_memcpy()? This construction can help the compiler decide
which version to use: the (static?) inline implementation or a call to the
external function.
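
For illustration, a sketch of how this macro resolves at compile time
("example" below is a hypothetical caller; rte_memcpy_func stands for the
out-of-line implementation referenced above):

#include <string.h>
#include <stdint.h>
#include <stddef.h>

void *rte_memcpy_func(void *dst, const void *src, size_t n);

#define rte_memcpy(dst, src, n)              \
        ({ (__builtin_constant_p(n)) ?       \
        memcpy((dst), (src), (n)) :          \
        rte_memcpy_func((dst), (src), (n)); })

void example(uint8_t *dst, const uint8_t *src, size_t len)
{
        rte_memcpy(dst, src, 6);   /* constant n: becomes memcpy(dst, src, 6),
                                      which gcc expands inline to scalar moves */
        rte_memcpy(dst, src, len); /* variable n: calls rte_memcpy_func() */
}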

Did you try the 'extern inline' form? It could help reduce compilation time.

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-23  6:52                   ` Wang, Zhihong
@ 2015-01-26 18:29                     ` Ananyev, Konstantin
  2015-01-27  1:42                       ` Wang, Zhihong
  0 siblings, 1 reply; 48+ messages in thread
From: Ananyev, Konstantin @ 2015-01-26 18:29 UTC (permalink / raw)
  To: Wang, Zhihong, Richardson, Bruce, Marc Sune; +Cc: dev

Hi Zhihong,

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wang, Zhihong
> Sent: Friday, January 23, 2015 6:52 AM
> To: Richardson, Bruce; Marc Sune
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> [earlier thread quoted in full -- snipped; see above]
> 
> Since we're clear now that the long compile time is mainly caused by too many inline function calls, I think it's okay not to do this.
> Would you agree?

Actually I wonder if, instead of:

+	switch (srcofs) {
+	case 0x01: MOVEUNALIGNED_LEFT47(dst, src, n, 0x01); break;
+	case 0x02: MOVEUNALIGNED_LEFT47(dst, src, n, 0x02); break;
+	case 0x03: MOVEUNALIGNED_LEFT47(dst, src, n, 0x03); break;
+	case 0x04: MOVEUNALIGNED_LEFT47(dst, src, n, 0x04); break;
+	case 0x05: MOVEUNALIGNED_LEFT47(dst, src, n, 0x05); break;
+	case 0x06: MOVEUNALIGNED_LEFT47(dst, src, n, 0x06); break;
+	case 0x07: MOVEUNALIGNED_LEFT47(dst, src, n, 0x07); break;
+	case 0x08: MOVEUNALIGNED_LEFT47(dst, src, n, 0x08); break;
+	case 0x09: MOVEUNALIGNED_LEFT47(dst, src, n, 0x09); break;
+	case 0x0A: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0A); break;
+	case 0x0B: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0B); break;
+	case 0x0C: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0C); break;
+	case 0x0D: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0D); break;
+	case 0x0E: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0E); break;
+	case 0x0F: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0F); break;
+	default:;
+	}

We'll just do:
MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);

That should reduce the size of the generated code quite a bit, wouldn't it?
From the other side, MOVEUNALIGNED_LEFT47() is a pretty big chunk,
so the performance difference between having the offset value in a register
vs an immediate value shouldn't be significant.

Konstantin

> 
> Zhihong (John)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-26 18:29                     ` Ananyev, Konstantin
@ 2015-01-27  1:42                       ` Wang, Zhihong
  2015-01-27 11:30                         ` Ananyev, Konstantin
  0 siblings, 1 reply; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-27  1:42 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce, Marc Sune; +Cc: dev



> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 2:29 AM
> To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> Hi Zhihong,
> 
> > [earlier thread quoted in full -- snipped; see above]
> 
> Actually I wonder, if instead of:
> 
> +	switch (srcofs) {
> +	case 0x01: MOVEUNALIGNED_LEFT47(dst, src, n, 0x01); break;
> +	case 0x02: MOVEUNALIGNED_LEFT47(dst, src, n, 0x02); break;
> +	case 0x03: MOVEUNALIGNED_LEFT47(dst, src, n, 0x03); break;
> +	case 0x04: MOVEUNALIGNED_LEFT47(dst, src, n, 0x04); break;
> +	case 0x05: MOVEUNALIGNED_LEFT47(dst, src, n, 0x05); break;
> +	case 0x06: MOVEUNALIGNED_LEFT47(dst, src, n, 0x06); break;
> +	case 0x07: MOVEUNALIGNED_LEFT47(dst, src, n, 0x07); break;
> +	case 0x08: MOVEUNALIGNED_LEFT47(dst, src, n, 0x08); break;
> +	case 0x09: MOVEUNALIGNED_LEFT47(dst, src, n, 0x09); break;
> +	case 0x0A: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0A); break;
> +	case 0x0B: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0B); break;
> +	case 0x0C: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0C); break;
> +	case 0x0D: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0D); break;
> +	case 0x0E: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0E); break;
> +	case 0x0F: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0F); break;
> +	default:;
> +	}
> 
> We'll just do:
> MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> 
> That should reduce size of the generated code quite a bit, wouldn't it?
> From other side MOVEUNALIGNED_LEFT47() is pretty big chunk, so
> performance difference having offset value in a register vs immediate value
> shouldn't be significant.
> 
> Konstantin
> 
> >
> > Zhihong (John)

Hey Konstantin,

We have to use a switch here because PALIGNR requires the shift count to be an 8-bit immediate.
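
To make the constraint concrete, a minimal sketch (alignr_var is a
hypothetical helper, not code from the patch):

#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 maps to PALIGNR */

/* Concatenate hi:lo and shift right by "offset" bytes. Writing
 * _mm_alignr_epi8(hi, lo, offset) with a runtime offset does not compile:
 * the count goes into PALIGNR's imm8 field, so every possible value needs
 * its own immediate -- hence the switch in rte_memcpy. */
static __m128i
alignr_var(__m128i hi, __m128i lo, int offset)
{
        switch (offset) {
        case 0x01: return _mm_alignr_epi8(hi, lo, 0x01);
        case 0x02: return _mm_alignr_epi8(hi, lo, 0x02);
        /* ... cases 0x03 through 0x0E elided for brevity ... */
        case 0x0F: return _mm_alignr_epi8(hi, lo, 0x0F);
        default:   return lo;
        }
}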

Zhihong (John)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
  2015-01-26 14:43   ` Wodkowski, PawelX
@ 2015-01-27  5:12     ` Wang, Zhihong
  0 siblings, 0 replies; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-27  5:12 UTC (permalink / raw)
  To: Wodkowski, PawelX, dev



> -----Original Message-----
> From: Wodkowski, PawelX
> Sent: Monday, January 26, 2015 10:43 PM
> To: Wang, Zhihong; dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in
> arch/x86/rte_memcpy.h for both SSE and AVX platforms
> 
> Hi,
> 
> I must say: great work.
> 
> I have some small comments:
> 
> > +/**
> > + * Macro for copying unaligned block from one location to another,
> > + * 47 bytes leftover maximum,
> > + * locations should not overlap.
> > + * Requirements:
> > + * - Store is aligned
> > + * - Load offset is <offset>, which must be immediate value within [1, 15]
> > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit
> forwards
> > are available for loading
> > + * - <dst>, <src>, <len> must be variables
> > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> > + */
> > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)
> > \
> > +{                                                                                                           \
> ...
> > +}
> 
> Why not do { ... } while(0) or ({ ... }) ? This could have unpredictable side
> effects.
> 
> Second:
> Why you completely substitute
> #define rte_memcpy(dst, src, n)              \
> 	({ (__builtin_constant_p(n)) ?       \
> 	memcpy((dst), (src), (n)) :          \
> 	rte_memcpy_func((dst), (src), (n)); })
> 
> with inline rte_memcpy()? This construction  can help compiler to deduce
> which version to use (static?) inline implementation or call external
> function.
> 
> Did you try 'extern inline' type? It could help reducing compilation time.

Hi Pawel,

Good call on "MOVEUNALIGNED_LEFT47". Thanks!

I removed the conditional __builtin_constant_p(n) because it calls glibc memcpy when the parameter is constant, while rte_memcpy has better performance there.
The current long compile time is caused by too many function calls; I'll fix that in the next version.

Zhihong (John)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-26  8:03     ` Luke Gorrie
@ 2015-01-27  7:19       ` Wang, Zhihong
  2015-01-27 13:57         ` [dpdk-dev] [snabb-devel] " Luke Gorrie
  0 siblings, 1 reply; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-27  7:19 UTC (permalink / raw)
  To: Luke Gorrie; +Cc: dev, snabb-devel

Hey Luke,

Thanks for the excellent questions!

The following script will launch the memcpy test in DPDK:
echo -e 'memcpy_autotest\nmemcpy_perf_autotest\nquit\n' | ./x86_64-native-linuxapp-gcc/app/test -c 4 -n 4 -- -i

Thanks for sharing the object code; I think it's the Sandy Bridge version though.
The rte_memcpy for Haswell is quite simple too; this is a decision based on arch differences, since Haswell has significant improvements in its memory hierarchy.
The Sandy Bridge unaligned memcpy is large in size, but it has better performance, because converting unaligned loads into aligned ones is crucial for in-cache memcpy on Sandy Bridge.
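
To make the store-aligning idea concrete, a minimal sketch (hypothetical
helper, n >= 16 assumed; the real Sandy Bridge path additionally realigns
the loads with PALIGNR):

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Copy one unaligned 16-byte head, then bump dst to the next 16-byte
 * boundary; the overlapping bytes are simply rewritten by the aligned
 * loop, so every bulk store below is aligned. */
static void
copy_store_aligned(uint8_t *dst, const uint8_t *src, size_t n)
{
        size_t head = 16 - ((uintptr_t)dst & 0x0F);

        _mm_storeu_si128((__m128i *)dst,
                         _mm_loadu_si128((const __m128i *)src));
        dst += head; src += head; n -= head;

        while (n >= 16) {       /* dst is now 16-byte aligned */
                _mm_store_si128((__m128i *)dst,
                                _mm_loadu_si128((const __m128i *)src));
                dst += 16; src += 16; n -= 16;
        }
        /* tail of n < 16 bytes elided */
}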

The rep instruction is still not fast enough, but I can't say much about it since I haven't investigated it thoroughly.

To my understanding, memcpy optimization is all about trade-offs according to use cases, and this one is for the DPDK scenario (small size, in cache: you may find quite a few copies of only 6 bytes or so); you can refer to the RFC for this patch.
It's not likely that one could make a version that's optimal for all scenarios.

But I agree with the author of glibc memcpy on this: a program with too many memcpys is a program with a design flaw.


Thanks
Zhihong (John)

From: lukego@gmail.com [mailto:lukego@gmail.com] On Behalf Of Luke Gorrie
Sent: Monday, January 26, 2015 4:03 PM
To: Wang, Zhihong
Cc: dev@dpdk.org; snabb-devel@googlegroups.com
Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

[Luke's original message, quoted in full below the reply, snipped; see above for the full text.]



^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-22 18:21                       ` EDMISON, Kelvin (Kelvin)
@ 2015-01-27  8:22                         ` Wang, Zhihong
  2015-01-28 21:48                           ` EDMISON, Kelvin (Kelvin)
  0 siblings, 1 reply; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-27  8:22 UTC (permalink / raw)
  To: EDMISON, Kelvin (Kelvin), Stephen Hemminger, Neil Horman; +Cc: dev



> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of EDMISON, Kelvin
> (Kelvin)
> Sent: Friday, January 23, 2015 2:22 AM
> To: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> On 2015-01-21, 3:54 PM, "Neil Horman" <nhorman@tuxdriver.com> wrote:
> 
> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> On Wed, 21 Jan 2015 13:26:20 +0000
> >> Bruce Richardson <bruce.richardson@intel.com> wrote:
> >>
> >> > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> >> > >
> >> > > On 21/01/15 14:02, Bruce Richardson wrote:
> >> > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> >> > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> >> > > >>>>-----Original Message-----
> >> > > >>>>From: Richardson, Bruce
> >> > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> >> > > >>>>To: Neil Horman
> >> > > >>>>Cc: Wang, Zhihong; dev@dpdk.org
> >> > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >> > > >>>>
> >> > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> >> > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong
> wrote:
> >> > > >>>>>>>-----Original Message-----
> >> > > >>>>>>>From: Neil Horman [mailto:nhorman@tuxdriver.com]
> >> > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> >> > > >>>>>>>To: Wang, Zhihong
> >> > > >>>>>>>Cc: dev@dpdk.org
> >> > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> optimization
> >> > > >>>>>>>
> >> > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> >>zhihong.wang@intel.com
> >> > > >>>>wrote:
> >> > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> >>AVX
> >> > > >>>>platforms.
> >> > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> >>and
> >> > > >>>>>>>>more test
> >> > > >>>>>>>points.
> >> > > >>>>>>>>Optimization techniques are summarized below:
> >> > > >>>>>>>>
> >> > > >>>>>>>>1. Utilize full cache bandwidth
> >> > > >>>>>>>>
> >> > > >>>>>>>>2. Enforce aligned stores
> >> > > >>>>>>>>
> >> > > >>>>>>>>3. Apply load address alignment based on architecture
> >>features
> >> > > >>>>>>>>
> >> > > >>>>>>>>4. Make load/store address available as early as possible
> >> > > >>>>>>>>
> >> > > >>>>>>>>5. General optimization techniques like inlining, branch
> >> > > >>>>>>>>reducing, prefetch pattern access
> >> > > >>>>>>>>
> >> > > >>>>>>>>Zhihong Wang (4):
> >> > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> >> > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> >> > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> >> > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both
> SSE
> >>and AVX
> >> > > >>>>>>>>     platforms
> >> > > >>>>>>>>
> >> > > >>>>>>>>  app/test/Makefile                                  |   6 +
> >> > > >>>>>>>>  app/test/test_memcpy.c                             |  52
> >>+-
> >> > > >>>>>>>>  app/test/test_memcpy_perf.c                        | 238
> >>+++++---
> >> > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h           | 664
> >> > > >>>>>>>+++++++++++++++------
> >> > > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> >> > > >>>>>>>>
> >> > > >>>>>>>>--
> >> > > >>>>>>>>1.9.3
> >> > > >>>>>>>>
> >> > > >>>>>>>>
> >> > > >>>>>>>Are you able to compile this with gcc 4.9.2?  The
> >>compilation of
> >> > > >>>>>>>test_memcpy_perf is taking forever for me.  It appears hung.
> >> > > >>>>>>>Neil
> >> > > >>>>>>Neil,
> >> > > >>>>>>
> >> > > >>>>>>Thanks for reporting this!
> >> > > >>>>>>It should compile but will take quite some time if the CPU
> >>doesn't support
> >> > > >>>>AVX2, the reason is that:
> >> > > >>>>>>1. The SSE & AVX memcpy implementation is more
> complicated
> >>than
> >> > > >>>>AVX2
> >> > > >>>>>>version thus the compiler takes more time to compile and
> >>optimize 2.
> >> > > >>>>>>The new test_memcpy_perf.c contains 126 constants memcpy
> >>calls for
> >> > > >>>>>>better test case coverage, that's quite a lot
> >> > > >>>>>>
> >> > > >>>>>>I've just tested this patch on an Ivy Bridge machine with GCC
> >>4.9.2:
> >> > > >>>>>>1. The whole compile process takes 9'41" with the original
> >> > > >>>>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2.
> >>It takes
> >> > > >>>>>>only 2'41" after I reduce the constant memcpy call number to
> >>12 + 12
> >> > > >>>>>>= 24
> >> > > >>>>>>
> >> > > >>>>>>I'll reduce memcpy call in the next version of patch.
> >> > > >>>>>>
> >> > > >>>>>ok, thank you.  I'm all for optimzation, but I think a compile
> >>that
> >> > > >>>>>takes almost
> >> > > >>>>>10 minutes for a single file is going to generate some raised
> >>eyebrows
> >> > > >>>>>when end users start tinkering with it
> >> > > >>>>>
> >> > > >>>>>Neil
> >> > > >>>>>
> >> > > >>>>>>Zhihong (John)
> >> > > >>>>>>
> >> > > >>>>Even two minutes is a very long time to compile, IMHO. The
> >>whole of DPDK
> >> > > >>>>doesn't take that long to compile right now, and that's with a
> >>couple of huge
> >> > > >>>>header files with routing tables in it. Any chance you could
> >>cut compile time
> >> > > >>>>down to a few seconds while still having reasonable tests?
> >> > > >>>>Also, when there is AVX2 present on the system, what is the
> >>compile time
> >> > > >>>>like for that code?
> >> > > >>>>
> >> > > >>>>	/Bruce
> >> > > >>>Neil, Bruce,
> >> > > >>>
> >> > > >>>Some data first.
> >> > > >>>
> >> > > >>>Sandy Bridge without AVX2:
> >> > > >>>1. original w/ 10 constant memcpy: 2'25"
> >> > > >>>2. patch w/ 12 constant memcpy: 2'41"
> >> > > >>>3. patch w/ 63 constant memcpy: 9'41"
> >> > > >>>
> >> > > >>>Haswell with AVX2:
> >> > > >>>1. original w/ 10 constant memcpy: 1'57"
> >> > > >>>2. patch w/ 12 constant memcpy: 1'56"
> >> > > >>>3. patch w/ 63 constant memcpy: 3'16"
> >> > > >>>
> >> > > >>>Also, to address Bruce's question, we have to reduce test case
> >>to cut down compile time. Because we use:
> >> > > >>>1. intrinsics instead of assembly for better flexibility and can
> >>utilize more compiler optimization
> >> > > >>>2. complex function body for better performance
> >> > > >>>3. inlining
> >> > > >>>This increases compile time.
> >> > > >>>But I think it'd be okay to do that as long as we can select a
> >>fair set of test points.
> >> > > >>>
> >> > > >>>It'd be great if you could give some suggestion, say, 12 points.
> >> > > >>>
> >> > > >>>Zhihong (John)
> >> > > >>>
> >> > > >>>
> >> > > >>While I agree in the general case these long compilation times is
> >>painful
> >> > > >>for the users, having a factor of 2-8x in memcpy operations is
> >>quite an
> >> > > >>improvement, specially in DPDK applications which need to deal
> >> > > >>(unfortunately) heavily on them -- e.g. IP fragmentation and
> >>reassembly.
> >> > > >>
> >> > > >>Why not having a fast compilation by default, and a tunable
> >>config flag to
> >> > > >>enable a highly optimized version of rte_memcpy (e.g.
> >>RTE_EAL_OPT_MEMCPY)?
> >> > > >>
> >> > > >>Marc
> >> > > >>
> >> > > >Out of interest, are these 2-8x improvements something you have
> >>benchmarked
> >> > > >in these app scenarios? [i.e. not just in micro-benchmarks].
> >> > >
> >> > > How much that micro-speedup will end up affecting the performance
> >>of the
> >> > > entire application is something I cannot say, so I agree that we
> >>should
> >> > > probably have some additional benchmarks before deciding that pays
> >>off
> >> > > maintaining 2 versions of rte_memcpy.
> >> > >
> >> > > There are however a bunch of possible DPDK applications that could
> >> > > potentially benefit; IP fragmentation, tunneling and specialized DPI
> >> > > applications, among others, since they involve a reasonable amount
> >>of
> >> > > memcpys per pkt. My point was, *if* it proves that is enough
> >>beneficial, why
> >> > > not having it optionally?
> >> > >
> >> > > Marc
> >> >
> >> > I agree, if it provides the speedups then we need to have it in - and
> >>quite possibly
> >> > on by default, even.
> >> >
> >> > /Bruce
> >>
> >> One issue I have is that as a vendor we need to ship on binary, not
> >>different distributions
> >> for each Intel chip variant. There is some support for multi-chip
> >>version functions
> >> but only in latest Gcc which isn't in Debian stable. And the multi-chip
> >>version
> >> of functions is going to be more expensive than inlining. For some
> >>cases, I have
> >> seen that the overhead of fancy instructions looks good but have nasty
> >>side effects
> >> like CPU stall and/or increased power consumption which turns of turbo
> >>boost.
> >>
> >>
> >> Distro's in general have the same problem with special case
> >>optimizations.
> >>
> >What we really need is to do something like borrow the alternatives
> >mechanism
> >from the kernel so that we can dynamically replace instructions at run
> >time
> >based on cpu flags.  That way we could make the choice at run time, and
> >wouldn't
> >have to do alot of special case jumping about.
> >Neil
> 
> +1.
> 
> I think it should be an anti-requirement that the build machine be the
> exact same chip as the deployment platform.
> 
> I like the cpu flag inspection approach.  It would help in the case where
> DPDK is in a VM and an odd set of CPU flags have been exposed.
> 
> If that approach doesn't work though, then perhaps DPDK memcpy could go
> through a benchmarking at app startup time and select the most performant
> option out of a set, like mdraid's raid6 implementation does.  To give an
> example, this is what my systems print out at boot time re: raid6
> algorithm selection.
> raid6: sse2x1    3171 MB/s
> raid6: sse2x2    3925 MB/s
> raid6: sse2x4    4523 MB/s
> raid6: using algorithm sse2x4 (4523 MB/s)
> 
> Regards,
>    Kelvin
> 

Thanks for the proposal!

For DPDK, performance is always the most important concern. We need to utilize new architecture features to achieve that, so a per-architecture solution is necessary.
Even a few extra cycles can lead to bad performance if they're in a hot loop.
For instance, if DPDK takes 60 cycles to process a packet on average, then 3 more cycles here mean a 5% performance drop.

The dynamic solution is doable, but with performance penalties, even if they could be small. It may also bring extra complexity, which can lead to unpredictable behavior and side effects.
For example, the dynamic solution won't have inline unrolling, which can bring a significant performance benefit for small copies with constant length, like eth_addr.
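
To make the inline-unrolling point concrete, here is a minimal sketch (rte_memcpy_impl is a hypothetical name for illustration, not code from this patch set): with the inline version the compiler sees the constant length and can lower a 6-byte copy to a couple of scalar moves, while an implementation picked at startup from CPU flags forces an out-of-line call with the length in a register.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Inline path: the length is a compile-time constant, so the compiler
 * can expand this in place, e.g. as one 4-byte plus one 2-byte move. */
static inline void
copy_eth_addr_inline(uint8_t *dst, const uint8_t *src)
{
	memcpy(dst, src, 6);	/* constant length -> unrolled, no call */
}

/* Dynamic path: one of several implementations is picked at startup
 * based on CPU flags.  The indirect call cannot be inlined, so even a
 * 6-byte copy pays call/return overhead plus runtime length checks.
 * (rte_memcpy_impl is a hypothetical pointer, for illustration only.) */
extern void *(*rte_memcpy_impl)(void *dst, const void *src, size_t n);

static inline void
copy_eth_addr_dynamic(uint8_t *dst, const uint8_t *src)
{
	rte_memcpy_impl(dst, src, 6);
}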

We can investigate the VM scenario more.

Zhihong (John)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-27  1:42                       ` Wang, Zhihong
@ 2015-01-27 11:30                         ` Ananyev, Konstantin
  2015-01-27 12:19                           ` Ananyev, Konstantin
  0 siblings, 1 reply; 48+ messages in thread
From: Ananyev, Konstantin @ 2015-01-27 11:30 UTC (permalink / raw)
  To: Wang, Zhihong, Richardson, Bruce, Marc Sune; +Cc: dev



> -----Original Message-----
> From: Wang, Zhihong
> Sent: Tuesday, January 27, 2015 1:42 AM
> To: Ananyev, Konstantin; Richardson, Bruce; Marc Sune
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Tuesday, January 27, 2015 2:29 AM
> > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > Cc: dev@dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> > Hi Zhihong,
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wang, Zhihong
> > > Sent: Friday, January 23, 2015 6:52 AM
> > > To: Richardson, Bruce; Marc Sune
> > > Cc: dev@dpdk.org
> > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > > Richardson
> > > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > > To: Marc Sune
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > > >
> > > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > > >>>>-----Original Message-----
> > > > > >>>>From: Richardson, Bruce
> > > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > > >>>>To: Neil Horman
> > > > > >>>>Cc: Wang, Zhihong; dev@dpdk.org
> > > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > >>>>
> > > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> > > > > >>>>>>>-----Original Message-----
> > > > > >>>>>>>From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > > >>>>>>>To: Wang, Zhihong
> > > > > >>>>>>>Cc: dev@dpdk.org
> > > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > >>>>>>>optimization
> > > > > >>>>>>>
> > > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > >>>>>>>zhihong.wang@intel.com
> > > > > >>>>wrote:
> > > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> > > > > >>>>>>>>AVX
> > > > > >>>>platforms.
> > > > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> > > > > >>>>>>>>and more test
> > > > > >>>>>>>points.
> > > > > >>>>>>>>Optimization techniques are summarized below:
> > > > > >>>>>>>>
> > > > > >>>>>>>>1. Utilize full cache bandwidth
> > > > > >>>>>>>>
> > > > > >>>>>>>>2. Enforce aligned stores
> > > > > >>>>>>>>
> > > > > >>>>>>>>3. Apply load address alignment based on architecture
> > > > > >>>>>>>>features
> > > > > >>>>>>>>
> > > > > >>>>>>>>4. Make load/store address available as early as possible
> > > > > >>>>>>>>
> > > > > >>>>>>>>5. General optimization techniques like inlining, branch
> > > > > >>>>>>>>reducing, prefetch pattern access
> > > > > >>>>>>>>
> > > > > >>>>>>>>Zhihong Wang (4):
> > > > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > > > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > > > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > > > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> > > > and AVX
> > > > > >>>>>>>>     platforms
> > > > > >>>>>>>>
> > > > > >>>>>>>>  app/test/Makefile                                  |   6 +
> > > > > >>>>>>>>  app/test/test_memcpy.c                             |  52 +-
> > > > > >>>>>>>>  app/test/test_memcpy_perf.c                        | 238 +++++---
> > > > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h           | 664
> > > > > >>>>>>>+++++++++++++++------
> > > > > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > > >>>>>>>>
> > > > > >>>>>>>>--
> > > > > >>>>>>>>1.9.3
> > > > > >>>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>Are you able to compile this with gcc 4.9.2?  The
> > > > > >>>>>>>compilation of test_memcpy_perf is taking forever for me.  It
> > appears hung.
> > > > > >>>>>>>Neil
> > > > > >>>>>>Neil,
> > > > > >>>>>>
> > > > > >>>>>>Thanks for reporting this!
> > > > > >>>>>>It should compile but will take quite some time if the CPU
> > > > > >>>>>>doesn't support
> > > > > >>>>AVX2, the reason is that:
> > > > > >>>>>>1. The SSE & AVX memcpy implementation is more complicated
> > > > than
> > > > > >>>>AVX2
> > > > > >>>>>>version thus the compiler takes more time to compile and
> > > > > >>>>>>optimize
> > > > 2.
> > > > > >>>>>>The new test_memcpy_perf.c contains 126 constants memcpy
> > > > > >>>>>>calls for better test case coverage, that's quite a lot
> > > > > >>>>>>
> > > > > >>>>>>I've just tested this patch on an Ivy Bridge machine with GCC
> > 4.9.2:
> > > > > >>>>>>1. The whole compile process takes 9'41" with the original
> > > > > >>>>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2.
> > > > > >>>>>>It takes only 2'41" after I reduce the constant memcpy call
> > > > > >>>>>>number to 12 + 12 = 24
> > > > > >>>>>>
> > > > > >>>>>>I'll reduce memcpy call in the next version of patch.
> > > > > >>>>>>
> > > > > >>>>>ok, thank you.  I'm all for optimzation, but I think a
> > > > > >>>>>compile that takes almost
> > > > > >>>>>10 minutes for a single file is going to generate some raised
> > > > > >>>>>eyebrows when end users start tinkering with it
> > > > > >>>>>
> > > > > >>>>>Neil
> > > > > >>>>>
> > > > > >>>>>>Zhihong (John)
> > > > > >>>>>>
> > > > > >>>>Even two minutes is a very long time to compile, IMHO. The
> > > > > >>>>whole of DPDK doesn't take that long to compile right now, and
> > > > > >>>>that's with a couple of huge header files with routing tables
> > > > > >>>>in it. Any chance you could cut compile time down to a few
> > > > > >>>>seconds while still
> > > > having reasonable tests?
> > > > > >>>>Also, when there is AVX2 present on the system, what is the
> > > > > >>>>compile time like for that code?
> > > > > >>>>
> > > > > >>>>	/Bruce
> > > > > >>>Neil, Bruce,
> > > > > >>>
> > > > > >>>Some data first.
> > > > > >>>
> > > > > >>>Sandy Bridge without AVX2:
> > > > > >>>1. original w/ 10 constant memcpy: 2'25"
> > > > > >>>2. patch w/ 12 constant memcpy: 2'41"
> > > > > >>>3. patch w/ 63 constant memcpy: 9'41"
> > > > > >>>
> > > > > >>>Haswell with AVX2:
> > > > > >>>1. original w/ 10 constant memcpy: 1'57"
> > > > > >>>2. patch w/ 12 constant memcpy: 1'56"
> > > > > >>>3. patch w/ 63 constant memcpy: 3'16"
> > > > > >>>
> > > > > >>>Also, to address Bruce's question, we have to reduce test case
> > > > > >>>to cut
> > > > down compile time. Because we use:
> > > > > >>>1. intrinsics instead of assembly for better flexibility and
> > > > > >>>can utilize more compiler optimization 2. complex function body
> > > > > >>>for better performance 3. inlining This increases compile time.
> > > > > >>>But I think it'd be okay to do that as long as we can select a
> > > > > >>>fair set of
> > > > test points.
> > > > > >>>
> > > > > >>>It'd be great if you could give some suggestion, say, 12 points.
> > > > > >>>
> > > > > >>>Zhihong (John)
> > > > > >>>
> > > > > >>>
> > > > > >>While I agree in the general case these long compilation times
> > > > > >>is painful for the users, having a factor of 2-8x in memcpy
> > > > > >>operations is quite an improvement, specially in DPDK
> > > > > >>applications which need to deal
> > > > > >>(unfortunately) heavily on them -- e.g. IP fragmentation and
> > reassembly.
> > > > > >>
> > > > > >>Why not having a fast compilation by default, and a tunable
> > > > > >>config flag to enable a highly optimized version of rte_memcpy (e.g.
> > > > RTE_EAL_OPT_MEMCPY)?
> > > > > >>
> > > > > >>Marc
> > > > > >>
> > > > > >Out of interest, are these 2-8x improvements something you have
> > > > > >benchmarked in these app scenarios? [i.e. not just in micro-
> > benchmarks].
> > > > >
> > > > > How much that micro-speedup will end up affecting the performance
> > > > > of the entire application is something I cannot say, so I agree
> > > > > that we should probably have some additional benchmarks before
> > > > > deciding that pays off maintaining 2 versions of rte_memcpy.
> > > > >
> > > > > There are however a bunch of possible DPDK applications that could
> > > > > potentially benefit; IP fragmentation, tunneling and specialized
> > > > > DPI applications, among others, since they involve a reasonable
> > > > > amount of memcpys per pkt. My point was, *if* it proves that is
> > > > > enough beneficial, why not having it optionally?
> > > > >
> > > > > Marc
> > > >
> > > > I agree, if it provides the speedups then we need to have it in -
> > > > and quite possibly on by default, even.
> > > >
> > > > /Bruce
> > >
> > > Since we're clear now that the long compile time is mainly caused by too
> > many inline function calls, I think it's okay not to do this.
> > > Would you agree?
> >
> > Actually I wonder, if instead of:
> >
> > +	switch (srcofs) {
> > +	case 0x01: MOVEUNALIGNED_LEFT47(dst, src, n, 0x01); break;
> > +	case 0x02: MOVEUNALIGNED_LEFT47(dst, src, n, 0x02); break;
> > +	case 0x03: MOVEUNALIGNED_LEFT47(dst, src, n, 0x03); break;
> > +	case 0x04: MOVEUNALIGNED_LEFT47(dst, src, n, 0x04); break;
> > +	case 0x05: MOVEUNALIGNED_LEFT47(dst, src, n, 0x05); break;
> > +	case 0x06: MOVEUNALIGNED_LEFT47(dst, src, n, 0x06); break;
> > +	case 0x07: MOVEUNALIGNED_LEFT47(dst, src, n, 0x07); break;
> > +	case 0x08: MOVEUNALIGNED_LEFT47(dst, src, n, 0x08); break;
> > +	case 0x09: MOVEUNALIGNED_LEFT47(dst, src, n, 0x09); break;
> > +	case 0x0A: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0A); break;
> > +	case 0x0B: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0B); break;
> > +	case 0x0C: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0C); break;
> > +	case 0x0D: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0D); break;
> > +	case 0x0E: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0E); break;
> > +	case 0x0F: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0F); break;
> > +	default:;
> > +	}
> >
> > We'll just do:
> > MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> >
> > That should reduce size of the generated code quite a bit, wouldn't it?
> > From other side MOVEUNALIGNED_LEFT47() is pretty big chunk, so
> > performance difference having offset value in a register vs immediate value
> > shouldn't be significant.
> >
> > Konstantin
> >
> > >
> > > Zhihong (John)
> 
> Hey Konstantin,
> 
> We have to use switch here because PALIGNR requires the shift count to be an 8-bit immediate.

Ah ok, then can we move the switch inside the block of code that uses PALIGNR?
Or would that be too big a performance drop?
Konstantin

> 
> Zhihong (John)
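
The PALIGNR constraint is worth spelling out: the shift count is encoded in the instruction itself as an 8-bit immediate, so _mm_alignr_epi8() only accepts a compile-time constant, and gcc rejects a variable count at build time. A minimal sketch (build with gcc -mssse3; the second function is intentionally ill-formed):

#include <tmmintrin.h>	/* SSSE3: _mm_alignr_epi8() -> PALIGNR */

__m128i
shift_by_five(__m128i hi, __m128i lo)
{
	return _mm_alignr_epi8(hi, lo, 5);	/* literal count: builds fine */
}

__m128i
shift_by_n(__m128i hi, __m128i lo, int n)
{
	/* Does not build: gcc reports something like "the last argument
	 * must be an 8-bit immediate", because the count is part of the
	 * instruction encoding.  Hence the 15-case switch, where each
	 * case hands PALIGNR a literal. */
	return _mm_alignr_epi8(hi, lo, n);
}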

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-27 11:30                         ` Ananyev, Konstantin
@ 2015-01-27 12:19                           ` Ananyev, Konstantin
  2015-01-28  2:06                             ` Wang, Zhihong
  0 siblings, 1 reply; 48+ messages in thread
From: Ananyev, Konstantin @ 2015-01-27 12:19 UTC (permalink / raw)
  To: Wang, Zhihong, Richardson, Bruce, 'Marc Sune'
  Cc: 'dev@dpdk.org'



> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 11:30 AM
> To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> Cc: dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -----Original Message-----
> > From: Wang, Zhihong
> > Sent: Tuesday, January 27, 2015 1:42 AM
> > To: Ananyev, Konstantin; Richardson, Bruce; Marc Sune
> > Cc: dev@dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> >
> >
> > > -----Original Message-----
> > > From: Ananyev, Konstantin
> > > Sent: Tuesday, January 27, 2015 2:29 AM
> > > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > > Cc: dev@dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > > Hi Zhihong,
> > >
> > > > -----Original Message-----
> > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wang, Zhihong
> > > > Sent: Friday, January 23, 2015 6:52 AM
> > > > To: Richardson, Bruce; Marc Sune
> > > > Cc: dev@dpdk.org
> > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > > > Richardson
> > > > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > > > To: Marc Sune
> > > > > Cc: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >
> > > > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > > > >
> > > > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > > > >>>>-----Original Message-----
> > > > > > >>>>From: Richardson, Bruce
> > > > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > > > >>>>To: Neil Horman
> > > > > > >>>>Cc: Wang, Zhihong; dev@dpdk.org
> > > > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > > >>>>
> > > > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong wrote:
> > > > > > >>>>>>>-----Original Message-----
> > > > > > >>>>>>>From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > > > >>>>>>>To: Wang, Zhihong
> > > > > > >>>>>>>Cc: dev@dpdk.org
> > > > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > >>>>>>>optimization
> > > > > > >>>>>>>
> > > > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > > >>>>>>>zhihong.wang@intel.com
> > > > > > >>>>wrote:
> > > > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both SSE and
> > > > > > >>>>>>>>AVX
> > > > > > >>>>platforms.
> > > > > > >>>>>>>>It also extends memcpy test coverage with unaligned cases
> > > > > > >>>>>>>>and more test
> > > > > > >>>>>>>points.
> > > > > > >>>>>>>>Optimization techniques are summarized below:
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>1. Utilize full cache bandwidth
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>2. Enforce aligned stores
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>3. Apply load address alignment based on architecture
> > > > > > >>>>>>>>features
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>4. Make load/store address available as early as possible
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>5. General optimization techniques like inlining, branch
> > > > > > >>>>>>>>reducing, prefetch pattern access
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>Zhihong Wang (4):
> > > > > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > > > > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > > > > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > > > > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> > > > > and AVX
> > > > > > >>>>>>>>     platforms
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>  app/test/Makefile                                  |   6 +
> > > > > > >>>>>>>>  app/test/test_memcpy.c                             |  52 +-
> > > > > > >>>>>>>>  app/test/test_memcpy_perf.c                        | 238 +++++---
> > > > > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h           | 664
> > > > > > >>>>>>>+++++++++++++++------
> > > > > > >>>>>>>>  4 files changed, 656 insertions(+), 304 deletions(-)
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>--
> > > > > > >>>>>>>>1.9.3
> > > > > > >>>>>>>>
> > > > > > >>>>>>>>
> > > > > > >>>>>>>Are you able to compile this with gcc 4.9.2?  The
> > > > > > >>>>>>>compilation of test_memcpy_perf is taking forever for me.  It
> > > appears hung.
> > > > > > >>>>>>>Neil
> > > > > > >>>>>>Neil,
> > > > > > >>>>>>
> > > > > > >>>>>>Thanks for reporting this!
> > > > > > >>>>>>It should compile but will take quite some time if the CPU
> > > > > > >>>>>>doesn't support
> > > > > > >>>>AVX2, the reason is that:
> > > > > > >>>>>>1. The SSE & AVX memcpy implementation is more complicated
> > > > > than
> > > > > > >>>>AVX2
> > > > > > >>>>>>version thus the compiler takes more time to compile and
> > > > > > >>>>>>optimize
> > > > > 2.
> > > > > > >>>>>>The new test_memcpy_perf.c contains 126 constants memcpy
> > > > > > >>>>>>calls for better test case coverage, that's quite a lot
> > > > > > >>>>>>
> > > > > > >>>>>>I've just tested this patch on an Ivy Bridge machine with GCC
> > > 4.9.2:
> > > > > > >>>>>>1. The whole compile process takes 9'41" with the original
> > > > > > >>>>>>test_memcpy_perf.c (63 + 63 = 126 constant memcpy calls) 2.
> > > > > > >>>>>>It takes only 2'41" after I reduce the constant memcpy call
> > > > > > >>>>>>number to 12 + 12 = 24
> > > > > > >>>>>>
> > > > > > >>>>>>I'll reduce memcpy call in the next version of patch.
> > > > > > >>>>>>
> > > > > > >>>>>ok, thank you.  I'm all for optimzation, but I think a
> > > > > > >>>>>compile that takes almost
> > > > > > >>>>>10 minutes for a single file is going to generate some raised
> > > > > > >>>>>eyebrows when end users start tinkering with it
> > > > > > >>>>>
> > > > > > >>>>>Neil
> > > > > > >>>>>
> > > > > > >>>>>>Zhihong (John)
> > > > > > >>>>>>
> > > > > > >>>>Even two minutes is a very long time to compile, IMHO. The
> > > > > > >>>>whole of DPDK doesn't take that long to compile right now, and
> > > > > > >>>>that's with a couple of huge header files with routing tables
> > > > > > >>>>in it. Any chance you could cut compile time down to a few
> > > > > > >>>>seconds while still
> > > > > having reasonable tests?
> > > > > > >>>>Also, when there is AVX2 present on the system, what is the
> > > > > > >>>>compile time like for that code?
> > > > > > >>>>
> > > > > > >>>>	/Bruce
> > > > > > >>>Neil, Bruce,
> > > > > > >>>
> > > > > > >>>Some data first.
> > > > > > >>>
> > > > > > >>>Sandy Bridge without AVX2:
> > > > > > >>>1. original w/ 10 constant memcpy: 2'25"
> > > > > > >>>2. patch w/ 12 constant memcpy: 2'41"
> > > > > > >>>3. patch w/ 63 constant memcpy: 9'41"
> > > > > > >>>
> > > > > > >>>Haswell with AVX2:
> > > > > > >>>1. original w/ 10 constant memcpy: 1'57"
> > > > > > >>>2. patch w/ 12 constant memcpy: 1'56"
> > > > > > >>>3. patch w/ 63 constant memcpy: 3'16"
> > > > > > >>>
> > > > > > >>>Also, to address Bruce's question, we have to reduce test case
> > > > > > >>>to cut
> > > > > down compile time. Because we use:
> > > > > > >>>1. intrinsics instead of assembly for better flexibility and
> > > > > > >>>can utilize more compiler optimization 2. complex function body
> > > > > > >>>for better performance 3. inlining This increases compile time.
> > > > > > >>>But I think it'd be okay to do that as long as we can select a
> > > > > > >>>fair set of
> > > > > test points.
> > > > > > >>>
> > > > > > >>>It'd be great if you could give some suggestion, say, 12 points.
> > > > > > >>>
> > > > > > >>>Zhihong (John)
> > > > > > >>>
> > > > > > >>>
> > > > > > >>While I agree in the general case these long compilation times
> > > > > > >>is painful for the users, having a factor of 2-8x in memcpy
> > > > > > >>operations is quite an improvement, specially in DPDK
> > > > > > >>applications which need to deal
> > > > > > >>(unfortunately) heavily on them -- e.g. IP fragmentation and
> > > reassembly.
> > > > > > >>
> > > > > > >>Why not having a fast compilation by default, and a tunable
> > > > > > >>config flag to enable a highly optimized version of rte_memcpy (e.g.
> > > > > RTE_EAL_OPT_MEMCPY)?
> > > > > > >>
> > > > > > >>Marc
> > > > > > >>
> > > > > > >Out of interest, are these 2-8x improvements something you have
> > > > > > >benchmarked in these app scenarios? [i.e. not just in micro-
> > > benchmarks].
> > > > > >
> > > > > > How much that micro-speedup will end up affecting the performance
> > > > > > of the entire application is something I cannot say, so I agree
> > > > > > that we should probably have some additional benchmarks before
> > > > > > deciding that pays off maintaining 2 versions of rte_memcpy.
> > > > > >
> > > > > > There are however a bunch of possible DPDK applications that could
> > > > > > potentially benefit; IP fragmentation, tunneling and specialized
> > > > > > DPI applications, among others, since they involve a reasonable
> > > > > > amount of memcpys per pkt. My point was, *if* it proves that is
> > > > > > enough beneficial, why not having it optionally?
> > > > > >
> > > > > > Marc
> > > > >
> > > > > I agree, if it provides the speedups then we need to have it in -
> > > > > and quite possibly on by default, even.
> > > > >
> > > > > /Bruce
> > > >
> > > > Since we're clear now that the long compile time is mainly caused by too
> > > many inline function calls, I think it's okay not to do this.
> > > > Would you agree?
> > >
> > > Actually I wonder, if instead of:
> > >
> > > +	switch (srcofs) {
> > > +	case 0x01: MOVEUNALIGNED_LEFT47(dst, src, n, 0x01); break;
> > > +	case 0x02: MOVEUNALIGNED_LEFT47(dst, src, n, 0x02); break;
> > > +	case 0x03: MOVEUNALIGNED_LEFT47(dst, src, n, 0x03); break;
> > > +	case 0x04: MOVEUNALIGNED_LEFT47(dst, src, n, 0x04); break;
> > > +	case 0x05: MOVEUNALIGNED_LEFT47(dst, src, n, 0x05); break;
> > > +	case 0x06: MOVEUNALIGNED_LEFT47(dst, src, n, 0x06); break;
> > > +	case 0x07: MOVEUNALIGNED_LEFT47(dst, src, n, 0x07); break;
> > > +	case 0x08: MOVEUNALIGNED_LEFT47(dst, src, n, 0x08); break;
> > > +	case 0x09: MOVEUNALIGNED_LEFT47(dst, src, n, 0x09); break;
> > > +	case 0x0A: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0A); break;
> > > +	case 0x0B: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0B); break;
> > > +	case 0x0C: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0C); break;
> > > +	case 0x0D: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0D); break;
> > > +	case 0x0E: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0E); break;
> > > +	case 0x0F: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0F); break;
> > > +	default:;
> > > +	}
> > >
> > > We'll just do:
> > > MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> > >
> > > That should reduce size of the generated code quite a bit, wouldn't it?
> > > From other side MOVEUNALIGNED_LEFT47() is pretty big chunk, so
> > > performance difference having offset value in a register vs immediate value
> > > shouldn't be significant.
> > >
> > > Konstantin
> > >
> > > >
> > > > Zhihong (John)
> >
> > Hey Konstantin,
> >
> > We have to use switch here because PALIGNR requires the shift count to be an 8-bit immediate.
> 
> Ah ok, then can we move the switch inside the block of code that uses PALIGNR?
> Or would that be too big a performance drop?

I meant 'inside the MOVEUNALIGNED_LEFT47() macro'. :)

> Konstantin
> 
> >
> > Zhihong (John)

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [snabb-devel] RE: [PATCH 0/4] DPDK memcpy optimization
  2015-01-27  7:19       ` Wang, Zhihong
@ 2015-01-27 13:57         ` Luke Gorrie
  0 siblings, 0 replies; 48+ messages in thread
From: Luke Gorrie @ 2015-01-27 13:57 UTC (permalink / raw)
  To: snabb-devel; +Cc: dev

Hi again John,

Thank you for the patient answers :-)

Thank you for pointing this out: I was mistakenly testing your Sandy Bridge
code on Haswell (lacking -DRTE_MACHINE_CPUFLAG_AVX2).

Correcting that, your code is both the fastest and the smallest in my
humble micro benchmarking tests.

Looks like you have done great work! You probably knew that already :-) but
thank you for walking me through it.

The code compiles to 745 bytes of object code (smaller than glibc 2.20
memcpy) and cachebenches like this:

                Memory Copy Library Cache Test

C Size          Nanosec         MB/sec          % Chnge
-------         -------         -------         -------
256             0.01            97587.60        1.00
384             0.01            97628.83        1.00
512             0.01            97613.95        1.00
768             0.01            147811.44       0.66
1024            0.01            158938.68       0.93
1536            0.01            168487.49       0.94
2048            0.01            174278.83       0.97
3072            0.01            156922.58       1.11
4096            0.01            145811.59       1.08
6144            0.01            157388.27       0.93
8192            0.01            149616.95       1.05
12288           0.01            149064.26       1.00
16384           0.01            107895.06       1.38

The key difference, from my perspective, is that glibc 2.20 memcpy
performance goes way down for >= 2048 bytes, when it switches from vector
moves to string moves, while your code stays consistent.

I will take it for a spin in a real application.

Cheers,
-Luke
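
For anyone wanting to produce numbers in this spirit, a rough sketch of a throughput loop (simplified; cachebench itself does more careful timing and access-pattern permutations, and memcpy() can be swapped for rte_memcpy() to compare):

/* bench.c -- build with: gcc -O2 -std=gnu99 bench.c -lrt */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int
main(void)
{
	static const size_t sizes[] = { 256, 1024, 2048, 8192 };
	const long iters = 1000000;
	char *src = malloc(16384);
	char *dst = malloc(16384);

	memset(src, 0xa5, 16384);
	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		struct timespec t0, t1;
		size_t n = sizes[i];
		double sec;

		clock_gettime(CLOCK_MONOTONIC, &t0);
		for (long k = 0; k < iters; k++) {
			memcpy(dst, src, n);
			/* keep the compiler from eliding the copies */
			__asm__ volatile("" : : "r" (dst) : "memory");
		}
		clock_gettime(CLOCK_MONOTONIC, &t1);

		sec = (t1.tv_sec - t0.tv_sec) +
		      (t1.tv_nsec - t0.tv_nsec) / 1e9;
		printf("%6zu bytes: %9.1f MB/s\n", n,
		       (double)n * iters / sec / 1e6);
	}
	free(src);
	free(dst);
	return 0;
}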

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-27 12:19                           ` Ananyev, Konstantin
@ 2015-01-28  2:06                             ` Wang, Zhihong
  0 siblings, 0 replies; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-28  2:06 UTC (permalink / raw)
  To: Ananyev, Konstantin, Richardson, Bruce, 'Marc Sune'
  Cc: 'dev@dpdk.org'



> -----Original Message-----
> From: Ananyev, Konstantin
> Sent: Tuesday, January 27, 2015 8:20 PM
> To: Wang, Zhihong; Richardson, Bruce; 'Marc Sune'
> Cc: 'dev@dpdk.org'
> Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> 
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Tuesday, January 27, 2015 11:30 AM
> > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > Cc: dev@dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >
> >
> >
> > > -----Original Message-----
> > > From: Wang, Zhihong
> > > Sent: Tuesday, January 27, 2015 1:42 AM
> > > To: Ananyev, Konstantin; Richardson, Bruce; Marc Sune
> > > Cc: dev@dpdk.org
> > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Ananyev, Konstantin
> > > > Sent: Tuesday, January 27, 2015 2:29 AM
> > > > To: Wang, Zhihong; Richardson, Bruce; Marc Sune
> > > > Cc: dev@dpdk.org
> > > > Subject: RE: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > >
> > > > Hi Zhihong,
> > > >
> > > > > -----Original Message-----
> > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Wang,
> > > > > Zhihong
> > > > > Sent: Friday, January 23, 2015 6:52 AM
> > > > > To: Richardson, Bruce; Marc Sune
> > > > > Cc: dev@dpdk.org
> > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > >
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Bruce
> > > > > > Richardson
> > > > > > Sent: Wednesday, January 21, 2015 9:26 PM
> > > > > > To: Marc Sune
> > > > > > Cc: dev@dpdk.org
> > > > > > Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> > > > > >
> > > > > > On Wed, Jan 21, 2015 at 02:21:25PM +0100, Marc Sune wrote:
> > > > > > >
> > > > > > > On 21/01/15 14:02, Bruce Richardson wrote:
> > > > > > > >On Wed, Jan 21, 2015 at 01:36:41PM +0100, Marc Sune wrote:
> > > > > > > >>On 21/01/15 04:44, Wang, Zhihong wrote:
> > > > > > > >>>>-----Original Message-----
> > > > > > > >>>>From: Richardson, Bruce
> > > > > > > >>>>Sent: Wednesday, January 21, 2015 12:15 AM
> > > > > > > >>>>To: Neil Horman
> > > > > > > >>>>Cc: Wang, Zhihong; dev@dpdk.org
> > > > > > > >>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > > >>>>optimization
> > > > > > > >>>>
> > > > > > > >>>>On Tue, Jan 20, 2015 at 10:11:18AM -0500, Neil Horman wrote:
> > > > > > > >>>>>On Tue, Jan 20, 2015 at 03:01:44AM +0000, Wang, Zhihong
> wrote:
> > > > > > > >>>>>>>-----Original Message-----
> > > > > > > >>>>>>>From: Neil Horman [mailto:nhorman@tuxdriver.com]
> > > > > > > >>>>>>>Sent: Monday, January 19, 2015 9:02 PM
> > > > > > > >>>>>>>To: Wang, Zhihong
> > > > > > > >>>>>>>Cc: dev@dpdk.org
> > > > > > > >>>>>>>Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy
> > > > > > > >>>>>>>optimization
> > > > > > > >>>>>>>
> > > > > > > >>>>>>>On Mon, Jan 19, 2015 at 09:53:30AM +0800,
> > > > > > > >>>>>>>zhihong.wang@intel.com
> > > > > > > >>>>wrote:
> > > > > > > >>>>>>>>This patch set optimizes memcpy for DPDK for both
> > > > > > > >>>>>>>>SSE and AVX
> > > > > > > >>>>platforms.
> > > > > > > >>>>>>>>It also extends memcpy test coverage with unaligned
> > > > > > > >>>>>>>>cases and more test
> > > > > > > >>>>>>>points.
> > > > > > > >>>>>>>>Optimization techniques are summarized below:
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>1. Utilize full cache bandwidth
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>2. Enforce aligned stores
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>3. Apply load address alignment based on
> > > > > > > >>>>>>>>architecture features
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>4. Make load/store address available as early as
> > > > > > > >>>>>>>>possible
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>5. General optimization techniques like inlining,
> > > > > > > >>>>>>>>branch reducing, prefetch pattern access
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>Zhihong Wang (4):
> > > > > > > >>>>>>>>   Disabled VTA for memcpy test in app/test/Makefile
> > > > > > > >>>>>>>>   Removed unnecessary test cases in test_memcpy.c
> > > > > > > >>>>>>>>   Extended test coverage in test_memcpy_perf.c
> > > > > > > >>>>>>>>   Optimized memcpy in arch/x86/rte_memcpy.h for
> > > > > > > >>>>>>>>both SSE
> > > > > > and AVX
> > > > > > > >>>>>>>>     platforms
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>  app/test/Makefile                                  |   6 +
> > > > > > > >>>>>>>>  app/test/test_memcpy.c                             |  52 +-
> > > > > > > >>>>>>>>  app/test/test_memcpy_perf.c                        | 238
> +++++---
> > > > > > > >>>>>>>>  .../common/include/arch/x86/rte_memcpy.h           |
> 664
> > > > > > > >>>>>>>+++++++++++++++------
> > > > > > > >>>>>>>>  4 files changed, 656 insertions(+), 304
> > > > > > > >>>>>>>> deletions(-)
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>--
> > > > > > > >>>>>>>>1.9.3
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>>
> > > > > > > >>>>>>>Are you able to compile this with gcc 4.9.2?  The
> > > > > > > >>>>>>>compilation of test_memcpy_perf is taking forever for
> > > > > > > >>>>>>>me.  It
> > > > appears hung.
> > > > > > > >>>>>>>Neil
> > > > > > > >>>>>>Neil,
> > > > > > > >>>>>>
> > > > > > > >>>>>>Thanks for reporting this!
> > > > > > > >>>>>>It should compile but will take quite some time if the
> > > > > > > >>>>>>CPU doesn't support
> > > > > > > >>>>AVX2, the reason is that:
> > > > > > > >>>>>>1. The SSE & AVX memcpy implementation is more
> > > > > > > >>>>>>complicated
> > > > > > than
> > > > > > > >>>>AVX2
> > > > > > > >>>>>>version thus the compiler takes more time to compile
> > > > > > > >>>>>>and optimize
> > > > > > 2.
> > > > > > > >>>>>>The new test_memcpy_perf.c contains 126 constants
> > > > > > > >>>>>>memcpy calls for better test case coverage, that's
> > > > > > > >>>>>>quite a lot
> > > > > > > >>>>>>
> > > > > > > >>>>>>I've just tested this patch on an Ivy Bridge machine
> > > > > > > >>>>>>with GCC
> > > > 4.9.2:
> > > > > > > >>>>>>1. The whole compile process takes 9'41" with the
> > > > > > > >>>>>>original test_memcpy_perf.c (63 + 63 = 126 constant
> memcpy calls) 2.
> > > > > > > >>>>>>It takes only 2'41" after I reduce the constant memcpy
> > > > > > > >>>>>>call number to 12 + 12 = 24
> > > > > > > >>>>>>
> > > > > > > >>>>>>I'll reduce memcpy call in the next version of patch.
> > > > > > > >>>>>>
> > > > > > > >>>>>ok, thank you.  I'm all for optimzation, but I think a
> > > > > > > >>>>>compile that takes almost
> > > > > > > >>>>>10 minutes for a single file is going to generate some
> > > > > > > >>>>>raised eyebrows when end users start tinkering with it
> > > > > > > >>>>>
> > > > > > > >>>>>Neil
> > > > > > > >>>>>
> > > > > > > >>>>>>Zhihong (John)
> > > > > > > >>>>>>
> > > > > > > >>>>Even two minutes is a very long time to compile, IMHO.
> > > > > > > >>>>The whole of DPDK doesn't take that long to compile
> > > > > > > >>>>right now, and that's with a couple of huge header files
> > > > > > > >>>>with routing tables in it. Any chance you could cut
> > > > > > > >>>>compile time down to a few seconds while still
> > > > > > having reasonable tests?
> > > > > > > >>>>Also, when there is AVX2 present on the system, what is
> > > > > > > >>>>the compile time like for that code?
> > > > > > > >>>>
> > > > > > > >>>>	/Bruce
> > > > > > > >>>Neil, Bruce,
> > > > > > > >>>
> > > > > > > >>>Some data first.
> > > > > > > >>>
> > > > > > > >>>Sandy Bridge without AVX2:
> > > > > > > >>>1. original w/ 10 constant memcpy: 2'25"
> > > > > > > >>>2. patch w/ 12 constant memcpy: 2'41"
> > > > > > > >>>3. patch w/ 63 constant memcpy: 9'41"
> > > > > > > >>>
> > > > > > > >>>Haswell with AVX2:
> > > > > > > >>>1. original w/ 10 constant memcpy: 1'57"
> > > > > > > >>>2. patch w/ 12 constant memcpy: 1'56"
> > > > > > > >>>3. patch w/ 63 constant memcpy: 3'16"
> > > > > > > >>>
> > > > > > > >>>Also, to address Bruce's question, we have to reduce test
> > > > > > > >>>case to cut
> > > > > > down compile time. Because we use:
> > > > > > > >>>1. intrinsics instead of assembly for better flexibility
> > > > > > > >>>and can utilize more compiler optimization 2. complex
> > > > > > > >>>function body for better performance 3. inlining This increases
> compile time.
> > > > > > > >>>But I think it'd be okay to do that as long as we can
> > > > > > > >>>select a fair set of
> > > > > > test points.
> > > > > > > >>>
> > > > > > > >>>It'd be great if you could give some suggestion, say, 12 points.
> > > > > > > >>>
> > > > > > > >>>Zhihong (John)
> > > > > > > >>>
> > > > > > > >>>
> > > > > > > >>While I agree in the general case these long compilation
> > > > > > > >>times is painful for the users, having a factor of 2-8x in
> > > > > > > >>memcpy operations is quite an improvement, specially in
> > > > > > > >>DPDK applications which need to deal
> > > > > > > >>(unfortunately) heavily on them -- e.g. IP fragmentation
> > > > > > > >>and
> > > > reassembly.
> > > > > > > >>
> > > > > > > >>Why not having a fast compilation by default, and a
> > > > > > > >>tunable config flag to enable a highly optimized version of
> rte_memcpy (e.g.
> > > > > > RTE_EAL_OPT_MEMCPY)?
> > > > > > > >>
> > > > > > > >>Marc
> > > > > > > >>
> > > > > > > >Out of interest, are these 2-8x improvements something you
> > > > > > > >have benchmarked in these app scenarios? [i.e. not just in
> > > > > > > >micro-
> > > > benchmarks].
> > > > > > >
> > > > > > > How much that micro-speedup will end up affecting the
> > > > > > > performance of the entire application is something I cannot
> > > > > > > say, so I agree that we should probably have some additional
> > > > > > > benchmarks before deciding that pays off maintaining 2 versions
> of rte_memcpy.
> > > > > > >
> > > > > > > There are however a bunch of possible DPDK applications that
> > > > > > > could potentially benefit; IP fragmentation, tunneling and
> > > > > > > specialized DPI applications, among others, since they
> > > > > > > involve a reasonable amount of memcpys per pkt. My point
> > > > > > > was, *if* it proves that is enough beneficial, why not having it
> optionally?
> > > > > > >
> > > > > > > Marc
> > > > > >
> > > > > > I agree, if it provides the speedups then we need to have it
> > > > > > in - and quite possibly on by default, even.
> > > > > >
> > > > > > /Bruce
> > > > >
> > > > > Since we're clear now that the long compile time is mainly
> > > > > caused by too
> > > > many inline function calls, I think it's okay not to do this.
> > > > > Would you agree?
> > > >
> > > > Actually I wonder, if instead of:
> > > >
> > > > +	switch (srcofs) {
> > > > +	case 0x01: MOVEUNALIGNED_LEFT47(dst, src, n, 0x01); break;
> > > > +	case 0x02: MOVEUNALIGNED_LEFT47(dst, src, n, 0x02); break;
> > > > +	case 0x03: MOVEUNALIGNED_LEFT47(dst, src, n, 0x03); break;
> > > > +	case 0x04: MOVEUNALIGNED_LEFT47(dst, src, n, 0x04); break;
> > > > +	case 0x05: MOVEUNALIGNED_LEFT47(dst, src, n, 0x05); break;
> > > > +	case 0x06: MOVEUNALIGNED_LEFT47(dst, src, n, 0x06); break;
> > > > +	case 0x07: MOVEUNALIGNED_LEFT47(dst, src, n, 0x07); break;
> > > > +	case 0x08: MOVEUNALIGNED_LEFT47(dst, src, n, 0x08); break;
> > > > +	case 0x09: MOVEUNALIGNED_LEFT47(dst, src, n, 0x09); break;
> > > > +	case 0x0A: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0A); break;
> > > > +	case 0x0B: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0B); break;
> > > > +	case 0x0C: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0C); break;
> > > > +	case 0x0D: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0D); break;
> > > > +	case 0x0E: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0E); break;
> > > > +	case 0x0F: MOVEUNALIGNED_LEFT47(dst, src, n, 0x0F); break;
> > > > +	default:;
> > > > +	}
> > > >
> > > > We'll just do:
> > > > MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> > > >
> > > > That should reduce size of the generated code quite a bit, wouldn't it?
> > > > From other side MOVEUNALIGNED_LEFT47() is pretty big chunk, so
> > > > performance difference having offset value in a register vs
> > > > immediate value shouldn't be significant.
> > > >
> > > > Konstantin
> > > >
> > > > >
> > > > > Zhihong (John)
> > >
> > > Hey Konstantin,
> > >
> > > We have to use switch here because PALIGNR requires the shift count to
> be an 8-bit immediate.
> >
> > Ah ok, then can we move the switch inside the block of code that uses
> > PALIGNR?
> > Or would that be too big a performance drop?
> 
> I meant 'inside the MOVEUNALIGNED_LEFT47() macro'. :)

I think it's more a matter of programming taste :) and I agree that it looks clearer inside the macro.
Will add this in the next version. Thanks!

Zhihong (John)

> 
> > Konstantin
> >
> > >
> > > Zhihong (John)
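
In sketch form, the restructuring being agreed on here (shape only, with MOVEUNALIGNED_LEFT47_IMM as an assumed helper name; not the final patch text):

/* Inside rte_memcpy.h: the macro dispatches on the runtime offset, and
 * each case still hands the PALIGNR-based helper the 8-bit immediate
 * it requires. */
#define MOVEUNALIGNED_LEFT47(dst, src, len, offset)                       \
__extension__ ({                                                          \
	switch (offset) {                                                 \
	case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x01); break;  \
	case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x02); break;  \
	/* ... cases 0x03 through 0x0E elided ... */                      \
	case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, len, 0x0F); break;  \
	default:;                                                         \
	}                                                                 \
})

/* Call site: one invocation instead of a 15-case switch. */
MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);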

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-27  8:22                         ` Wang, Zhihong
@ 2015-01-28 21:48                           ` EDMISON, Kelvin (Kelvin)
  2015-01-29  1:53                             ` Wang, Zhihong
  0 siblings, 1 reply; 48+ messages in thread
From: EDMISON, Kelvin (Kelvin) @ 2015-01-28 21:48 UTC (permalink / raw)
  To: Wang, Zhihong, Stephen Hemminger, Neil Horman; +Cc: dev


On 2015-01-27, 3:22 AM, "Wang, Zhihong" <zhihong.wang@intel.com> wrote:

>
>
>> -----Original Message-----
>> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of EDMISON, Kelvin
>> (Kelvin)
>> Sent: Friday, January 23, 2015 2:22 AM
>> To: dev@dpdk.org
>> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
>> 
>> 
>> 
>> On 2015-01-21, 3:54 PM, "Neil Horman" <nhorman@tuxdriver.com> wrote:
>> 
>> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
>> >> On Wed, 21 Jan 2015 13:26:20 +0000
>> >> Bruce Richardson <bruce.richardson@intel.com> wrote:
>> >>
[..trim...]
>> >> One issue I have is that as a vendor we need to ship on binary, not
>> >>different distributions
>> >> for each Intel chip variant. There is some support for multi-chip
>> >>version functions
>> >> but only in latest Gcc which isn't in Debian stable. And the
>>multi-chip
>> >>version
>> >> of functions is going to be more expensive than inlining. For some
>> >>cases, I have
>> >> seen that the overhead of fancy instructions looks good but have
>>nasty
>> >>side effects
>> >> like CPU stall and/or increased power consumption which turns of
>>turbo
>> >>boost.
>> >>
>> >>
>> >> Distro's in general have the same problem with special case
>> >>optimizations.
>> >>
>> >What we really need is to do something like borrow the alternatives
>> >mechanism
>> >from the kernel so that we can dynamically replace instructions at run
>> >time
>> >based on cpu flags.  That way we could make the choice at run time, and
>> >wouldn't
>> >have to do alot of special case jumping about.
>> >Neil
>> 
>> +1.
>> 
>> I think it should be an anti-requirement that the build machine be the
>> exact same chip as the deployment platform.
>> 
>> I like the cpu flag inspection approach.  It would help in the case
>>where
>> DPDK is in a VM and an odd set of CPU flags have been exposed.
>> 
>> If that approach doesn't work though, then perhaps DPDK memcpy could go
>> through a benchmarking at app startup time and select the most
>>performant
>> option out of a set, like mdraid's raid6 implementation does.  To give
>>an
>> example, this is what my systems print out at boot time re: raid6
>> algorithm selection.
>> raid6: sse2x1    3171 MB/s
>> raid6: sse2x2    3925 MB/s
>> raid6: sse2x4    4523 MB/s
>> raid6: using algorithm sse2x4 (4523 MB/s)
>> 
>> Regards,
>>    Kelvin
>> 
>
>Thanks for the proposal!
>
>For DPDK, performance is always the most important concern. We need to
>utilize new architecture features to achieve that, so solution per arch
>is necessary.
>Even a few extra cycles can lead to bad performance if they're in a hot
>loop.
>For instance, let's assume DPDK takes 60 cycles to process a packet on
>average, then 3 more cycles here means 5% performance drop.
>
>The dynamic solution is doable but with performance penalties, even if it
>could be small. Also it may bring extra complexity, which can lead to
>unpredictable behaviors and side effects.
>For example, the dynamic solution won't have inline unrolling, which can
>bring significant performance benefit for small copies with constant
>length, like eth_addr.
>
>We can investigate the VM scenario more.
>
>Zhihong (John)

John,

  Thanks for taking the time to answer my newbie question. I deeply
appreciate the attention paid to performance in DPDK. I have a follow-up
though.

I'm trying to figure out what requirements this approach creates for the
software build environment.  If we want to build optimized versions for
Haswell, Ivy Bridge, Sandy Bridge, etc., does this mean that we must have
one of each micro-architecture available for running the builds, or is
there a way of cross-compiling for all micro-architectures from just one
build environment?

Thanks,
  Kelvin 

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-28 21:48                           ` EDMISON, Kelvin (Kelvin)
@ 2015-01-29  1:53                             ` Wang, Zhihong
  0 siblings, 0 replies; 48+ messages in thread
From: Wang, Zhihong @ 2015-01-29  1:53 UTC (permalink / raw)
  To: EDMISON, Kelvin (Kelvin), Stephen Hemminger, Neil Horman; +Cc: dev



> -----Original Message-----
> From: EDMISON, Kelvin (Kelvin) [mailto:kelvin.edmison@alcatel-lucent.com]
> Sent: Thursday, January 29, 2015 5:48 AM
> To: Wang, Zhihong; Stephen Hemminger; Neil Horman
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> 
> 
> On 2015-01-27, 3:22 AM, "Wang, Zhihong" <zhihong.wang@intel.com> wrote:
> 
> >
> >
> >> -----Original Message-----
> >> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of EDMISON,
> Kelvin
> >> (Kelvin)
> >> Sent: Friday, January 23, 2015 2:22 AM
> >> To: dev@dpdk.org
> >> Subject: Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
> >>
> >>
> >>
> >> On 2015-01-21, 3:54 PM, "Neil Horman" <nhorman@tuxdriver.com>
> wrote:
> >>
> >> >On Wed, Jan 21, 2015 at 11:49:47AM -0800, Stephen Hemminger wrote:
> >> >> On Wed, 21 Jan 2015 13:26:20 +0000 Bruce Richardson
> >> >> <bruce.richardson@intel.com> wrote:
> >> >>
> [..trim...]
> >> >> One issue I have is that as a vendor we need to ship on binary,
> >> >>not different distributions  for each Intel chip variant. There is
> >> >>some support for multi-chip version functions  but only in latest
> >> >>Gcc which isn't in Debian stable. And the
> >>multi-chip
> >> >>version
> >> >> of functions is going to be more expensive than inlining. For some
> >> >>cases, I have  seen that the overhead of fancy instructions looks
> >> >>good but have
> >>nasty
> >> >>side effects
> >> >> like CPU stall and/or increased power consumption which turns of
> >>turbo
> >> >>boost.
> >> >>
> >> >>
> >> >> Distro's in general have the same problem with special case
> >> >>optimizations.
> >> >>
> >> >What we really need is to do something like borrow the alternatives
> >> >mechanism from the kernel so that we can dynamically replace
> >> >instructions at run time based on cpu flags.  That way we could make
> >> >the choice at run time, and wouldn't have to do alot of special case
> >> >jumping about.
> >> >Neil
> >>
> >> +1.
> >>
> >> I think it should be an anti-requirement that the build machine be
> >> the exact same chip as the deployment platform.
> >>
> >> I like the cpu flag inspection approach.  It would help in the case
> >>where  DPDK is in a VM and an odd set of CPU flags have been exposed.
> >>
> >> If that approach doesn't work though, then perhaps DPDK memcpy could
> >>go  through a benchmarking at app startup time and select the most
> >>performant  option out of a set, like mdraid's raid6 implementation
> >>does.  To give an  example, this is what my systems print out at boot
> >>time re: raid6  algorithm selection.
> >> raid6: sse2x1    3171 MB/s
> >> raid6: sse2x2    3925 MB/s
> >> raid6: sse2x4    4523 MB/s
> >> raid6: using algorithm sse2x4 (4523 MB/s)
> >>
> >> Regards,
> >>    Kelvin
> >>
> >
> >Thanks for the proposal!
> >
> >For DPDK, performance is always the most important concern. We need to
> >utilize new architecture features to achieve that, so solution per arch
> >is necessary.
> >Even a few extra cycles can lead to bad performance if they're in a hot
> >loop.
> >For instance, let's assume DPDK takes 60 cycles to process a packet on
> >average, then 3 more cycles here means 5% performance drop.
> >
> >The dynamic solution is doable but with performance penalties, even if
> >it could be small. Also it may bring extra complexity, which can lead
> >to unpredictable behaviors and side effects.
> >For example, the dynamic solution won't have inline unrolling, which
> >can bring significant performance benefit for small copies with
> >constant length, like eth_addr.
> >
> >We can investigate the VM scenario more.
> >
> >Zhihong (John)
> 
> John,
> 
>   Thanks for taking the time to answer my newbie question. I deeply
> appreciate the attention paid to performance in DPDK. I have a follow-up
> though.
> 
> I'm trying to figure out what requirements this approach creates for the
> software build environment.  If we want to build optimized versions for
> Haswell, Ivy Bridge, Sandy Bridge, etc, does this mean that we must have one
> of each micro-architecture available for running the builds, or is there a way
> of cross-compiling for all micro-architectures from just one build
> environment?
> 
> Thanks,
>   Kelvin
> 

I'm not an expert in this, just some facts based on my tests: the compile process depends on the compiler and library versions.
So even on a machine that doesn't support the necessary ISA, the code should still compile as long as gcc, glibc, etc. support it; you'll just get "Illegal instruction" when trying to launch the compiled binary.

Therefore, if there's a way (worst case scenario: change flags manually) to make the DPDK build process think it's on a Haswell machine, it will produce Haswell binaries.

Zhihong (John)
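
A standalone sketch of that point, independent of the DPDK build system: code using AVX2 intrinsics builds on any x86-64 host whose toolchain knows the target ISA, and only faults when run on a CPU that lacks it.

/* avx2_probe.c -- build with: gcc -O2 -march=core-avx2 avx2_probe.c
 * This compiles even on, say, a Sandy Bridge build host; running the
 * binary on a CPU without AVX2 typically dies with "Illegal instruction". */
#include <immintrin.h>
#include <stdio.h>

int
main(int argc, char **argv)
{
	__m256i a, b;
	int out[8];

	(void)argv;
	a = _mm256_set1_epi32(argc);	/* runtime value: not constant-folded */
	b = _mm256_add_epi32(a, a);	/* VPADDD on ymm: an AVX2 instruction */
	_mm256_storeu_si256((__m256i *)out, b);
	printf("%d\n", out[0]);
	return 0;
}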

^ permalink raw reply	[flat|nested] 48+ messages in thread

* Re: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization
  2015-01-19  1:53 [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization zhihong.wang
                   ` (5 preceding siblings ...)
  2015-01-25 14:50 ` Luke Gorrie
@ 2015-01-29  3:42 ` Fu, JingguoX
  6 siblings, 0 replies; 48+ messages in thread
From: Fu, JingguoX @ 2015-01-29  3:42 UTC (permalink / raw)
  To: Wang, Zhihong, dev

Basic Information

        Patch name        DPDK memcpy optimization
        Brief description about test purpose    Verify memory copy and memory copy performance cases on a variety of OSes
        Test Flag         Tested-by
        Tester name       jingguox.fu at intel.com

        Test Tool Chain information     N/A
        Commit ID         88fa98a60b34812bfed92e5b2706fcf7e1cbcbc8
        Test Result Summary     Total 6 cases, 6 passed, 0 failed

Test environment

        -   Environment 1:
            OS: Ubuntu12.04 3.2.0-23-generic X86_64
            GCC: gcc version 4.6.3
            CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
            NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)

        -   Environment 2: 
            OS: Ubuntu14.04 3.13.0-24-generic
            GCC: gcc version 4.8.2
            CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
            NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)

        -   Environment 3:
            OS: Fedora18 3.6.10-4.fc18.x86_64
            GCC: gcc version 4.7.2 20121109
            CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
            NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)


Detailed Testing Information

        Test Case - name    test_memcpy
        Test Case - Description
                  Create two buffers, and initialise one with random values. These are copied
                  to the second buffer and then compared to see if the copy was successful. The
                  bytes outside the copied area are also checked to make sure they were not changed.
        Test Case - test sample/application
                  test application in app/test
        Test Case - command / instruction
                  # ./app/test/test -n 1 -c ffff
                  #RTE>> memcpy_autotest
        Test Case - expected
                  #RTE>> Test OK
        Test Result - PASSED

        Test Case - name    test_memcpy_perf
        Test Case - Description
                  Measure copy performance across a number of different sizes and cached/uncached permutations
        Test Case - test sample/application
                  test application in app/test
        Test Case - command / instruction
                  # ./app/test/test -n 1 -c ffff
                  #RTE>> memcpy_perf_autotest
        Test Case - expected
                  #RTE>> Test OK
        Test Result - PASSED


-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of zhihong.wang@intel.com
Sent: Monday, January 19, 2015 09:54
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization

This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
It also extends memcpy test coverage with unaligned cases and more test points.

Optimization techniques are summarized below:

1. Utilize full cache bandwidth

2. Enforce aligned stores

3. Apply load address alignment based on architecture features

4. Make load/store address available as early as possible

5. General optimization techniques like inlining, branch reducing, prefetch pattern access

Zhihong Wang (4):
  Disabled VTA for memcpy test in app/test/Makefile
  Removed unnecessary test cases in test_memcpy.c
  Extended test coverage in test_memcpy_perf.c
  Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX
    platforms

 app/test/Makefile                                  |   6 +
 app/test/test_memcpy.c                             |  52 +-
 app/test/test_memcpy_perf.c                        | 238 +++++---
 .../common/include/arch/x86/rte_memcpy.h           | 664 +++++++++++++++------
 4 files changed, 656 insertions(+), 304 deletions(-)

-- 
1.9.3

^ permalink raw reply	[flat|nested] 48+ messages in thread

end of thread, other threads:[~2015-01-29  3:42 UTC | newest]

Thread overview: 48+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-19  1:53 [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization zhihong.wang
2015-01-19  1:53 ` [dpdk-dev] [PATCH 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile zhihong.wang
2015-01-19  1:53 ` [dpdk-dev] [PATCH 2/4] app/test: Removed unnecessary test cases in test_memcpy.c zhihong.wang
2015-01-19  1:53 ` [dpdk-dev] [PATCH 3/4] app/test: Extended test coverage in test_memcpy_perf.c zhihong.wang
2015-01-19  1:53 ` [dpdk-dev] [PATCH 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms zhihong.wang
2015-01-20 17:15   ` Stephen Hemminger
2015-01-20 19:16     ` Neil Horman
2015-01-21  3:18       ` Wang, Zhihong
2015-01-25 20:02     ` Jim Thompson
2015-01-26 14:43   ` Wodkowski, PawelX
2015-01-27  5:12     ` Wang, Zhihong
2015-01-19 13:02 ` [dpdk-dev] [PATCH 0/4] DPDK memcpy optimization Neil Horman
2015-01-20  3:01   ` Wang, Zhihong
2015-01-20 15:11     ` Neil Horman
2015-01-20 16:14       ` Bruce Richardson
2015-01-21  3:44         ` Wang, Zhihong
2015-01-21 11:40           ` Bruce Richardson
2015-01-21 12:02           ` Ananyev, Konstantin
2015-01-21 12:38             ` Neil Horman
2015-01-23  3:26               ` Wang, Zhihong
2015-01-21 12:36           ` Marc Sune
2015-01-21 13:02             ` Bruce Richardson
2015-01-21 13:21               ` Marc Sune
2015-01-21 13:26                 ` Bruce Richardson
2015-01-21 19:49                   ` Stephen Hemminger
2015-01-21 20:54                     ` Neil Horman
2015-01-21 21:25                       ` Jim Thompson
2015-01-22  0:53                         ` Stephen Hemminger
2015-01-22  9:06                         ` Luke Gorrie
2015-01-22 13:29                           ` Jay Rolette
2015-01-22 18:27                             ` Luke Gorrie
2015-01-22 19:36                               ` Jay Rolette
2015-01-22 18:21                       ` EDMISON, Kelvin (Kelvin)
2015-01-27  8:22                         ` Wang, Zhihong
2015-01-28 21:48                           ` EDMISON, Kelvin (Kelvin)
2015-01-29  1:53                             ` Wang, Zhihong
2015-01-23  6:52                   ` Wang, Zhihong
2015-01-26 18:29                     ` Ananyev, Konstantin
2015-01-27  1:42                       ` Wang, Zhihong
2015-01-27 11:30                         ` Ananyev, Konstantin
2015-01-27 12:19                           ` Ananyev, Konstantin
2015-01-28  2:06                             ` Wang, Zhihong
2015-01-25 14:50 ` Luke Gorrie
2015-01-26  1:30   ` Wang, Zhihong
2015-01-26  8:03     ` Luke Gorrie
2015-01-27  7:19       ` Wang, Zhihong
2015-01-27 13:57         ` [dpdk-dev] [snabb-devel] " Luke Gorrie
2015-01-29  3:42 ` [dpdk-dev] " Fu, JingguoX
