* [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
@ 2015-01-29 2:38 Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang
` (6 more replies)
0 siblings, 7 replies; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw)
To: dev
This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
It also extends memcpy test coverage with unaligned cases and more test points.
Optimization techniques are summarized below:
1. Utilize full cache bandwidth
2. Enforce aligned stores (a minimal sketch follows this list)
3. Apply load address alignment based on architecture features
4. Make load/store addresses available as early as possible
5. General optimization techniques such as inlining, branch reduction, and prefetch-friendly access patterns
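Below is a minimal, illustrative C sketch of technique 2 (enforce aligned stores). It is not code from the patch; the function name copy_align_store and its structure are assumptions for illustration only. It copies an unaligned 16-byte head, advances to the next 16-byte boundary of the destination, then runs the bulk loop with unaligned loads and aligned stores, which is the same idea the patch applies with 32-byte AVX2 stores and PALIGNR-based load alignment on SSE.
#include <stdint.h>
#include <string.h>
#include <emmintrin.h>
/* Sketch only: align the store side, tolerate unaligned loads (SSE2). */
static void
copy_align_store(uint8_t *dst, const uint8_t *src, size_t n)
{
	if (n < 16) {                       /* tiny copies: plain memcpy */
		memcpy(dst, src, n);
		return;
	}
	/* Head: one unaligned 16-byte move, then jump to the next 16-byte
	 * boundary of dst (re-copying a few bytes is harmless). */
	_mm_storeu_si128((__m128i *)dst, _mm_loadu_si128((const __m128i *)src));
	size_t off = 16 - ((uintptr_t)dst & 0x0F);
	dst += off;
	src += off;
	n -= off;
	/* Bulk: unaligned loads, aligned stores. */
	while (n >= 16) {
		_mm_store_si128((__m128i *)dst,
				_mm_loadu_si128((const __m128i *)src));
		dst += 16;
		src += 16;
		n -= 16;
	}
	if (n != 0)                         /* tail */
		memcpy(dst, src, n);
}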
--------------
Changes in v2:
1. Reduced the number of constant-size test cases in app/test/test_memcpy_perf.c for faster builds
2. Modified macro definitions for better code readability & safety
Zhihong Wang (4):
app/test: Disabled VTA for memcpy test in app/test/Makefile
app/test: Removed unnecessary test cases in app/test/test_memcpy.c
app/test: Extended test coverage in app/test/test_memcpy_perf.c
lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
and AVX platforms
app/test/Makefile | 6 +
app/test/test_memcpy.c | 52 +-
app/test/test_memcpy_perf.c | 220 ++++---
.../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++------
4 files changed, 654 insertions(+), 304 deletions(-)
--
1.9.3
* [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile
2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
@ 2015-01-29 2:38 ` Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c Zhihong Wang
` (5 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw)
To: dev
VTA (variable tracking assignments) is for debugging only; it increases compile time and binary size, especially when there are many inline functions.
Disable it for the memcpy tests, since they contain a large number of inline calls.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
app/test/Makefile | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/app/test/Makefile b/app/test/Makefile
index 4311f96..94dbadf 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -143,6 +143,12 @@ CFLAGS_test_kni.o += -Wno-deprecated-declarations
endif
CFLAGS += -D_GNU_SOURCE
+# Disable VTA for memcpy test
+ifeq ($(CC), gcc)
+CFLAGS_test_memcpy.o += -fno-var-tracking-assignments
+CFLAGS_test_memcpy_perf.o += -fno-var-tracking-assignments
+endif
+
# this application needs libraries first
DEPDIRS-y += lib
--
1.9.3
* [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c
2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang
@ 2015-01-29 2:38 ` Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c Zhihong Wang
` (4 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw)
To: dev
Removed the dedicated test cases for the base move functions, since func_test() already covers them all (a sketch of that coverage follows).
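For reference, a simplified sketch (assumed, not copied from app/test/test_memcpy.c) of the kind of check func_test() performs for each size of interest; because it already exercises the rte_mov* helpers through rte_memcpy(), the fixed-size base_func_test() cases add no extra coverage:
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <rte_memcpy.h>
/* Sketch only: copy 'size' random bytes and verify (size <= 256 here;
 * the real test uses rte_rand() and also checks the surrounding bytes). */
static int
copy_and_verify(size_t size)
{
	uint8_t src[256], dst[256];
	size_t i;

	for (i = 0; i < size; i++)
		src[i] = (uint8_t)rand();
	memset(dst, 0, sizeof(dst));
	rte_memcpy(dst, src, size);
	return memcmp(dst, src, size) == 0 ? 0 : -1;
}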
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
app/test/test_memcpy.c | 52 +-------------------------------------------------
1 file changed, 1 insertion(+), 51 deletions(-)
diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c
index 56b8e1e..b2bb4e0 100644
--- a/app/test/test_memcpy.c
+++ b/app/test/test_memcpy.c
@@ -78,56 +78,9 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
#define TEST_BATCH_SIZE 100
/* Data is aligned on this many bytes (power of 2) */
-#define ALIGNMENT_UNIT 16
+#define ALIGNMENT_UNIT 32
-
-/* Structure with base memcpy func pointer, and number of bytes it copies */
-struct base_memcpy_func {
- void (*func)(uint8_t *dst, const uint8_t *src);
- unsigned size;
-};
-
-/* To create base_memcpy_func structure entries */
-#define BASE_FUNC(n) {rte_mov##n, n}
-
-/* Max number of bytes that can be copies with a "base" memcpy functions */
-#define MAX_BASE_FUNC_SIZE 256
-
-/*
- * Test the "base" memcpy functions, that a copy fixed number of bytes.
- */
-static int
-base_func_test(void)
-{
- const struct base_memcpy_func base_memcpy_funcs[6] = {
- BASE_FUNC(16),
- BASE_FUNC(32),
- BASE_FUNC(48),
- BASE_FUNC(64),
- BASE_FUNC(128),
- BASE_FUNC(256),
- };
- unsigned i, j;
- unsigned num_funcs = sizeof(base_memcpy_funcs) / sizeof(base_memcpy_funcs[0]);
- uint8_t dst[MAX_BASE_FUNC_SIZE];
- uint8_t src[MAX_BASE_FUNC_SIZE];
-
- for (i = 0; i < num_funcs; i++) {
- unsigned size = base_memcpy_funcs[i].size;
- for (j = 0; j < size; j++) {
- dst[j] = 0;
- src[j] = (uint8_t) rte_rand();
- }
- base_memcpy_funcs[i].func(dst, src);
- for (j = 0; j < size; j++)
- if (dst[j] != src[j])
- return -1;
- }
-
- return 0;
-}
-
/*
* Create two buffers, and initialise one with random values. These are copied
* to the second buffer and then compared to see if the copy was successful.
@@ -218,9 +171,6 @@ test_memcpy(void)
ret = func_test();
if (ret != 0)
return -1;
- ret = base_func_test();
- if (ret != 0)
- return -1;
return 0;
}
--
1.9.3
* [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c
2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c Zhihong Wang
@ 2015-01-29 2:38 ` Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
` (3 subsequent siblings)
6 siblings, 0 replies; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw)
To: dev
Main code changes:
1. Added more typical data points for a more thorough performance test
2. Added unaligned test cases, since unaligned copies are common in DPDK usage (see the sketch after this list)
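For reference, a hypothetical standalone equivalent of one unaligned data point (illustrative only, requires an initialized EAL; the offsets 1 and 5 mirror the dst_uoffset/src_uoffset values used in the macros below):
#include <stdint.h>
#include <rte_malloc.h>
#include <rte_memcpy.h>
/* Sketch only: 32-byte aligned buffers with slack, copy issued from
 * deliberately misaligned start addresses. */
static void
one_unaligned_copy(size_t len)
{
	uint8_t *dst = rte_malloc("memcpy", len + 32, 32);
	uint8_t *src = rte_malloc("memcpy", len + 32, 32);

	if (dst != NULL && src != NULL)
		rte_memcpy(dst + 1, src + 5, len);

	rte_free(dst);
	rte_free(src);
}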
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
app/test/test_memcpy_perf.c | 220 +++++++++++++++++++++++++++-----------------
1 file changed, 138 insertions(+), 82 deletions(-)
diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 7809610..754828e 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -54,9 +54,10 @@
/* List of buffer sizes to test */
#if TEST_VALUE_RANGE == 0
static size_t buf_sizes[] = {
- 0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255,
- 256, 257, 320, 384, 511, 512, 513, 1023, 1024, 1025, 1518, 1522, 1600,
- 2048, 3072, 4096, 5120, 6144, 7168, 8192
+ 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128,
+ 129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448,
+ 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600,
+ 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192
};
/* MUST be as large as largest packet size above */
#define SMALL_BUFFER_SIZE 8192
@@ -78,7 +79,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
#define TEST_BATCH_SIZE 100
/* Data is aligned on this many bytes (power of 2) */
-#define ALIGNMENT_UNIT 16
+#define ALIGNMENT_UNIT 32
/*
* Pointers used in performance tests. The two large buffers are for uncached
@@ -94,19 +95,19 @@ init_buffers(void)
{
unsigned i;
- large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE, ALIGNMENT_UNIT);
+ large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
if (large_buf_read == NULL)
goto error_large_buf_read;
- large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE, ALIGNMENT_UNIT);
+ large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
if (large_buf_write == NULL)
goto error_large_buf_write;
- small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE, ALIGNMENT_UNIT);
+ small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
if (small_buf_read == NULL)
goto error_small_buf_read;
- small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE, ALIGNMENT_UNIT);
+ small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT);
if (small_buf_write == NULL)
goto error_small_buf_write;
@@ -140,25 +141,25 @@ free_buffers(void)
/*
* Get a random offset into large array, with enough space needed to perform
- * max copy size. Offset is aligned.
+ * max copy size. Offset is aligned, uoffset is used for unalignment setting.
*/
static inline size_t
-get_rand_offset(void)
+get_rand_offset(size_t uoffset)
{
- return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
- ~(ALIGNMENT_UNIT - 1));
+ return (((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) &
+ ~(ALIGNMENT_UNIT - 1)) + uoffset);
}
/* Fill in source and destination addresses. */
static inline void
-fill_addr_arrays(size_t *dst_addr, int is_dst_cached,
- size_t *src_addr, int is_src_cached)
+fill_addr_arrays(size_t *dst_addr, int is_dst_cached, size_t dst_uoffset,
+ size_t *src_addr, int is_src_cached, size_t src_uoffset)
{
unsigned int i;
for (i = 0; i < TEST_BATCH_SIZE; i++) {
- dst_addr[i] = (is_dst_cached) ? 0 : get_rand_offset();
- src_addr[i] = (is_src_cached) ? 0 : get_rand_offset();
+ dst_addr[i] = (is_dst_cached) ? dst_uoffset : get_rand_offset(dst_uoffset);
+ src_addr[i] = (is_src_cached) ? src_uoffset : get_rand_offset(src_uoffset);
}
}
@@ -169,16 +170,17 @@ fill_addr_arrays(size_t *dst_addr, int is_dst_cached,
*/
static void
do_uncached_write(uint8_t *dst, int is_dst_cached,
- const uint8_t *src, int is_src_cached, size_t size)
+ const uint8_t *src, int is_src_cached, size_t size)
{
unsigned i, j;
size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE];
for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) {
- fill_addr_arrays(dst_addrs, is_dst_cached,
- src_addrs, is_src_cached);
- for (j = 0; j < TEST_BATCH_SIZE; j++)
+ fill_addr_arrays(dst_addrs, is_dst_cached, 0,
+ src_addrs, is_src_cached, 0);
+ for (j = 0; j < TEST_BATCH_SIZE; j++) {
rte_memcpy(dst+dst_addrs[j], src+src_addrs[j], size);
+ }
}
}
@@ -186,52 +188,111 @@ do_uncached_write(uint8_t *dst, int is_dst_cached,
* Run a single memcpy performance test. This is a macro to ensure that if
* the "size" parameter is a constant it won't be converted to a variable.
*/
-#define SINGLE_PERF_TEST(dst, is_dst_cached, src, is_src_cached, size) do { \
- unsigned int iter, t; \
- size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE]; \
- uint64_t start_time, total_time = 0; \
- uint64_t total_time2 = 0; \
- for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) { \
- fill_addr_arrays(dst_addrs, is_dst_cached, \
- src_addrs, is_src_cached); \
- start_time = rte_rdtsc(); \
- for (t = 0; t < TEST_BATCH_SIZE; t++) \
- rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \
- total_time += rte_rdtsc() - start_time; \
- } \
- for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) { \
- fill_addr_arrays(dst_addrs, is_dst_cached, \
- src_addrs, is_src_cached); \
- start_time = rte_rdtsc(); \
- for (t = 0; t < TEST_BATCH_SIZE; t++) \
- memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \
- total_time2 += rte_rdtsc() - start_time; \
- } \
- printf("%8.0f -", (double)total_time /TEST_ITERATIONS); \
- printf("%5.0f", (double)total_time2 / TEST_ITERATIONS); \
+#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset, \
+ src, is_src_cached, src_uoffset, size) \
+do { \
+ unsigned int iter, t; \
+ size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE]; \
+ uint64_t start_time, total_time = 0; \
+ uint64_t total_time2 = 0; \
+ for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) { \
+ fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset, \
+ src_addrs, is_src_cached, src_uoffset); \
+ start_time = rte_rdtsc(); \
+ for (t = 0; t < TEST_BATCH_SIZE; t++) \
+ rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \
+ total_time += rte_rdtsc() - start_time; \
+ } \
+ for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) { \
+ fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset, \
+ src_addrs, is_src_cached, src_uoffset); \
+ start_time = rte_rdtsc(); \
+ for (t = 0; t < TEST_BATCH_SIZE; t++) \
+ memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \
+ total_time2 += rte_rdtsc() - start_time; \
+ } \
+ printf("%8.0f -", (double)total_time /TEST_ITERATIONS); \
+ printf("%5.0f", (double)total_time2 / TEST_ITERATIONS); \
} while (0)
-/* Run memcpy() tests for each cached/uncached permutation. */
-#define ALL_PERF_TESTS_FOR_SIZE(n) do { \
- if (__builtin_constant_p(n)) \
- printf("\nC%6u", (unsigned)n); \
- else \
- printf("\n%7u", (unsigned)n); \
- SINGLE_PERF_TEST(small_buf_write, 1, small_buf_read, 1, n); \
- SINGLE_PERF_TEST(large_buf_write, 0, small_buf_read, 1, n); \
- SINGLE_PERF_TEST(small_buf_write, 1, large_buf_read, 0, n); \
- SINGLE_PERF_TEST(large_buf_write, 0, large_buf_read, 0, n); \
+/* Run aligned memcpy tests for each cached/uncached permutation */
+#define ALL_PERF_TESTS_FOR_SIZE(n) \
+do { \
+ if (__builtin_constant_p(n)) \
+ printf("\nC%6u", (unsigned)n); \
+ else \
+ printf("\n%7u", (unsigned)n); \
+ SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n); \
+ SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n); \
+ SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n); \
+ SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n); \
} while (0)
-/*
- * Run performance tests for a number of different sizes and cached/uncached
- * permutations.
- */
+/* Run unaligned memcpy tests for each cached/uncached permutation */
+#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n) \
+do { \
+ if (__builtin_constant_p(n)) \
+ printf("\nC%6u", (unsigned)n); \
+ else \
+ printf("\n%7u", (unsigned)n); \
+ SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n); \
+ SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n); \
+ SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n); \
+ SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n); \
+} while (0)
+
+/* Run memcpy tests for constant length */
+#define ALL_PERF_TEST_FOR_CONSTANT \
+do { \
+ TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U); \
+ TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U); \
+ TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U); \
+} while (0)
+
+/* Run all memcpy tests for aligned constant cases */
+static inline void
+perf_test_constant_aligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE
+ ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memcpy tests for unaligned constant cases */
+static inline void
+perf_test_constant_unaligned(void)
+{
+#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE_UNALIGNED
+ ALL_PERF_TEST_FOR_CONSTANT;
+#undef TEST_CONSTANT
+}
+
+/* Run all memcpy tests for aligned variable cases */
+static inline void
+perf_test_variable_aligned(void)
+{
+ unsigned n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+ unsigned i;
+ for (i = 0; i < n; i++) {
+ ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
+ }
+}
+
+/* Run all memcpy tests for unaligned variable cases */
+static inline void
+perf_test_variable_unaligned(void)
+{
+ unsigned n = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
+ unsigned i;
+ for (i = 0; i < n; i++) {
+ ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]);
+ }
+}
+
+/* Run all memcpy tests */
static int
perf_test(void)
{
- const unsigned num_buf_sizes = sizeof(buf_sizes) / sizeof(buf_sizes[0]);
- unsigned i;
int ret;
ret = init_buffers();
@@ -239,7 +300,8 @@ perf_test(void)
return ret;
#if TEST_VALUE_RANGE != 0
- /* Setup buf_sizes array, if required */
+ /* Set up buf_sizes array, if required */
+ unsigned i;
for (i = 0; i < TEST_VALUE_RANGE; i++)
buf_sizes[i] = i;
#endif
@@ -248,28 +310,23 @@ perf_test(void)
do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE);
printf("\n** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **\n"
- "======= ============== ============== ============== ==============\n"
- " Size Cache to cache Cache to mem Mem to cache Mem to mem\n"
- "(bytes) (ticks) (ticks) (ticks) (ticks)\n"
- "------- -------------- -------------- -------------- --------------");
-
- /* Do tests where size is a variable */
- for (i = 0; i < num_buf_sizes; i++) {
- ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]);
- }
+ "======= ============== ============== ============== ==============\n"
+ " Size Cache to cache Cache to mem Mem to cache Mem to mem\n"
+ "(bytes) (ticks) (ticks) (ticks) (ticks)\n"
+ "------- -------------- -------------- -------------- --------------");
+
+ printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT);
+ /* Do aligned tests where size is a variable */
+ perf_test_variable_aligned();
printf("\n------- -------------- -------------- -------------- --------------");
- /* Do tests where size is a compile-time constant */
- ALL_PERF_TESTS_FOR_SIZE(63U);
- ALL_PERF_TESTS_FOR_SIZE(64U);
- ALL_PERF_TESTS_FOR_SIZE(65U);
- ALL_PERF_TESTS_FOR_SIZE(255U);
- ALL_PERF_TESTS_FOR_SIZE(256U);
- ALL_PERF_TESTS_FOR_SIZE(257U);
- ALL_PERF_TESTS_FOR_SIZE(1023U);
- ALL_PERF_TESTS_FOR_SIZE(1024U);
- ALL_PERF_TESTS_FOR_SIZE(1025U);
- ALL_PERF_TESTS_FOR_SIZE(1518U);
-
+ /* Do aligned tests where size is a compile-time constant */
+ perf_test_constant_aligned();
+ printf("\n=========================== Unaligned =============================");
+ /* Do unaligned tests where size is a variable */
+ perf_test_variable_unaligned();
+ printf("\n------- -------------- -------------- -------------- --------------");
+ /* Do unaligned tests where size is a compile-time constant */
+ perf_test_constant_unaligned();
printf("\n======= ============== ============== ============== ==============\n\n");
free_buffers();
@@ -277,7 +334,6 @@ perf_test(void)
return 0;
}
-
static int
test_memcpy_perf(void)
{
--
1.9.3
* [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
` (2 preceding siblings ...)
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c Zhihong Wang
@ 2015-01-29 2:38 ` Zhihong Wang
2015-01-29 15:17 ` Ananyev, Konstantin
2015-01-29 6:16 ` [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Fu, JingguoX
` (2 subsequent siblings)
6 siblings, 1 reply; 12+ messages in thread
From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw)
To: dev
Main code changes:
1. Differentiate architectural features based on CPU flags
a. Implement separate move functions for SSE/AVX/AVX2 to make full use of cache bandwidth
b. Implement a separate copy flow specifically optimized for the target architecture
2. Rewrite the memcpy function "rte_memcpy"
a. Add store aligning
b. Add load aligning based on architectural features
c. Put block copy loop into inline move functions for better control of instruction order
d. Eliminate unnecessary MOVs
3. Rewrite the inline move functions
a. Add move functions for unaligned load cases (see the sketch after this list)
b. Change instruction order in copy loops for better pipeline utilization
c. Use intrinsics instead of assembly code
4. Remove slow glibc call for constant copies
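A hedged sketch of the idea behind item 3.a and the MOVEUNALIGNED_LEFT47 macro in the diff below: once the store side is 16-byte aligned, each unaligned 16-byte source chunk is rebuilt from two aligned loads with PALIGNR (SSSE3). The function name and the fixed offset of 4 are illustrative assumptions; the actual code dispatches offsets 1 to 15 through a switch because the intrinsic needs an immediate shift count, and it relies on <offset> bytes before src being readable.
#include <stdint.h>
#include <tmmintrin.h>   /* _mm_alignr_epi8 (SSSE3) */
/* Sketch only: copy 32 bytes whose source is exactly 4 bytes past a
 * 16-byte boundary; dst must be 16-byte aligned. */
static void
copy32_srcoff4(uint8_t *dst, const uint8_t *src)
{
	const __m128i *s = (const __m128i *)(src - 4);   /* aligned base */
	__m128i a = _mm_load_si128(s + 0);
	__m128i b = _mm_load_si128(s + 1);
	__m128i c = _mm_load_si128(s + 2);

	/* (b:a) >> 4 bytes = src[0..15], (c:b) >> 4 bytes = src[16..31] */
	_mm_store_si128((__m128i *)dst + 0, _mm_alignr_epi8(b, a, 4));
	_mm_store_si128((__m128i *)dst + 1, _mm_alignr_epi8(c, b, 4));
}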
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
.../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++------
1 file changed, 509 insertions(+), 171 deletions(-)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index fb9eba8..7b2d382 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -34,166 +34,189 @@
#ifndef _RTE_MEMCPY_X86_64_H_
#define _RTE_MEMCPY_X86_64_H_
+/**
+ * @file
+ *
+ * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ */
+
+#include <stdio.h>
#include <stdint.h>
#include <string.h>
-#include <emmintrin.h>
+#include <x86intrin.h>
#ifdef __cplusplus
extern "C" {
#endif
-#include "generic/rte_memcpy.h"
+/**
+ * Copy bytes from one location to another. The locations must not overlap.
+ *
+ * @note This is implemented as a macro, so it's address should not be taken
+ * and care is needed as parameter expressions may be evaluated multiple times.
+ *
+ * @param dst
+ * Pointer to the destination of the data.
+ * @param src
+ * Pointer to the source data.
+ * @param n
+ * Number of bytes to copy.
+ * @return
+ * Pointer to the destination data.
+ */
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
-#ifdef __INTEL_COMPILER
-#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */
-#endif
+#ifdef RTE_MACHINE_CPUFLAG_AVX2
+/**
+ * AVX2 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
static inline void
rte_mov16(uint8_t *dst, const uint8_t *src)
{
- __m128i reg_a;
- asm volatile (
- "movdqu (%[src]), %[reg_a]\n\t"
- "movdqu %[reg_a], (%[dst])\n\t"
- : [reg_a] "=x" (reg_a)
- : [src] "r" (src),
- [dst] "r"(dst)
- : "memory"
- );
+ __m128i xmm0;
+
+ xmm0 = _mm_loadu_si128((const __m128i *)src);
+ _mm_storeu_si128((__m128i *)dst, xmm0);
}
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
static inline void
rte_mov32(uint8_t *dst, const uint8_t *src)
{
- __m128i reg_a, reg_b;
- asm volatile (
- "movdqu (%[src]), %[reg_a]\n\t"
- "movdqu 16(%[src]), %[reg_b]\n\t"
- "movdqu %[reg_a], (%[dst])\n\t"
- "movdqu %[reg_b], 16(%[dst])\n\t"
- : [reg_a] "=x" (reg_a),
- [reg_b] "=x" (reg_b)
- : [src] "r" (src),
- [dst] "r"(dst)
- : "memory"
- );
-}
+ __m256i ymm0;
-static inline void
-rte_mov48(uint8_t *dst, const uint8_t *src)
-{
- __m128i reg_a, reg_b, reg_c;
- asm volatile (
- "movdqu (%[src]), %[reg_a]\n\t"
- "movdqu 16(%[src]), %[reg_b]\n\t"
- "movdqu 32(%[src]), %[reg_c]\n\t"
- "movdqu %[reg_a], (%[dst])\n\t"
- "movdqu %[reg_b], 16(%[dst])\n\t"
- "movdqu %[reg_c], 32(%[dst])\n\t"
- : [reg_a] "=x" (reg_a),
- [reg_b] "=x" (reg_b),
- [reg_c] "=x" (reg_c)
- : [src] "r" (src),
- [dst] "r"(dst)
- : "memory"
- );
+ ymm0 = _mm256_loadu_si256((const __m256i *)src);
+ _mm256_storeu_si256((__m256i *)dst, ymm0);
}
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
static inline void
rte_mov64(uint8_t *dst, const uint8_t *src)
{
- __m128i reg_a, reg_b, reg_c, reg_d;
- asm volatile (
- "movdqu (%[src]), %[reg_a]\n\t"
- "movdqu 16(%[src]), %[reg_b]\n\t"
- "movdqu 32(%[src]), %[reg_c]\n\t"
- "movdqu 48(%[src]), %[reg_d]\n\t"
- "movdqu %[reg_a], (%[dst])\n\t"
- "movdqu %[reg_b], 16(%[dst])\n\t"
- "movdqu %[reg_c], 32(%[dst])\n\t"
- "movdqu %[reg_d], 48(%[dst])\n\t"
- : [reg_a] "=x" (reg_a),
- [reg_b] "=x" (reg_b),
- [reg_c] "=x" (reg_c),
- [reg_d] "=x" (reg_d)
- : [src] "r" (src),
- [dst] "r"(dst)
- : "memory"
- );
+ rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+ rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
}
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
static inline void
rte_mov128(uint8_t *dst, const uint8_t *src)
{
- __m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h;
- asm volatile (
- "movdqu (%[src]), %[reg_a]\n\t"
- "movdqu 16(%[src]), %[reg_b]\n\t"
- "movdqu 32(%[src]), %[reg_c]\n\t"
- "movdqu 48(%[src]), %[reg_d]\n\t"
- "movdqu 64(%[src]), %[reg_e]\n\t"
- "movdqu 80(%[src]), %[reg_f]\n\t"
- "movdqu 96(%[src]), %[reg_g]\n\t"
- "movdqu 112(%[src]), %[reg_h]\n\t"
- "movdqu %[reg_a], (%[dst])\n\t"
- "movdqu %[reg_b], 16(%[dst])\n\t"
- "movdqu %[reg_c], 32(%[dst])\n\t"
- "movdqu %[reg_d], 48(%[dst])\n\t"
- "movdqu %[reg_e], 64(%[dst])\n\t"
- "movdqu %[reg_f], 80(%[dst])\n\t"
- "movdqu %[reg_g], 96(%[dst])\n\t"
- "movdqu %[reg_h], 112(%[dst])\n\t"
- : [reg_a] "=x" (reg_a),
- [reg_b] "=x" (reg_b),
- [reg_c] "=x" (reg_c),
- [reg_d] "=x" (reg_d),
- [reg_e] "=x" (reg_e),
- [reg_f] "=x" (reg_f),
- [reg_g] "=x" (reg_g),
- [reg_h] "=x" (reg_h)
- : [src] "r" (src),
- [dst] "r"(dst)
- : "memory"
- );
+ rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+ rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+ rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+ rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
}
-#ifdef __INTEL_COMPILER
-#pragma warning(enable:593)
-#endif
-
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
static inline void
rte_mov256(uint8_t *dst, const uint8_t *src)
{
- rte_mov128(dst, src);
- rte_mov128(dst + 128, src + 128);
+ rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
+ rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
+ rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
+ rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
+ rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
+ rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
+ rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
+ rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
}
-#define rte_memcpy(dst, src, n) \
- ({ (__builtin_constant_p(n)) ? \
- memcpy((dst), (src), (n)) : \
- rte_memcpy_func((dst), (src), (n)); })
+/**
+ * Copy 64-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+ __m256i ymm0, ymm1;
+
+ while (n >= 64) {
+ ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
+ n -= 64;
+ ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
+ src = (const uint8_t *)src + 64;
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+ dst = (uint8_t *)dst + 64;
+ }
+}
+
+/**
+ * Copy 256-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+ __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
+
+ while (n >= 256) {
+ ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
+ n -= 256;
+ ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
+ ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
+ ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
+ ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32));
+ ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32));
+ ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32));
+ ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32));
+ src = (const uint8_t *)src + 256;
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4);
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5);
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6);
+ _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7);
+ dst = (uint8_t *)dst + 256;
+ }
+}
static inline void *
-rte_memcpy_func(void *dst, const void *src, size_t n)
+rte_memcpy(void *dst, const void *src, size_t n)
{
void *ret = dst;
+ int dstofss;
+ int bits;
- /* We can't copy < 16 bytes using XMM registers so do it manually. */
+ /**
+ * Copy less than 16 bytes
+ */
if (n < 16) {
if (n & 0x01) {
*(uint8_t *)dst = *(const uint8_t *)src;
- dst = (uint8_t *)dst + 1;
src = (const uint8_t *)src + 1;
+ dst = (uint8_t *)dst + 1;
}
if (n & 0x02) {
*(uint16_t *)dst = *(const uint16_t *)src;
- dst = (uint16_t *)dst + 1;
src = (const uint16_t *)src + 1;
+ dst = (uint16_t *)dst + 1;
}
if (n & 0x04) {
*(uint32_t *)dst = *(const uint32_t *)src;
- dst = (uint32_t *)dst + 1;
src = (const uint32_t *)src + 1;
+ dst = (uint32_t *)dst + 1;
}
if (n & 0x08) {
*(uint64_t *)dst = *(const uint64_t *)src;
@@ -201,95 +224,410 @@ rte_memcpy_func(void *dst, const void *src, size_t n)
return ret;
}
- /* Special fast cases for <= 128 bytes */
+ /**
+ * Fast way when copy size doesn't exceed 512 bytes
+ */
if (n <= 32) {
rte_mov16((uint8_t *)dst, (const uint8_t *)src);
rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
return ret;
}
-
if (n <= 64) {
rte_mov32((uint8_t *)dst, (const uint8_t *)src);
rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
return ret;
}
-
- if (n <= 128) {
- rte_mov64((uint8_t *)dst, (const uint8_t *)src);
- rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);
+ if (n <= 512) {
+ if (n >= 256) {
+ n -= 256;
+ rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 256;
+ dst = (uint8_t *)dst + 256;
+ }
+ if (n >= 128) {
+ n -= 128;
+ rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 128;
+ dst = (uint8_t *)dst + 128;
+ }
+ if (n >= 64) {
+ n -= 64;
+ rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 64;
+ dst = (uint8_t *)dst + 64;
+ }
+COPY_BLOCK_64_BACK31:
+ if (n > 32) {
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+ return ret;
+ }
+ if (n > 0) {
+ rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
+ }
return ret;
}
- /*
- * For large copies > 128 bytes. This combination of 256, 64 and 16 byte
- * copies was found to be faster than doing 128 and 32 byte copies as
- * well.
+ /**
+ * Make store aligned when copy size exceeds 512 bytes
*/
- for ( ; n >= 256; n -= 256) {
- rte_mov256((uint8_t *)dst, (const uint8_t *)src);
- dst = (uint8_t *)dst + 256;
- src = (const uint8_t *)src + 256;
+ dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
+ n -= dstofss;
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + dstofss;
+ dst = (uint8_t *)dst + dstofss;
+
+ /**
+ * Copy 256-byte blocks.
+ * Use copy block function for better instruction order control,
+ * which is important when load is unaligned.
+ */
+ rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
+ bits = n;
+ n = n & 255;
+ bits -= n;
+ src = (const uint8_t *)src + bits;
+ dst = (uint8_t *)dst + bits;
+
+ /**
+ * Copy 64-byte blocks.
+ * Use copy block function for better instruction order control,
+ * which is important when load is unaligned.
+ */
+ if (n >= 64) {
+ rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
+ bits = n;
+ n = n & 63;
+ bits -= n;
+ src = (const uint8_t *)src + bits;
+ dst = (uint8_t *)dst + bits;
}
- /*
- * We split the remaining bytes (which will be less than 256) into
- * 64byte (2^6) chunks.
- * Using incrementing integers in the case labels of a switch statement
- * enourages the compiler to use a jump table. To get incrementing
- * integers, we shift the 2 relevant bits to the LSB position to first
- * get decrementing integers, and then subtract.
+ /**
+ * Copy whatever left
*/
- switch (3 - (n >> 6)) {
- case 0x00:
- rte_mov64((uint8_t *)dst, (const uint8_t *)src);
- n -= 64;
- dst = (uint8_t *)dst + 64;
- src = (const uint8_t *)src + 64; /* fallthrough */
- case 0x01:
- rte_mov64((uint8_t *)dst, (const uint8_t *)src);
- n -= 64;
- dst = (uint8_t *)dst + 64;
- src = (const uint8_t *)src + 64; /* fallthrough */
- case 0x02:
- rte_mov64((uint8_t *)dst, (const uint8_t *)src);
- n -= 64;
- dst = (uint8_t *)dst + 64;
- src = (const uint8_t *)src + 64; /* fallthrough */
- default:
- ;
+ goto COPY_BLOCK_64_BACK31;
+}
+
+#else /* RTE_MACHINE_CPUFLAG_AVX2 */
+
+/**
+ * SSE & AVX implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+ __m128i xmm0;
+
+ xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
+ _mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+ rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+ rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+ rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+ rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+ rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+ rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+ rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+ rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+ rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+ rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+ rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+ rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+ rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+ rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+ rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
+ rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
+ rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
+ rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
+ rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
+ rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
+ rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
+ rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
+ rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
+ rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
+ rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
+ rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
+ rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
+ rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
+ rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
+ rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
+}
+
+/**
+ * Macro for copying unaligned block from one location to another with constant load offset,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be immediate value within [1, 15]
+ * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset) \
+({ \
+ int tmp; \
+ while (len >= 128 + 16 - offset) { \
+ xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \
+ len -= 128; \
+ xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \
+ xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \
+ xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16)); \
+ xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16)); \
+ xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16)); \
+ xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16)); \
+ xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \
+ xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \
+ src = (const uint8_t *)src + 128; \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
+ dst = (uint8_t *)dst + 128; \
+ } \
+ tmp = len; \
+ len = ((len - 16 + offset) & 127) + 16 - offset; \
+ tmp -= len; \
+ src = (const uint8_t *)src + tmp; \
+ dst = (uint8_t *)dst + tmp; \
+ if (len >= 32 + 16 - offset) { \
+ while (len >= 32 + 16 - offset) { \
+ xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \
+ len -= 32; \
+ xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \
+ xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \
+ src = (const uint8_t *)src + 32; \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
+ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
+ dst = (uint8_t *)dst + 32; \
+ } \
+ tmp = len; \
+ len = ((len - 16 + offset) & 31) + 16 - offset; \
+ tmp -= len; \
+ src = (const uint8_t *)src + tmp; \
+ dst = (uint8_t *)dst + tmp; \
+ } \
+})
+
+/**
+ * Macro for copying unaligned block from one location to another,
+ * 47 bytes leftover maximum,
+ * locations should not overlap.
+ * Use switch here because the aligning instruction requires immediate value for shift count.
+ * Requirements:
+ * - Store is aligned
+ * - Load offset is <offset>, which must be within [1, 15]
+ * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
+ * - <dst>, <src>, <len> must be variables
+ * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
+ */
+#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \
+({ \
+ switch (offset) { \
+ case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break; \
+ case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break; \
+ case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break; \
+ case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break; \
+ case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break; \
+ case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break; \
+ case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break; \
+ case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break; \
+ case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break; \
+ case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break; \
+ case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break; \
+ case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break; \
+ case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break; \
+ case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break; \
+ case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break; \
+ default:; \
+ } \
+})
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+ __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
+ void *ret = dst;
+ int dstofss;
+ int srcofs;
+
+ /**
+ * Copy less than 16 bytes
+ */
+ if (n < 16) {
+ if (n & 0x01) {
+ *(uint8_t *)dst = *(const uint8_t *)src;
+ src = (const uint8_t *)src + 1;
+ dst = (uint8_t *)dst + 1;
+ }
+ if (n & 0x02) {
+ *(uint16_t *)dst = *(const uint16_t *)src;
+ src = (const uint16_t *)src + 1;
+ dst = (uint16_t *)dst + 1;
+ }
+ if (n & 0x04) {
+ *(uint32_t *)dst = *(const uint32_t *)src;
+ src = (const uint32_t *)src + 1;
+ dst = (uint32_t *)dst + 1;
+ }
+ if (n & 0x08) {
+ *(uint64_t *)dst = *(const uint64_t *)src;
+ }
+ return ret;
}
- /*
- * We split the remaining bytes (which will be less than 64) into
- * 16byte (2^4) chunks, using the same switch structure as above.
+ /**
+ * Fast way when copy size doesn't exceed 512 bytes
*/
- switch (3 - (n >> 4)) {
- case 0x00:
- rte_mov16((uint8_t *)dst, (const uint8_t *)src);
- n -= 16;
- dst = (uint8_t *)dst + 16;
- src = (const uint8_t *)src + 16; /* fallthrough */
- case 0x01:
- rte_mov16((uint8_t *)dst, (const uint8_t *)src);
- n -= 16;
- dst = (uint8_t *)dst + 16;
- src = (const uint8_t *)src + 16; /* fallthrough */
- case 0x02:
+ if (n <= 32) {
rte_mov16((uint8_t *)dst, (const uint8_t *)src);
- n -= 16;
- dst = (uint8_t *)dst + 16;
- src = (const uint8_t *)src + 16; /* fallthrough */
- default:
- ;
+ rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+ return ret;
}
-
- /* Copy any remaining bytes, without going beyond end of buffers */
- if (n != 0) {
+ if (n <= 48) {
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+ return ret;
+ }
+ if (n <= 64) {
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+ return ret;
+ }
+ if (n <= 128) {
+ goto COPY_BLOCK_128_BACK15;
}
- return ret;
+ if (n <= 512) {
+ if (n >= 256) {
+ n -= 256;
+ rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
+ src = (const uint8_t *)src + 256;
+ dst = (uint8_t *)dst + 256;
+ }
+COPY_BLOCK_255_BACK15:
+ if (n >= 128) {
+ n -= 128;
+ rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 128;
+ dst = (uint8_t *)dst + 128;
+ }
+COPY_BLOCK_128_BACK15:
+ if (n >= 64) {
+ n -= 64;
+ rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 64;
+ dst = (uint8_t *)dst + 64;
+ }
+COPY_BLOCK_64_BACK15:
+ if (n >= 32) {
+ n -= 32;
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 32;
+ dst = (uint8_t *)dst + 32;
+ }
+ if (n > 16) {
+ rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+ return ret;
+ }
+ if (n > 0) {
+ rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
+ }
+ return ret;
+ }
+
+ /**
+ * Make store aligned when copy size exceeds 512 bytes,
+ * and make sure the first 15 bytes are copied, because
+ * unaligned copy functions require up to 15 bytes
+ * backwards access.
+ */
+ dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
+ n -= dstofss;
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + dstofss;
+ dst = (uint8_t *)dst + dstofss;
+ srcofs = (int)((long long)(const void *)src & 0x0F);
+
+ /**
+ * For aligned copy
+ */
+ if (srcofs == 0) {
+ /**
+ * Copy 256-byte blocks
+ */
+ for (; n >= 256; n -= 256) {
+ rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+ dst = (uint8_t *)dst + 256;
+ src = (const uint8_t *)src + 256;
+ }
+
+ /**
+ * Copy whatever left
+ */
+ goto COPY_BLOCK_255_BACK15;
+ }
+
+ /**
+ * For copy with unaligned load
+ */
+ MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
+
+ /**
+ * Copy whatever left
+ */
+ goto COPY_BLOCK_64_BACK15;
}
+#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
+
#ifdef __cplusplus
}
#endif
--
1.9.3
* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
` (3 preceding siblings ...)
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
@ 2015-01-29 6:16 ` Fu, JingguoX
2015-02-10 3:06 ` Liang, Cunming
2015-02-16 15:57 ` De Lara Guarch, Pablo
6 siblings, 0 replies; 12+ messages in thread
From: Fu, JingguoX @ 2015-01-29 6:16 UTC (permalink / raw)
To: Wang, Zhihong, dev
Basic Information
  Patch name:                   DPDK memcpy optimization v2
  Brief description / purpose:  Verify memory copy and memory copy performance cases on a variety of OSes
  Test flag:                    Tested-by
  Tester name:                  jingguox.fu at intel.com
  Test tool chain information:  N/A
  Commit ID:                    88fa98a60b34812bfed92e5b2706fcf7e1cbcbc8
  Test result summary:          Total 6 cases, 6 passed, 0 failed
Test environment
- Environment 1:
OS: Ubuntu12.04 3.2.0-23-generic X86_64
GCC: gcc version 4.6.3
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)
- Environment 2:
OS: Ubuntu14.04 3.13.0-24-generic
GCC: gcc version 4.8.2
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)
- Environment 3:
OS: Fedora18 3.6.10-4.fc18.x86_64
GCC: gcc version 4.7.2 20121109
CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)
Detailed Testing Information

Test case name:           test_memcpy
Test case description:    Create two buffers, and initialise one with random values. These are copied
                          to the second buffer and then compared to see if the copy was successful. The
                          bytes outside the copied area are also checked to make sure they were not changed.
Test sample/application:  test application in app/test
Command / instruction:
  # ./app/test/test -n 1 -c ffff
  #RTE>> memcpy_autotest
Expected:                 #RTE>> Test OK
Test result:              PASSED

Test case name:           test_memcpy_perf
Test case description:    A number of different sizes and cached/uncached permutations
Test sample/application:  test application in app/test
Command / instruction:
  # ./app/test/test -n 1 -c ffff
  #RTE>> memcpy_perf_autotest
Expected:                 #RTE>> Test OK
Test result:              PASSED
-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
Sent: Thursday, January 29, 2015 10:39
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
It also extends memcpy test coverage with unaligned cases and more test points.
Optimization techniques are summarized below:
1. Utilize full cache bandwidth
2. Enforce aligned stores
3. Apply load address alignment based on architecture features
4. Make load/store address available as early as possible
5. General optimization techniques like inlining, branch reducing, prefetch pattern access
--------------
Changes in v2:
1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast build
2. Modified macro definition for better code readability & safety
Zhihong Wang (4):
app/test: Disabled VTA for memcpy test in app/test/Makefile
app/test: Removed unnecessary test cases in app/test/test_memcpy.c
app/test: Extended test coverage in app/test/test_memcpy_perf.c
lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
and AVX platforms
app/test/Makefile | 6 +
app/test/test_memcpy.c | 52 +-
app/test/test_memcpy_perf.c | 220 ++++---
.../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++------
4 files changed, 654 insertions(+), 304 deletions(-)
--
1.9.3
* Re: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
@ 2015-01-29 15:17 ` Ananyev, Konstantin
2015-01-30 5:57 ` Wang, Zhihong
0 siblings, 1 reply; 12+ messages in thread
From: Ananyev, Konstantin @ 2015-01-29 15:17 UTC (permalink / raw)
To: Wang, Zhihong, dev
Hi Zhihong,
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
> Sent: Thursday, January 29, 2015 2:39 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
>
> Main code changes:
>
> 1. Differentiate architectural features based on CPU flags
>
> a. Implement separated move functions for SSE/AVX/AVX2 to make full utilization of cache bandwidth
>
> b. Implement separated copy flow specifically optimized for target architecture
>
> 2. Rewrite the memcpy function "rte_memcpy"
>
> a. Add store aligning
>
> b. Add load aligning based on architectural features
>
> c. Put block copy loop into inline move functions for better control of instruction order
>
> d. Eliminate unnecessary MOVs
>
> 3. Rewrite the inline move functions
>
> a. Add move functions for unaligned load cases
>
> b. Change instruction order in copy loops for better pipeline utilization
>
> c. Use intrinsics instead of assembly code
>
> 4. Remove slow glibc call for constant copies
>
> Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> ---
> .../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++------
> 1 file changed, 509 insertions(+), 171 deletions(-)
>
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> index fb9eba8..7b2d382 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> @@ -34,166 +34,189 @@
> #ifndef _RTE_MEMCPY_X86_64_H_
> #define _RTE_MEMCPY_X86_64_H_
>
> +/**
> + * @file
> + *
> + * Functions for SSE/AVX/AVX2 implementation of memcpy().
> + */
> +
> +#include <stdio.h>
> #include <stdint.h>
> #include <string.h>
> -#include <emmintrin.h>
> +#include <x86intrin.h>
>
> #ifdef __cplusplus
> extern "C" {
> #endif
>
> -#include "generic/rte_memcpy.h"
> +/**
> + * Copy bytes from one location to another. The locations must not overlap.
> + *
> + * @note This is implemented as a macro, so it's address should not be taken
> + * and care is needed as parameter expressions may be evaluated multiple times.
> + *
> + * @param dst
> + * Pointer to the destination of the data.
> + * @param src
> + * Pointer to the source data.
> + * @param n
> + * Number of bytes to copy.
> + * @return
> + * Pointer to the destination data.
> + */
> +static inline void *
> +rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
>
> -#ifdef __INTEL_COMPILER
> -#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */
> -#endif
> +#ifdef RTE_MACHINE_CPUFLAG_AVX2
>
> +/**
> + * AVX2 implementation below
> + */
> +
> +/**
> + * Copy 16 bytes from one location to another,
> + * locations should not overlap.
> + */
> static inline void
> rte_mov16(uint8_t *dst, const uint8_t *src)
> {
> - __m128i reg_a;
> - asm volatile (
> - "movdqu (%[src]), %[reg_a]\n\t"
> - "movdqu %[reg_a], (%[dst])\n\t"
> - : [reg_a] "=x" (reg_a)
> - : [src] "r" (src),
> - [dst] "r"(dst)
> - : "memory"
> - );
> + __m128i xmm0;
> +
> + xmm0 = _mm_loadu_si128((const __m128i *)src);
> + _mm_storeu_si128((__m128i *)dst, xmm0);
> }
>
> +/**
> + * Copy 32 bytes from one location to another,
> + * locations should not overlap.
> + */
> static inline void
> rte_mov32(uint8_t *dst, const uint8_t *src)
> {
> - __m128i reg_a, reg_b;
> - asm volatile (
> - "movdqu (%[src]), %[reg_a]\n\t"
> - "movdqu 16(%[src]), %[reg_b]\n\t"
> - "movdqu %[reg_a], (%[dst])\n\t"
> - "movdqu %[reg_b], 16(%[dst])\n\t"
> - : [reg_a] "=x" (reg_a),
> - [reg_b] "=x" (reg_b)
> - : [src] "r" (src),
> - [dst] "r"(dst)
> - : "memory"
> - );
> -}
> + __m256i ymm0;
>
> -static inline void
> -rte_mov48(uint8_t *dst, const uint8_t *src)
> -{
> - __m128i reg_a, reg_b, reg_c;
> - asm volatile (
> - "movdqu (%[src]), %[reg_a]\n\t"
> - "movdqu 16(%[src]), %[reg_b]\n\t"
> - "movdqu 32(%[src]), %[reg_c]\n\t"
> - "movdqu %[reg_a], (%[dst])\n\t"
> - "movdqu %[reg_b], 16(%[dst])\n\t"
> - "movdqu %[reg_c], 32(%[dst])\n\t"
> - : [reg_a] "=x" (reg_a),
> - [reg_b] "=x" (reg_b),
> - [reg_c] "=x" (reg_c)
> - : [src] "r" (src),
> - [dst] "r"(dst)
> - : "memory"
> - );
> + ymm0 = _mm256_loadu_si256((const __m256i *)src);
> + _mm256_storeu_si256((__m256i *)dst, ymm0);
> }
>
> +/**
> + * Copy 64 bytes from one location to another,
> + * locations should not overlap.
> + */
> static inline void
> rte_mov64(uint8_t *dst, const uint8_t *src)
> {
> - __m128i reg_a, reg_b, reg_c, reg_d;
> - asm volatile (
> - "movdqu (%[src]), %[reg_a]\n\t"
> - "movdqu 16(%[src]), %[reg_b]\n\t"
> - "movdqu 32(%[src]), %[reg_c]\n\t"
> - "movdqu 48(%[src]), %[reg_d]\n\t"
> - "movdqu %[reg_a], (%[dst])\n\t"
> - "movdqu %[reg_b], 16(%[dst])\n\t"
> - "movdqu %[reg_c], 32(%[dst])\n\t"
> - "movdqu %[reg_d], 48(%[dst])\n\t"
> - : [reg_a] "=x" (reg_a),
> - [reg_b] "=x" (reg_b),
> - [reg_c] "=x" (reg_c),
> - [reg_d] "=x" (reg_d)
> - : [src] "r" (src),
> - [dst] "r"(dst)
> - : "memory"
> - );
> + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> }
>
> +/**
> + * Copy 128 bytes from one location to another,
> + * locations should not overlap.
> + */
> static inline void
> rte_mov128(uint8_t *dst, const uint8_t *src)
> {
> - __m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h;
> - asm volatile (
> - "movdqu (%[src]), %[reg_a]\n\t"
> - "movdqu 16(%[src]), %[reg_b]\n\t"
> - "movdqu 32(%[src]), %[reg_c]\n\t"
> - "movdqu 48(%[src]), %[reg_d]\n\t"
> - "movdqu 64(%[src]), %[reg_e]\n\t"
> - "movdqu 80(%[src]), %[reg_f]\n\t"
> - "movdqu 96(%[src]), %[reg_g]\n\t"
> - "movdqu 112(%[src]), %[reg_h]\n\t"
> - "movdqu %[reg_a], (%[dst])\n\t"
> - "movdqu %[reg_b], 16(%[dst])\n\t"
> - "movdqu %[reg_c], 32(%[dst])\n\t"
> - "movdqu %[reg_d], 48(%[dst])\n\t"
> - "movdqu %[reg_e], 64(%[dst])\n\t"
> - "movdqu %[reg_f], 80(%[dst])\n\t"
> - "movdqu %[reg_g], 96(%[dst])\n\t"
> - "movdqu %[reg_h], 112(%[dst])\n\t"
> - : [reg_a] "=x" (reg_a),
> - [reg_b] "=x" (reg_b),
> - [reg_c] "=x" (reg_c),
> - [reg_d] "=x" (reg_d),
> - [reg_e] "=x" (reg_e),
> - [reg_f] "=x" (reg_f),
> - [reg_g] "=x" (reg_g),
> - [reg_h] "=x" (reg_h)
> - : [src] "r" (src),
> - [dst] "r"(dst)
> - : "memory"
> - );
> + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> }
>
> -#ifdef __INTEL_COMPILER
> -#pragma warning(enable:593)
> -#endif
> -
> +/**
> + * Copy 256 bytes from one location to another,
> + * locations should not overlap.
> + */
> static inline void
> rte_mov256(uint8_t *dst, const uint8_t *src)
> {
> - rte_mov128(dst, src);
> - rte_mov128(dst + 128, src + 128);
> + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> + rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
> + rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
> + rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
> + rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
> }
>
> -#define rte_memcpy(dst, src, n) \
> - ({ (__builtin_constant_p(n)) ? \
> - memcpy((dst), (src), (n)) : \
> - rte_memcpy_func((dst), (src), (n)); })
> +/**
> + * Copy 64-byte blocks from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n)
> +{
> + __m256i ymm0, ymm1;
> +
> + while (n >= 64) {
> + ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
> + n -= 64;
> + ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
> + src = (const uint8_t *)src + 64;
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
> + dst = (uint8_t *)dst + 64;
> + }
> +}
> +
> +/**
> + * Copy 256-byte blocks from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n)
> +{
> + __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
> +
> + while (n >= 256) {
> + ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
> + n -= 256;
> + ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
> + ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
> + ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
> + ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32));
> + ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32));
> + ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32));
> + ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32));
> + src = (const uint8_t *)src + 256;
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4);
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5);
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6);
> + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7);
> + dst = (uint8_t *)dst + 256;
> + }
> +}
>
> static inline void *
> -rte_memcpy_func(void *dst, const void *src, size_t n)
> +rte_memcpy(void *dst, const void *src, size_t n)
> {
> void *ret = dst;
> + int dstofss;
> + int bits;
>
> - /* We can't copy < 16 bytes using XMM registers so do it manually. */
> + /**
> + * Copy less than 16 bytes
> + */
> if (n < 16) {
> if (n & 0x01) {
> *(uint8_t *)dst = *(const uint8_t *)src;
> - dst = (uint8_t *)dst + 1;
> src = (const uint8_t *)src + 1;
> + dst = (uint8_t *)dst + 1;
> }
> if (n & 0x02) {
> *(uint16_t *)dst = *(const uint16_t *)src;
> - dst = (uint16_t *)dst + 1;
> src = (const uint16_t *)src + 1;
> + dst = (uint16_t *)dst + 1;
> }
> if (n & 0x04) {
> *(uint32_t *)dst = *(const uint32_t *)src;
> - dst = (uint32_t *)dst + 1;
> src = (const uint32_t *)src + 1;
> + dst = (uint32_t *)dst + 1;
> }
> if (n & 0x08) {
> *(uint64_t *)dst = *(const uint64_t *)src;
> @@ -201,95 +224,410 @@ rte_memcpy_func(void *dst, const void *src, size_t n)
> return ret;
> }
>
> - /* Special fast cases for <= 128 bytes */
> + /**
> + * Fast way when copy size doesn't exceed 512 bytes
> + */
> if (n <= 32) {
> rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> return ret;
> }
> -
> if (n <= 64) {
> rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
> return ret;
> }
> -
> - if (n <= 128) {
> - rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> - rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);
> + if (n <= 512) {
> + if (n >= 256) {
> + n -= 256;
> + rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> + src = (const uint8_t *)src + 256;
> + dst = (uint8_t *)dst + 256;
> + }
> + if (n >= 128) {
> + n -= 128;
> + rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> + src = (const uint8_t *)src + 128;
> + dst = (uint8_t *)dst + 128;
> + }
> + if (n >= 64) {
> + n -= 64;
> + rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> + src = (const uint8_t *)src + 64;
> + dst = (uint8_t *)dst + 64;
> + }
> +COPY_BLOCK_64_BACK31:
> + if (n > 32) {
> + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
> + return ret;
> + }
> + if (n > 0) {
> + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
> + }
> return ret;
> }
>
> - /*
> - * For large copies > 128 bytes. This combination of 256, 64 and 16 byte
> - * copies was found to be faster than doing 128 and 32 byte copies as
> - * well.
> + /**
> + * Make store aligned when copy size exceeds 512 bytes
> */
> - for ( ; n >= 256; n -= 256) {
> - rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> - dst = (uint8_t *)dst + 256;
> - src = (const uint8_t *)src + 256;
> + dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
> + n -= dstofss;
> + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> + src = (const uint8_t *)src + dstofss;
> + dst = (uint8_t *)dst + dstofss;
> +
> + /**
> + * Copy 256-byte blocks.
> + * Use copy block function for better instruction order control,
> + * which is important when load is unaligned.
> + */
> + rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
> + bits = n;
> + n = n & 255;
> + bits -= n;
> + src = (const uint8_t *)src + bits;
> + dst = (uint8_t *)dst + bits;
> +
> + /**
> + * Copy 64-byte blocks.
> + * Use copy block function for better instruction order control,
> + * which is important when load is unaligned.
> + */
> + if (n >= 64) {
> + rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
> + bits = n;
> + n = n & 63;
> + bits -= n;
> + src = (const uint8_t *)src + bits;
> + dst = (uint8_t *)dst + bits;
> }
>
> - /*
> - * We split the remaining bytes (which will be less than 256) into
> - * 64byte (2^6) chunks.
> - * Using incrementing integers in the case labels of a switch statement
> - * enourages the compiler to use a jump table. To get incrementing
> - * integers, we shift the 2 relevant bits to the LSB position to first
> - * get decrementing integers, and then subtract.
> + /**
> + * Copy whatever left
> */
> - switch (3 - (n >> 6)) {
> - case 0x00:
> - rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> - n -= 64;
> - dst = (uint8_t *)dst + 64;
> - src = (const uint8_t *)src + 64; /* fallthrough */
> - case 0x01:
> - rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> - n -= 64;
> - dst = (uint8_t *)dst + 64;
> - src = (const uint8_t *)src + 64; /* fallthrough */
> - case 0x02:
> - rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> - n -= 64;
> - dst = (uint8_t *)dst + 64;
> - src = (const uint8_t *)src + 64; /* fallthrough */
> - default:
> - ;
> + goto COPY_BLOCK_64_BACK31;
> +}
> +
> +#else /* RTE_MACHINE_CPUFLAG_AVX2 */
> +
> +/**
> + * SSE & AVX implementation below
> + */
> +
> +/**
> + * Copy 16 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov16(uint8_t *dst, const uint8_t *src)
> +{
> + __m128i xmm0;
> +
> + xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src);
> + _mm_storeu_si128((__m128i *)dst, xmm0);
> +}
> +
> +/**
> + * Copy 32 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov32(uint8_t *dst, const uint8_t *src)
> +{
> + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> +}
> +
> +/**
> + * Copy 64 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov64(uint8_t *dst, const uint8_t *src)
> +{
> + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> +}
> +
> +/**
> + * Copy 128 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov128(uint8_t *dst, const uint8_t *src)
> +{
> + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> +}
> +
> +/**
> + * Copy 256 bytes from one location to another,
> + * locations should not overlap.
> + */
> +static inline void
> +rte_mov256(uint8_t *dst, const uint8_t *src)
> +{
> + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> + rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
> + rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
> + rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
> + rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
> + rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
> + rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
> + rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
> + rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
> +}
> +
> +/**
> + * Macro for copying unaligned block from one location to another with constant load offset,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be immediate value within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset) \
> +({ \
> + int tmp; \
> + while (len >= 128 + 16 - offset) { \
> + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \
> + len -= 128; \
> + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \
> + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \
> + xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16)); \
> + xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16)); \
> + xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16)); \
> + xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16)); \
> + xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \
> + xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \
> + src = (const uint8_t *)src + 128; \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
> + dst = (uint8_t *)dst + 128; \
> + } \
> + tmp = len; \
> + len = ((len - 16 + offset) & 127) + 16 - offset; \
> + tmp -= len; \
> + src = (const uint8_t *)src + tmp; \
> + dst = (uint8_t *)dst + tmp; \
> + if (len >= 32 + 16 - offset) { \
> + while (len >= 32 + 16 - offset) { \
> + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \
> + len -= 32; \
> + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \
> + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \
> + src = (const uint8_t *)src + 32; \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
> + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
> + dst = (uint8_t *)dst + 32; \
> + } \
> + tmp = len; \
> + len = ((len - 16 + offset) & 31) + 16 - offset; \
> + tmp -= len; \
> + src = (const uint8_t *)src + tmp; \
> + dst = (uint8_t *)dst + tmp; \
> + } \
> +})
> +
> +/**
> + * Macro for copying unaligned block from one location to another,
> + * 47 bytes leftover maximum,
> + * locations should not overlap.
> + * Use switch here because the aligning instruction requires immediate value for shift count.
> + * Requirements:
> + * - Store is aligned
> + * - Load offset is <offset>, which must be within [1, 15]
> + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading
> + * - <dst>, <src>, <len> must be variables
> + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined
> + */
> +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \
> +({ \
> + switch (offset) { \
> + case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break; \
> + case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break; \
> + case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break; \
> + case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break; \
> + case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break; \
> + case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break; \
> + case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break; \
> + case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break; \
> + case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break; \
> + case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break; \
> + case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break; \
> + case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break; \
> + case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break; \
> + case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break; \
> + case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break; \
> + default:; \
> + } \
> +})
We probably didn't understand each other properly.
My thought was to do something like:
#define MOVEUNALIGNED_128(offset) do { \
_mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
_mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
_mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
_mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
_mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
_mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
_mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
_mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
} while(0);
Then at MOVEUNALIGNED_LEFT47_IMM:
....
xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \
xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \
src = (const uint8_t *)src + 128;
switch(offset) {
case 0x01: MOVEUNALIGNED_128(0x01); break;
...
case 0x0f: MOVEUNALIGNED_128(0x0f); break;
}
...
dst = (uint8_t *)dst + 128
And then in rte_memcpy() you don't need a separate switch around MOVEUNALIGNED_LEFT47_IMM itself.
I thought it might help to generate smaller code for rte_memcpy() while keeping performance reasonably high.
Konstantin
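Pulled together, the suggested rearrangement could look roughly like the sketch below. It is only an illustration: the unaligned loads, the alignr-based shift and the loop bound are taken from MOVEUNALIGNED_LEFT47_IMM above, while the standalone function wrapper, its name and the omission of the sub-128-byte leftover handling are assumptions made here for readability.

#include <stdint.h>
#include <stddef.h>
#include <tmmintrin.h>	/* _mm_alignr_epi8() needs SSSE3 */

/* Store one 128-byte chunk, stitching the nine loaded registers together
 * with a constant byte shift; dst and xmm0..xmm8 come from the surrounding scope. */
#define MOVEUNALIGNED_128(offset) do { \
	_mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
	_mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
	_mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
	_mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
	_mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
	_mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
	_mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
	_mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
} while (0)

/* Hypothetical helper name; main 128-byte loop with the offset dispatched per iteration. */
static inline void
copy128_blocks_unaligned(uint8_t *dst, const uint8_t *src, size_t len, int offset)
{
	__m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;

	while (len >= (size_t)(128 + 16 - offset)) {
		/* Unaligned loads exactly as in MOVEUNALIGNED_LEFT47_IMM */
		xmm0 = _mm_loadu_si128((const __m128i *)(src - offset + 0 * 16));
		xmm1 = _mm_loadu_si128((const __m128i *)(src - offset + 1 * 16));
		xmm2 = _mm_loadu_si128((const __m128i *)(src - offset + 2 * 16));
		xmm3 = _mm_loadu_si128((const __m128i *)(src - offset + 3 * 16));
		xmm4 = _mm_loadu_si128((const __m128i *)(src - offset + 4 * 16));
		xmm5 = _mm_loadu_si128((const __m128i *)(src - offset + 5 * 16));
		xmm6 = _mm_loadu_si128((const __m128i *)(src - offset + 6 * 16));
		xmm7 = _mm_loadu_si128((const __m128i *)(src - offset + 7 * 16));
		xmm8 = _mm_loadu_si128((const __m128i *)(src - offset + 8 * 16));
		src += 128;
		len -= 128;
		/* The aligning instruction still needs an immediate shift count,
		 * so the switch stays, but it now runs once per 128-byte block. */
		switch (offset) {
		case 0x01: MOVEUNALIGNED_128(0x01); break;
		case 0x02: MOVEUNALIGNED_128(0x02); break;
		case 0x03: MOVEUNALIGNED_128(0x03); break;
		case 0x04: MOVEUNALIGNED_128(0x04); break;
		case 0x05: MOVEUNALIGNED_128(0x05); break;
		case 0x06: MOVEUNALIGNED_128(0x06); break;
		case 0x07: MOVEUNALIGNED_128(0x07); break;
		case 0x08: MOVEUNALIGNED_128(0x08); break;
		case 0x09: MOVEUNALIGNED_128(0x09); break;
		case 0x0A: MOVEUNALIGNED_128(0x0A); break;
		case 0x0B: MOVEUNALIGNED_128(0x0B); break;
		case 0x0C: MOVEUNALIGNED_128(0x0C); break;
		case 0x0D: MOVEUNALIGNED_128(0x0D); break;
		case 0x0E: MOVEUNALIGNED_128(0x0E); break;
		case 0x0F: MOVEUNALIGNED_128(0x0F); break;
		default: break;
		}
		dst += 128;
	}
	/* The remaining < 128 + 16 - offset bytes would be handled like the
	 * 32-byte tail loop in MOVEUNALIGNED_LEFT47_IMM; omitted here. */
}

With that, rte_memcpy() could call such a routine directly for any run-time srcofs, at the cost of re-taking the dispatch for every 128-byte block instead of once per call.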
> +
> +static inline void *
> +rte_memcpy(void *dst, const void *src, size_t n)
> +{
> + __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> + void *ret = dst;
> + int dstofss;
> + int srcofs;
> +
> + /**
> + * Copy less than 16 bytes
> + */
> + if (n < 16) {
> + if (n & 0x01) {
> + *(uint8_t *)dst = *(const uint8_t *)src;
> + src = (const uint8_t *)src + 1;
> + dst = (uint8_t *)dst + 1;
> + }
> + if (n & 0x02) {
> + *(uint16_t *)dst = *(const uint16_t *)src;
> + src = (const uint16_t *)src + 1;
> + dst = (uint16_t *)dst + 1;
> + }
> + if (n & 0x04) {
> + *(uint32_t *)dst = *(const uint32_t *)src;
> + src = (const uint32_t *)src + 1;
> + dst = (uint32_t *)dst + 1;
> + }
> + if (n & 0x08) {
> + *(uint64_t *)dst = *(const uint64_t *)src;
> + }
> + return ret;
> }
>
> - /*
> - * We split the remaining bytes (which will be less than 64) into
> - * 16byte (2^4) chunks, using the same switch structure as above.
> + /**
> + * Fast way when copy size doesn't exceed 512 bytes
> */
> - switch (3 - (n >> 4)) {
> - case 0x00:
> - rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> - n -= 16;
> - dst = (uint8_t *)dst + 16;
> - src = (const uint8_t *)src + 16; /* fallthrough */
> - case 0x01:
> - rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> - n -= 16;
> - dst = (uint8_t *)dst + 16;
> - src = (const uint8_t *)src + 16; /* fallthrough */
> - case 0x02:
> + if (n <= 32) {
> rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> - n -= 16;
> - dst = (uint8_t *)dst + 16;
> - src = (const uint8_t *)src + 16; /* fallthrough */
> - default:
> - ;
> + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> + return ret;
> }
> -
> - /* Copy any remaining bytes, without going beyond end of buffers */
> - if (n != 0) {
> + if (n <= 48) {
> + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> + return ret;
> + }
> + if (n <= 64) {
> + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> + rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
> rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> + return ret;
> + }
> + if (n <= 128) {
> + goto COPY_BLOCK_128_BACK15;
> }
> - return ret;
> + if (n <= 512) {
> + if (n >= 256) {
> + n -= 256;
> + rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> + rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
> + src = (const uint8_t *)src + 256;
> + dst = (uint8_t *)dst + 256;
> + }
> +COPY_BLOCK_255_BACK15:
> + if (n >= 128) {
> + n -= 128;
> + rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> + src = (const uint8_t *)src + 128;
> + dst = (uint8_t *)dst + 128;
> + }
> +COPY_BLOCK_128_BACK15:
> + if (n >= 64) {
> + n -= 64;
> + rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> + src = (const uint8_t *)src + 64;
> + dst = (uint8_t *)dst + 64;
> + }
> +COPY_BLOCK_64_BACK15:
> + if (n >= 32) {
> + n -= 32;
> + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> + src = (const uint8_t *)src + 32;
> + dst = (uint8_t *)dst + 32;
> + }
> + if (n > 16) {
> + rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> + return ret;
> + }
> + if (n > 0) {
> + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> + }
> + return ret;
> + }
> +
> + /**
> + * Make store aligned when copy size exceeds 512 bytes,
> + * and make sure the first 15 bytes are copied, because
> + * unaligned copy functions require up to 15 bytes
> + * backwards access.
> + */
> + dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
> + n -= dstofss;
> + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> + src = (const uint8_t *)src + dstofss;
> + dst = (uint8_t *)dst + dstofss;
> + srcofs = (int)((long long)(const void *)src & 0x0F);
> +
> + /**
> + * For aligned copy
> + */
> + if (srcofs == 0) {
> + /**
> + * Copy 256-byte blocks
> + */
> + for (; n >= 256; n -= 256) {
> + rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> + dst = (uint8_t *)dst + 256;
> + src = (const uint8_t *)src + 256;
> + }
> +
> + /**
> + * Copy whatever left
> + */
> + goto COPY_BLOCK_255_BACK15;
> + }
> +
> + /**
> + * For copy with unaligned load
> + */
> + MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> +
> + /**
> + * Copy whatever left
> + */
> + goto COPY_BLOCK_64_BACK15;
> }
>
> +#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
> +
> #ifdef __cplusplus
> }
> #endif
> --
> 1.9.3
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
2015-01-29 15:17 ` Ananyev, Konstantin
@ 2015-01-30 5:57 ` Wang, Zhihong
2015-01-30 10:44 ` Ananyev, Konstantin
0 siblings, 1 reply; 12+ messages in thread
From: Wang, Zhihong @ 2015-01-30 5:57 UTC (permalink / raw)
To: Ananyev, Konstantin, dev
Hey Konstantin,
This method does reduce code size but leads to a significant performance drop.
I think we need to keep the original code.
Thanks
Zhihong (John)
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
2015-01-30 5:57 ` Wang, Zhihong
@ 2015-01-30 10:44 ` Ananyev, Konstantin
0 siblings, 0 replies; 12+ messages in thread
From: Ananyev, Konstantin @ 2015-01-30 10:44 UTC (permalink / raw)
To: Wang, Zhihong, dev
Hey Zhihong,
> -----Original Message-----
> From: Wang, Zhihong
> Sent: Friday, January 30, 2015 5:57 AM
> To: Ananyev, Konstantin; dev@dpdk.org
> Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms
>
> Hey Konstantin,
>
> This method does reduce code size but leads to a significant performance drop.
> I think we need to keep the original code.
Sure, no point in making it slower.
Thanks for trying it anyway.
Konstantin
>
>
> Thanks
> Zhihong (John)
>
>
> > -----Original Message-----
> > From: Ananyev, Konstantin
> > Sent: Thursday, January 29, 2015 11:18 PM
> > To: Wang, Zhihong; dev@dpdk.org
> > Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> > arch/x86/rte_memcpy.h for both SSE and AVX platforms
> >
> > Hi Zhihong,
> >
> > > -----Original Message-----
> > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
> > > Sent: Thursday, January 29, 2015 2:39 AM
> > > To: dev@dpdk.org
> > > Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in
> > > arch/x86/rte_memcpy.h for both SSE and AVX platforms
> > >
> > > Main code changes:
> > >
> > > 1. Differentiate architectural features based on CPU flags
> > >
> > > a. Implement separated move functions for SSE/AVX/AVX2 to make
> > > full utilization of cache bandwidth
> > >
> > > b. Implement separated copy flow specifically optimized for target
> > > architecture
> > >
> > > 2. Rewrite the memcpy function "rte_memcpy"
> > >
> > > a. Add store aligning
> > >
> > > b. Add load aligning based on architectural features
> > >
> > > c. Put block copy loop into inline move functions for better
> > > control of instruction order
> > >
> > > d. Eliminate unnecessary MOVs
> > >
> > > 3. Rewrite the inline move functions
> > >
> > > a. Add move functions for unaligned load cases
> > >
> > > b. Change instruction order in copy loops for better pipeline
> > > utilization
> > >
> > > c. Use intrinsics instead of assembly code
> > >
> > > 4. Remove slow glibc call for constant copies
> > >
> > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
> > > ---
> > > .../common/include/arch/x86/rte_memcpy.h | 680
> > +++++++++++++++------
> > > 1 file changed, 509 insertions(+), 171 deletions(-)
> > >
> > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > > b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > > index fb9eba8..7b2d382 100644
> > > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
> > > @@ -34,166 +34,189 @@
> > > #ifndef _RTE_MEMCPY_X86_64_H_
> > > #define _RTE_MEMCPY_X86_64_H_
> > >
> > > +/**
> > > + * @file
> > > + *
> > > + * Functions for SSE/AVX/AVX2 implementation of memcpy().
> > > + */
> > > +
> > > +#include <stdio.h>
> > > #include <stdint.h>
> > > #include <string.h>
> > > -#include <emmintrin.h>
> > > +#include <x86intrin.h>
> > >
> > > #ifdef __cplusplus
> > > extern "C" {
> > > #endif
> > >
> > > -#include "generic/rte_memcpy.h"
> > > +/**
> > > + * Copy bytes from one location to another. The locations must not
> > overlap.
> > > + *
> > > + * @note This is implemented as a macro, so it's address should not
> > > +be taken
> > > + * and care is needed as parameter expressions may be evaluated
> > multiple times.
> > > + *
> > > + * @param dst
> > > + * Pointer to the destination of the data.
> > > + * @param src
> > > + * Pointer to the source data.
> > > + * @param n
> > > + * Number of bytes to copy.
> > > + * @return
> > > + * Pointer to the destination data.
> > > + */
> > > +static inline void *
> > > +rte_memcpy(void *dst, const void *src, size_t n)
> > > +__attribute__((always_inline));
> > >
> > > -#ifdef __INTEL_COMPILER
> > > -#pragma warning(disable:593) /* Stop unused variable warning (reg_a
> > > etc). */ -#endif
> > > +#ifdef RTE_MACHINE_CPUFLAG_AVX2
> > >
> > > +/**
> > > + * AVX2 implementation below
> > > + */
> > > +
> > > +/**
> > > + * Copy 16 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > static inline void
> > > rte_mov16(uint8_t *dst, const uint8_t *src) {
> > > - __m128i reg_a;
> > > - asm volatile (
> > > - "movdqu (%[src]), %[reg_a]\n\t"
> > > - "movdqu %[reg_a], (%[dst])\n\t"
> > > - : [reg_a] "=x" (reg_a)
> > > - : [src] "r" (src),
> > > - [dst] "r"(dst)
> > > - : "memory"
> > > - );
> > > + __m128i xmm0;
> > > +
> > > + xmm0 = _mm_loadu_si128((const __m128i *)src);
> > > + _mm_storeu_si128((__m128i *)dst, xmm0);
> > > }
> > >
> > > +/**
> > > + * Copy 32 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > static inline void
> > > rte_mov32(uint8_t *dst, const uint8_t *src) {
> > > - __m128i reg_a, reg_b;
> > > - asm volatile (
> > > - "movdqu (%[src]), %[reg_a]\n\t"
> > > - "movdqu 16(%[src]), %[reg_b]\n\t"
> > > - "movdqu %[reg_a], (%[dst])\n\t"
> > > - "movdqu %[reg_b], 16(%[dst])\n\t"
> > > - : [reg_a] "=x" (reg_a),
> > > - [reg_b] "=x" (reg_b)
> > > - : [src] "r" (src),
> > > - [dst] "r"(dst)
> > > - : "memory"
> > > - );
> > > -}
> > > + __m256i ymm0;
> > >
> > > -static inline void
> > > -rte_mov48(uint8_t *dst, const uint8_t *src)
> > > -{
> > > - __m128i reg_a, reg_b, reg_c;
> > > - asm volatile (
> > > - "movdqu (%[src]), %[reg_a]\n\t"
> > > - "movdqu 16(%[src]), %[reg_b]\n\t"
> > > - "movdqu 32(%[src]), %[reg_c]\n\t"
> > > - "movdqu %[reg_a], (%[dst])\n\t"
> > > - "movdqu %[reg_b], 16(%[dst])\n\t"
> > > - "movdqu %[reg_c], 32(%[dst])\n\t"
> > > - : [reg_a] "=x" (reg_a),
> > > - [reg_b] "=x" (reg_b),
> > > - [reg_c] "=x" (reg_c)
> > > - : [src] "r" (src),
> > > - [dst] "r"(dst)
> > > - : "memory"
> > > - );
> > > + ymm0 = _mm256_loadu_si256((const __m256i *)src);
> > > + _mm256_storeu_si256((__m256i *)dst, ymm0);
> > > }
> > >
> > > +/**
> > > + * Copy 64 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > static inline void
> > > rte_mov64(uint8_t *dst, const uint8_t *src) {
> > > - __m128i reg_a, reg_b, reg_c, reg_d;
> > > - asm volatile (
> > > - "movdqu (%[src]), %[reg_a]\n\t"
> > > - "movdqu 16(%[src]), %[reg_b]\n\t"
> > > - "movdqu 32(%[src]), %[reg_c]\n\t"
> > > - "movdqu 48(%[src]), %[reg_d]\n\t"
> > > - "movdqu %[reg_a], (%[dst])\n\t"
> > > - "movdqu %[reg_b], 16(%[dst])\n\t"
> > > - "movdqu %[reg_c], 32(%[dst])\n\t"
> > > - "movdqu %[reg_d], 48(%[dst])\n\t"
> > > - : [reg_a] "=x" (reg_a),
> > > - [reg_b] "=x" (reg_b),
> > > - [reg_c] "=x" (reg_c),
> > > - [reg_d] "=x" (reg_d)
> > > - : [src] "r" (src),
> > > - [dst] "r"(dst)
> > > - : "memory"
> > > - );
> > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > > }
> > >
> > > +/**
> > > + * Copy 128 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > static inline void
> > > rte_mov128(uint8_t *dst, const uint8_t *src) {
> > > - __m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h;
> > > - asm volatile (
> > > - "movdqu (%[src]), %[reg_a]\n\t"
> > > - "movdqu 16(%[src]), %[reg_b]\n\t"
> > > - "movdqu 32(%[src]), %[reg_c]\n\t"
> > > - "movdqu 48(%[src]), %[reg_d]\n\t"
> > > - "movdqu 64(%[src]), %[reg_e]\n\t"
> > > - "movdqu 80(%[src]), %[reg_f]\n\t"
> > > - "movdqu 96(%[src]), %[reg_g]\n\t"
> > > - "movdqu 112(%[src]), %[reg_h]\n\t"
> > > - "movdqu %[reg_a], (%[dst])\n\t"
> > > - "movdqu %[reg_b], 16(%[dst])\n\t"
> > > - "movdqu %[reg_c], 32(%[dst])\n\t"
> > > - "movdqu %[reg_d], 48(%[dst])\n\t"
> > > - "movdqu %[reg_e], 64(%[dst])\n\t"
> > > - "movdqu %[reg_f], 80(%[dst])\n\t"
> > > - "movdqu %[reg_g], 96(%[dst])\n\t"
> > > - "movdqu %[reg_h], 112(%[dst])\n\t"
> > > - : [reg_a] "=x" (reg_a),
> > > - [reg_b] "=x" (reg_b),
> > > - [reg_c] "=x" (reg_c),
> > > - [reg_d] "=x" (reg_d),
> > > - [reg_e] "=x" (reg_e),
> > > - [reg_f] "=x" (reg_f),
> > > - [reg_g] "=x" (reg_g),
> > > - [reg_h] "=x" (reg_h)
> > > - : [src] "r" (src),
> > > - [dst] "r"(dst)
> > > - : "memory"
> > > - );
> > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > > + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> > > + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> > > }
> > >
> > > -#ifdef __INTEL_COMPILER
> > > -#pragma warning(enable:593)
> > > -#endif
> > > -
> > > +/**
> > > + * Copy 256 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > static inline void
> > > rte_mov256(uint8_t *dst, const uint8_t *src) {
> > > - rte_mov128(dst, src);
> > > - rte_mov128(dst + 128, src + 128);
> > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32);
> > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32);
> > > + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32);
> > > + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32);
> > > + rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32);
> > > + rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32);
> > > + rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32);
> > > + rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32);
> > > }
> > >
> > > -#define rte_memcpy(dst, src, n) \
> > > - ({ (__builtin_constant_p(n)) ? \
> > > - memcpy((dst), (src), (n)) : \
> > > - rte_memcpy_func((dst), (src), (n)); })
> > > +/**
> > > + * Copy 64-byte blocks from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n) {
> > > + __m256i ymm0, ymm1;
> > > +
> > > + while (n >= 64) {
> > > + ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
> > > + n -= 64;
> > > + ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
> > > + src = (const uint8_t *)src + 64;
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
> > > + dst = (uint8_t *)dst + 64;
> > > + }
> > > +}
> > > +
> > > +/**
> > > + * Copy 256-byte blocks from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n) {
> > > + __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7;
> > > +
> > > + while (n >= 256) {
> > > + ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32));
> > > + n -= 256;
> > > + ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32));
> > > + ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32));
> > > + ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32));
> > > + ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32));
> > > + ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32));
> > > + ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32));
> > > + ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32));
> > > + src = (const uint8_t *)src + 256;
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0);
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1);
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2);
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3);
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4);
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5);
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6);
> > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7);
> > > + dst = (uint8_t *)dst + 256;
> > > + }
> > > +}
> > >
> > > static inline void *
> > > -rte_memcpy_func(void *dst, const void *src, size_t n)
> > > +rte_memcpy(void *dst, const void *src, size_t n)
> > > {
> > > void *ret = dst;
> > > + int dstofss;
> > > + int bits;
> > >
> > > - /* We can't copy < 16 bytes using XMM registers so do it manually. */
> > > + /**
> > > + * Copy less than 16 bytes
> > > + */
> > > if (n < 16) {
> > > if (n & 0x01) {
> > > *(uint8_t *)dst = *(const uint8_t *)src;
> > > - dst = (uint8_t *)dst + 1;
> > > src = (const uint8_t *)src + 1;
> > > + dst = (uint8_t *)dst + 1;
> > > }
> > > if (n & 0x02) {
> > > *(uint16_t *)dst = *(const uint16_t *)src;
> > > - dst = (uint16_t *)dst + 1;
> > > src = (const uint16_t *)src + 1;
> > > + dst = (uint16_t *)dst + 1;
> > > }
> > > if (n & 0x04) {
> > > *(uint32_t *)dst = *(const uint32_t *)src;
> > > - dst = (uint32_t *)dst + 1;
> > > src = (const uint32_t *)src + 1;
> > > + dst = (uint32_t *)dst + 1;
> > > }
> > > if (n & 0x08) {
> > > *(uint64_t *)dst = *(const uint64_t *)src;
> > > @@ -201,95 +224,410 @@ rte_memcpy_func(void *dst, const void *src, size_t n)
> > > return ret;
> > > }
> > >
> > > - /* Special fast cases for <= 128 bytes */
> > > + /**
> > > + * Fast way when copy size doesn't exceed 512 bytes
> > > + */
> > > if (n <= 32) {
> > > rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > > return ret;
> > > }
> > > -
> > > if (n <= 64) {
> > > rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
> > > return ret;
> > > }
> > > -
> > > - if (n <= 128) {
> > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > - rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n);
> > > + if (n <= 512) {
> > > + if (n >= 256) {
> > > + n -= 256;
> > > + rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> > > + src = (const uint8_t *)src + 256;
> > > + dst = (uint8_t *)dst + 256;
> > > + }
> > > + if (n >= 128) {
> > > + n -= 128;
> > > + rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> > > + src = (const uint8_t *)src + 128;
> > > + dst = (uint8_t *)dst + 128;
> > > + }
> > > + if (n >= 64) {
> > > + n -= 64;
> > > + rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > + src = (const uint8_t *)src + 64;
> > > + dst = (uint8_t *)dst + 64;
> > > + }
> > > +COPY_BLOCK_64_BACK31:
> > > + if (n > 32) {
> > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
> > > + return ret;
> > > + }
> > > + if (n > 0) {
> > > + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n);
> > > + }
> > > return ret;
> > > }
> > >
> > > - /*
> > > - * For large copies > 128 bytes. This combination of 256, 64 and 16 byte
> > > - * copies was found to be faster than doing 128 and 32 byte copies as
> > > - * well.
> > > + /**
> > > + * Make store aligned when copy size exceeds 512 bytes
> > > */
> > > - for ( ; n >= 256; n -= 256) {
> > > - rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> > > - dst = (uint8_t *)dst + 256;
> > > - src = (const uint8_t *)src + 256;
> > > + dstofss = 32 - (int)((long long)(void *)dst & 0x1F);
> > > + n -= dstofss;
> > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > + src = (const uint8_t *)src + dstofss;
> > > + dst = (uint8_t *)dst + dstofss;
> > > +
> > > + /**
> > > + * Copy 256-byte blocks.
> > > + * Use copy block function for better instruction order control,
> > > + * which is important when load is unaligned.
> > > + */
> > > + rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n);
> > > + bits = n;
> > > + n = n & 255;
> > > + bits -= n;
> > > + src = (const uint8_t *)src + bits;
> > > + dst = (uint8_t *)dst + bits;
> > > +
> > > + /**
> > > + * Copy 64-byte blocks.
> > > + * Use copy block function for better instruction order control,
> > > + * which is important when load is unaligned.
> > > + */
> > > + if (n >= 64) {
> > > + rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n);
> > > + bits = n;
> > > + n = n & 63;
> > > + bits -= n;
> > > + src = (const uint8_t *)src + bits;
> > > + dst = (uint8_t *)dst + bits;
> > > }
> > >
> > > - /*
> > > - * We split the remaining bytes (which will be less than 256) into
> > > - * 64byte (2^6) chunks.
> > > - * Using incrementing integers in the case labels of a switch
> > statement
> > > - * enourages the compiler to use a jump table. To get incrementing
> > > - * integers, we shift the 2 relevant bits to the LSB position to first
> > > - * get decrementing integers, and then subtract.
> > > + /**
> > > + * Copy whatever left
> > > */
> > > - switch (3 - (n >> 6)) {
> > > - case 0x00:
> > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > - n -= 64;
> > > - dst = (uint8_t *)dst + 64;
> > > - src = (const uint8_t *)src + 64; /* fallthrough */
> > > - case 0x01:
> > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > - n -= 64;
> > > - dst = (uint8_t *)dst + 64;
> > > - src = (const uint8_t *)src + 64; /* fallthrough */
> > > - case 0x02:
> > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > - n -= 64;
> > > - dst = (uint8_t *)dst + 64;
> > > - src = (const uint8_t *)src + 64; /* fallthrough */
> > > - default:
> > > - ;
> > > + goto COPY_BLOCK_64_BACK31;
> > > +}
> > > +
> > > +#else /* RTE_MACHINE_CPUFLAG_AVX2 */
> > > +
> > > +/**
> > > + * SSE & AVX implementation below
> > > + */
> > > +
> > > +/**
> > > + * Copy 16 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov16(uint8_t *dst, const uint8_t *src) {
> > > + __m128i xmm0;
> > > +
> > > + xmm0 = _mm_loadu_si128((const __m128i *)src);
> > > + _mm_storeu_si128((__m128i *)dst, xmm0);
> > > +}
> > > +
> > > +/**
> > > + * Copy 32 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov32(uint8_t *dst, const uint8_t *src) {
> > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); }
> > > +
> > > +/**
> > > + * Copy 64 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov64(uint8_t *dst, const uint8_t *src) {
> > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); }
> > > +
> > > +/**
> > > + * Copy 128 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov128(uint8_t *dst, const uint8_t *src) {
> > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> > > + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> > > + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> > > + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> > > + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); }
> > > +
> > > +/**
> > > + * Copy 256 bytes from one location to another,
> > > + * locations should not overlap.
> > > + */
> > > +static inline void
> > > +rte_mov256(uint8_t *dst, const uint8_t *src) {
> > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16);
> > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16);
> > > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16);
> > > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16);
> > > + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16);
> > > + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16);
> > > + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16);
> > > + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16);
> > > + rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16);
> > > + rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16);
> > > + rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16);
> > > + rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16);
> > > + rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16);
> > > + rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16);
> > > + rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16);
> > > + rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16);
> > > +}
> > > +
> > > +/**
> > > + * Macro for copying an unaligned block from one location to another with a constant load offset,
> > > + * 47 bytes leftover maximum,
> > > + * locations should not overlap.
> > > + * Requirements:
> > > + * - Store is aligned
> > > + * - Load offset is <offset>, which must be an immediate value within [1, 15]
> > > + * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
> > > + * - <dst>, <src>, <len> must be variables
> > > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined */
> > > +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset) \
> > > +({ \
> > > + int tmp; \
> > > + while (len >= 128 + 16 - offset) { \
> > > + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \
> > > + len -= 128; \
> > > + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \
> > > + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \
> > > + xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16)); \
> > > + xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16)); \
> > > + xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16)); \
> > > + xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16)); \
> > > + xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \
> > > + xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \
> > > + src = (const uint8_t *)src + 128; \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
> > > + dst = (uint8_t *)dst + 128; \
> > > + } \
> > > + tmp = len; \
> > > + len = ((len - 16 + offset) & 127) + 16 - offset; \
> > > + tmp -= len; \
> > > + src = (const uint8_t *)src + tmp; \
> > > + dst = (uint8_t *)dst + tmp; \
> > > + if (len >= 32 + 16 - offset) { \
> > > + while (len >= 32 + 16 - offset) { \
> > > + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \
> > > + len -= 32; \
> > > + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \
> > > + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \
> > > + src = (const uint8_t *)src + 32; \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
> > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
> > > + dst = (uint8_t *)dst + 32; \
> > > + } \
> > > + tmp = len; \
> > > + len = ((len - 16 + offset) & 31) + 16 - offset; \
> > > + tmp -= len; \
> > > + src = (const uint8_t *)src + tmp; \
> > > + dst = (uint8_t *)dst + tmp; \
> > > + } \
> > > +})
> > > +
> > > +/**
> > > + * Macro for copying unaligned block from one location to another,
> > > + * 47 bytes leftover maximum,
> > > + * locations should not overlap.
> > > + * Use a switch here because the aligning instruction requires an immediate value for the shift count.
> > > + * Requirements:
> > > + * - Store is aligned
> > > + * - Load offset is <offset>, which must be within [1, 15]
> > > + * - For <src>, make sure <offset> bytes backwards & <16 - offset> bytes forwards are available for loading
> > > + * - <dst>, <src>, <len> must be variables
> > > + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined */
> > > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \
> > > +({ \
> > > + switch (offset) { \
> > > + case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break; \
> > > + case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break; \
> > > + case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break; \
> > > + case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break; \
> > > + case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break; \
> > > + case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break; \
> > > + case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break; \
> > > + case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break; \
> > > + case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break; \
> > > + case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break; \
> > > + case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break; \
> > > + case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break; \
> > > + case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break; \
> > > + case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break; \
> > > + case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break; \
> > > + default:; \
> > > + } \
> > > +})
> >
> > We probably didn't understand each other properly.
> > My thought was to do something like:
> >
> > #define MOVEUNALIGNED_128(offset) do { \
> > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
> > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
> > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
> > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
> > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
> > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
> > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
> > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
> > } while(0);
> >
> >
> > Then at MOVEUNALIGNED_LEFT47_IMM:
> > ....
> > xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \
> > xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \
> > src = (const uint8_t *)src + 128;
> >
> > switch(offset) {
> > case 0x01: MOVEUNALIGNED_128(1); break;
> > ...
> > case 0x0f: MOVEUNALIGNED_128(0x0f); break;
> > } ...
> > dst = (uint8_t *)dst + 128;
> >
> > And then in rte_memcpy() you don't need a switch for
> > MOVEUNALIGNED_LEFT47_IMM itself.
> > That might help to generate smaller code for rte_memcpy() while
> > keeping performance reasonably high.
> >
> > Konstantin
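A minimal, self-contained sketch of what the _mm_alignr_epi8 trick above does; the function and parameter names are invented for the illustration, and it is not part of the patch or of the suggestion.

#include <stdint.h>
#include <tmmintrin.h>    /* SSSE3: _mm_alignr_epi8 (PALIGNR) */

/* Two consecutive 16-byte loads from an address rounded down by the
 * source misalignment, combined with PALIGNR, reconstruct the 16 bytes
 * that start at the real (unaligned) source position.  The shift count
 * has to be a compile-time constant, which is why both the patch and
 * the suggestion above need a switch over the offset. */
static inline void
copy16_from_offset5(uint8_t *dst, const uint8_t *src_rounded_down)
{
    __m128i lo = _mm_loadu_si128((const __m128i *)(src_rounded_down + 0 * 16));
    __m128i hi = _mm_loadu_si128((const __m128i *)(src_rounded_down + 1 * 16));

    /* 5 is an example immediate in [1, 15]. */
    _mm_storeu_si128((__m128i *)dst, _mm_alignr_epi8(hi, lo, 5));
}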
> >
> >
> > > +
> > > +static inline void *
> > > +rte_memcpy(void *dst, const void *src, size_t n) {
> > > + __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;
> > > + void *ret = dst;
> > > + int dstofss;
> > > + int srcofs;
> > > +
> > > + /**
> > > + * Copy less than 16 bytes
> > > + */
> > > + if (n < 16) {
> > > + if (n & 0x01) {
> > > + *(uint8_t *)dst = *(const uint8_t *)src;
> > > + src = (const uint8_t *)src + 1;
> > > + dst = (uint8_t *)dst + 1;
> > > + }
> > > + if (n & 0x02) {
> > > + *(uint16_t *)dst = *(const uint16_t *)src;
> > > + src = (const uint16_t *)src + 1;
> > > + dst = (uint16_t *)dst + 1;
> > > + }
> > > + if (n & 0x04) {
> > > + *(uint32_t *)dst = *(const uint32_t *)src;
> > > + src = (const uint32_t *)src + 1;
> > > + dst = (uint32_t *)dst + 1;
> > > + }
> > > + if (n & 0x08) {
> > > + *(uint64_t *)dst = *(const uint64_t *)src;
> > > + }
> > > + return ret;
> > > }
> > >
> > > - /*
> > > - * We split the remaining bytes (which will be less than 64) into
> > > - * 16byte (2^4) chunks, using the same switch structure as above.
> > > + /**
> > > + * Fast way when copy size doesn't exceed 512 bytes
> > > */
> > > - switch (3 - (n >> 4)) {
> > > - case 0x00:
> > > - rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > - n -= 16;
> > > - dst = (uint8_t *)dst + 16;
> > > - src = (const uint8_t *)src + 16; /* fallthrough */
> > > - case 0x01:
> > > - rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > - n -= 16;
> > > - dst = (uint8_t *)dst + 16;
> > > - src = (const uint8_t *)src + 16; /* fallthrough */
> > > - case 0x02:
> > > + if (n <= 32) {
> > > rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > - n -= 16;
> > > - dst = (uint8_t *)dst + 16;
> > > - src = (const uint8_t *)src + 16; /* fallthrough */
> > > - default:
> > > - ;
> > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > > + return ret;
> > > }
> > > -
> > > - /* Copy any remaining bytes, without going beyond end of buffers
> > */
> > > - if (n != 0) {
> > > + if (n <= 48) {
> > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > > + return ret;
> > > + }
> > > + if (n <= 64) {
> > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > + rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32);
> > > rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > > + return ret;
> > > + }
> > > + if (n <= 128) {
> > > + goto COPY_BLOCK_128_BACK15;
> > > }
> > > - return ret;
> > > + if (n <= 512) {
> > > + if (n >= 256) {
> > > + n -= 256;
> > > + rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> > > + rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128);
> > > + src = (const uint8_t *)src + 256;
> > > + dst = (uint8_t *)dst + 256;
> > > + }
> > > +COPY_BLOCK_255_BACK15:
> > > + if (n >= 128) {
> > > + n -= 128;
> > > + rte_mov128((uint8_t *)dst, (const uint8_t *)src);
> > > + src = (const uint8_t *)src + 128;
> > > + dst = (uint8_t *)dst + 128;
> > > + }
> > > +COPY_BLOCK_128_BACK15:
> > > + if (n >= 64) {
> > > + n -= 64;
> > > + rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > > + src = (const uint8_t *)src + 64;
> > > + dst = (uint8_t *)dst + 64;
> > > + }
> > > +COPY_BLOCK_64_BACK15:
> > > + if (n >= 32) {
> > > + n -= 32;
> > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > + src = (const uint8_t *)src + 32;
> > > + dst = (uint8_t *)dst + 32;
> > > + }
> > > + if (n > 16) {
> > > + rte_mov16((uint8_t *)dst, (const uint8_t *)src);
> > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > > + return ret;
> > > + }
> > > + if (n > 0) {
> > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n);
> > > + }
> > > + return ret;
> > > + }
> > > +
> > > + /**
> > > + * Make store aligned when copy size exceeds 512 bytes,
> > > + * and make sure the first 15 bytes are copied, because
> > > + * unaligned copy functions require up to 15 bytes
> > > + * backwards access.
> > > + */
> > > + dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16;
> > > + n -= dstofss;
> > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src);
> > > + src = (const uint8_t *)src + dstofss;
> > > + dst = (uint8_t *)dst + dstofss;
> > > + srcofs = (int)((long long)(const void *)src & 0x0F);
> > > +
> > > + /**
> > > + * For aligned copy
> > > + */
> > > + if (srcofs == 0) {
> > > + /**
> > > + * Copy 256-byte blocks
> > > + */
> > > + for (; n >= 256; n -= 256) {
> > > + rte_mov256((uint8_t *)dst, (const uint8_t *)src);
> > > + dst = (uint8_t *)dst + 256;
> > > + src = (const uint8_t *)src + 256;
> > > + }
> > > +
> > > + /**
> > > + * Copy whatever left
> > > + */
> > > + goto COPY_BLOCK_255_BACK15;
> > > + }
> > > +
> > > + /**
> > > + * For copy with unaligned load
> > > + */
> > > + MOVEUNALIGNED_LEFT47(dst, src, n, srcofs);
> > > +
> > > + /**
> > > + * Copy whatever left
> > > + */
> > > + goto COPY_BLOCK_64_BACK15;
> > > }
> > >
> > > +#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
> > > +
> > > #ifdef __cplusplus
> > > }
> > > #endif
> > > --
> > > 1.9.3
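For context, a minimal usage sketch: the AVX2 and SSE paths above are selected at compile time through RTE_MACHINE_CPUFLAG_AVX2 (set by the DPDK build system for the target machine), so a caller only includes the header and calls rte_memcpy(). The caller below is hypothetical.

#include <stddef.h>
#include <stdint.h>
#include <rte_memcpy.h>

/* Hypothetical caller: rte_memcpy() internally takes the small-copy
 * fast paths for short lengths and the aligned block loops above for
 * long ones; nothing is dispatched at run time. */
static void
copy_packet(uint8_t *dst, const uint8_t *src, size_t pkt_len)
{
    rte_memcpy(dst, src, pkt_len);
}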
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
` (4 preceding siblings ...)
2015-01-29 6:16 ` [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Fu, JingguoX
@ 2015-02-10 3:06 ` Liang, Cunming
2015-02-16 15:57 ` De Lara Guarch, Pablo
6 siblings, 0 replies; 12+ messages in thread
From: Liang, Cunming @ 2015-02-10 3:06 UTC (permalink / raw)
To: Wang, Zhihong, dev
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
> Sent: Thursday, January 29, 2015 10:39 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
>
> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test points.
>
> Optimization techniques are summarized below:
>
> 1. Utilize full cache bandwidth
>
> 2. Enforce aligned stores
>
> 3. Apply load address alignment based on architecture features
>
> 4. Make load/store address available as early as possible
>
> 5. General optimization techniques like inlining, branch reducing, prefetch
> pattern access
>
> --------------
> Changes in v2:
>
> 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast build
>
> 2. Modified macro definition for better code readability & safety
>
> Zhihong Wang (4):
> app/test: Disabled VTA for memcpy test in app/test/Makefile
> app/test: Removed unnecessary test cases in app/test/test_memcpy.c
> app/test: Extended test coverage in app/test/test_memcpy_perf.c
> lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> and AVX platforms
>
> app/test/Makefile | 6 +
> app/test/test_memcpy.c | 52 +-
> app/test/test_memcpy_perf.c | 220 ++++---
> .../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++------
> 4 files changed, 654 insertions(+), 304 deletions(-)
>
> --
> 1.9.3
Acked-by: Cunming Liang <cunming.liang@intel.com>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
` (5 preceding siblings ...)
2015-02-10 3:06 ` Liang, Cunming
@ 2015-02-16 15:57 ` De Lara Guarch, Pablo
2015-02-25 10:46 ` Thomas Monjalon
6 siblings, 1 reply; 12+ messages in thread
From: De Lara Guarch, Pablo @ 2015-02-16 15:57 UTC (permalink / raw)
To: Wang, Zhihong, dev
> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
> Sent: Thursday, January 29, 2015 2:39 AM
> To: dev@dpdk.org
> Subject: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
>
> This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> It also extends memcpy test coverage with unaligned cases and more test
> points.
>
> Optimization techniques are summarized below:
>
> 1. Utilize full cache bandwidth
>
> 2. Enforce aligned stores
>
> 3. Apply load address alignment based on architecture features
>
> 4. Make load/store address available as early as possible
>
> 5. General optimization techniques like inlining, branch reducing, prefetch
> pattern access
>
> --------------
> Changes in v2:
>
> 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast
> build
>
> 2. Modified macro definition for better code readability & safety
>
> Zhihong Wang (4):
> app/test: Disabled VTA for memcpy test in app/test/Makefile
> app/test: Removed unnecessary test cases in app/test/test_memcpy.c
> app/test: Extended test coverage in app/test/test_memcpy_perf.c
> lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> and AVX platforms
>
> app/test/Makefile | 6 +
> app/test/test_memcpy.c | 52 +-
> app/test/test_memcpy_perf.c | 220 ++++---
> .../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++------
> 4 files changed, 654 insertions(+), 304 deletions(-)
>
> --
> 1.9.3
Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
2015-02-16 15:57 ` De Lara Guarch, Pablo
@ 2015-02-25 10:46 ` Thomas Monjalon
0 siblings, 0 replies; 12+ messages in thread
From: Thomas Monjalon @ 2015-02-25 10:46 UTC (permalink / raw)
To: Wang, Zhihong, konstantin.ananyev; +Cc: dev
> > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms.
> > It also extends memcpy test coverage with unaligned cases and more test
> > points.
> >
> > Optimization techniques are summarized below:
> >
> > 1. Utilize full cache bandwidth
> >
> > 2. Enforce aligned stores
> >
> > 3. Apply load address alignment based on architecture features
> >
> > 4. Make load/store address available as early as possible
> >
> > 5. General optimization techniques like inlining, branch reducing, prefetch
> > pattern access
> >
> > --------------
> > Changes in v2:
> >
> > 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast
> > build
> >
> > 2. Modified macro definition for better code readability & safety
> >
> > Zhihong Wang (4):
> > app/test: Disabled VTA for memcpy test in app/test/Makefile
> > app/test: Removed unnecessary test cases in app/test/test_memcpy.c
> > app/test: Extended test coverage in app/test/test_memcpy_perf.c
> > lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE
> > and AVX platforms
>
> Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com>
Applied, thanks for the great work!
Note: we are still looking for a maintainer of x86 EAL.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2015-02-25 10:47 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c Zhihong Wang
2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
2015-01-29 15:17 ` Ananyev, Konstantin
2015-01-30 5:57 ` Wang, Zhihong
2015-01-30 10:44 ` Ananyev, Konstantin
2015-01-29 6:16 ` [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Fu, JingguoX
2015-02-10 3:06 ` Liang, Cunming
2015-02-16 15:57 ` De Lara Guarch, Pablo
2015-02-25 10:46 ` Thomas Monjalon