* [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization @ 2015-01-29 2:38 Zhihong Wang 2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang ` (6 more replies) 0 siblings, 7 replies; 12+ messages in thread From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw) To: dev This patch set optimizes memcpy for DPDK for both SSE and AVX platforms. It also extends memcpy test coverage with unaligned cases and more test points. Optimization techniques are summarized below: 1. Utilize full cache bandwidth 2. Enforce aligned stores 3. Apply load address alignment based on architecture features 4. Make load/store address available as early as possible 5. General optimization techniques like inlining, branch reducing, prefetch pattern access -------------- Changes in v2: 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast build 2. Modified macro definition for better code readability & safety Zhihong Wang (4): app/test: Disabled VTA for memcpy test in app/test/Makefile app/test: Removed unnecessary test cases in app/test/test_memcpy.c app/test: Extended test coverage in app/test/test_memcpy_perf.c lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms app/test/Makefile | 6 + app/test/test_memcpy.c | 52 +- app/test/test_memcpy_perf.c | 220 ++++--- .../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++------ 4 files changed, 654 insertions(+), 304 deletions(-) -- 1.9.3 ^ permalink raw reply [flat|nested] 12+ messages in thread
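For readers skimming the list above, technique 2 ("enforce aligned stores") is the heart of the copy-flow rewrite in patch 4/4: copy one unaligned head chunk, then advance dst to the next 32-byte boundary so the main loop only issues stores to aligned addresses. Below is a minimal standalone sketch of that idea only; it is not part of the patch set, the function name is illustrative, memcpy() stands in for the SIMD move helpers, and it assumes n >= 32.

#include <stdint.h>
#include <stddef.h>
#include <string.h>

void
copy_with_aligned_stores(uint8_t *dst, const uint8_t *src, size_t n)
{
	/* bytes needed to reach the next 32-byte boundary of dst (1..32) */
	size_t dstofss = 32 - ((uintptr_t)dst & 0x1F);

	memcpy(dst, src, 32);		/* unaligned head, like rte_mov32() */
	dst += dstofss;			/* dst is now 32-byte aligned */
	src += dstofss;
	n -= dstofss;

	for (; n >= 32; n -= 32, dst += 32, src += 32)
		memcpy(dst, src, 32);	/* dst is 32-byte aligned on every iteration */

	if (n > 0)			/* overlapping tail copy, as in the patch */
		memcpy(dst + n - 32, src + n - 32, 32);
}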
* [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile 2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang @ 2015-01-29 2:38 ` Zhihong Wang 2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c Zhihong Wang ` (5 subsequent siblings) 6 siblings, 0 replies; 12+ messages in thread From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw) To: dev VTA is for debugging only, it increases compile time and binary size, especially when there're a lot of inlines. So disable it since memcpy test contains a lot of inline calls. Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> --- app/test/Makefile | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/app/test/Makefile b/app/test/Makefile index 4311f96..94dbadf 100644 --- a/app/test/Makefile +++ b/app/test/Makefile @@ -143,6 +143,12 @@ CFLAGS_test_kni.o += -Wno-deprecated-declarations endif CFLAGS += -D_GNU_SOURCE +# Disable VTA for memcpy test +ifeq ($(CC), gcc) +CFLAGS_test_memcpy.o += -fno-var-tracking-assignments +CFLAGS_test_memcpy_perf.o += -fno-var-tracking-assignments +endif + # this application needs libraries first DEPDIRS-y += lib -- 1.9.3 ^ permalink raw reply [flat|nested] 12+ messages in thread
* [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c 2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang 2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang @ 2015-01-29 2:38 ` Zhihong Wang 2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c Zhihong Wang ` (4 subsequent siblings) 6 siblings, 0 replies; 12+ messages in thread From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw) To: dev Removed unnecessary test cases for base move functions since the function "func_test" covers them all. Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> --- app/test/test_memcpy.c | 52 +------------------------------------------------- 1 file changed, 1 insertion(+), 51 deletions(-) diff --git a/app/test/test_memcpy.c b/app/test/test_memcpy.c index 56b8e1e..b2bb4e0 100644 --- a/app/test/test_memcpy.c +++ b/app/test/test_memcpy.c @@ -78,56 +78,9 @@ static size_t buf_sizes[TEST_VALUE_RANGE]; #define TEST_BATCH_SIZE 100 /* Data is aligned on this many bytes (power of 2) */ -#define ALIGNMENT_UNIT 16 +#define ALIGNMENT_UNIT 32 - -/* Structure with base memcpy func pointer, and number of bytes it copies */ -struct base_memcpy_func { - void (*func)(uint8_t *dst, const uint8_t *src); - unsigned size; -}; - -/* To create base_memcpy_func structure entries */ -#define BASE_FUNC(n) {rte_mov##n, n} - -/* Max number of bytes that can be copies with a "base" memcpy functions */ -#define MAX_BASE_FUNC_SIZE 256 - -/* - * Test the "base" memcpy functions, that a copy fixed number of bytes. - */ -static int -base_func_test(void) -{ - const struct base_memcpy_func base_memcpy_funcs[6] = { - BASE_FUNC(16), - BASE_FUNC(32), - BASE_FUNC(48), - BASE_FUNC(64), - BASE_FUNC(128), - BASE_FUNC(256), - }; - unsigned i, j; - unsigned num_funcs = sizeof(base_memcpy_funcs) / sizeof(base_memcpy_funcs[0]); - uint8_t dst[MAX_BASE_FUNC_SIZE]; - uint8_t src[MAX_BASE_FUNC_SIZE]; - - for (i = 0; i < num_funcs; i++) { - unsigned size = base_memcpy_funcs[i].size; - for (j = 0; j < size; j++) { - dst[j] = 0; - src[j] = (uint8_t) rte_rand(); - } - base_memcpy_funcs[i].func(dst, src); - for (j = 0; j < size; j++) - if (dst[j] != src[j]) - return -1; - } - - return 0; -} - /* * Create two buffers, and initialise one with random values. These are copied * to the second buffer and then compared to see if the copy was successful. @@ -218,9 +171,6 @@ test_memcpy(void) ret = func_test(); if (ret != 0) return -1; - ret = base_func_test(); - if (ret != 0) - return -1; return 0; } -- 1.9.3 ^ permalink raw reply [flat|nested] 12+ messages in thread
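Since func_test() already copies a randomly initialised buffer and compares it against the source for every size under test, it subsumes what base_func_test() checked per base move function. A hypothetical, stripped-down version of that kind of check is sketched below; the real test also sweeps buffer sizes and alignment offsets and uses rte_rand(), and the name check_one_copy and the 8192-byte cap are made up for illustration.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <rte_memcpy.h>

/* Returns 0 if an rte_memcpy() of `size` (<= 8192) bytes is byte-exact. */
int
check_one_copy(size_t size)
{
	uint8_t src[8192], dst[8192];
	size_t i;

	for (i = 0; i < size; i++)
		src[i] = (uint8_t)rand();
	memset(dst, 0, size);

	rte_memcpy(dst, src, size);
	return (memcmp(dst, src, size) == 0) ? 0 : -1;
}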
* [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c 2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang 2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang 2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c Zhihong Wang @ 2015-01-29 2:38 ` Zhihong Wang 2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang ` (3 subsequent siblings) 6 siblings, 0 replies; 12+ messages in thread From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw) To: dev Main code changes: 1. Added more typical data points for a thorough performance test 2. Added unaligned test cases since it's common in DPDK usage Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> --- app/test/test_memcpy_perf.c | 220 +++++++++++++++++++++++++++----------------- 1 file changed, 138 insertions(+), 82 deletions(-) diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c index 7809610..754828e 100644 --- a/app/test/test_memcpy_perf.c +++ b/app/test/test_memcpy_perf.c @@ -54,9 +54,10 @@ /* List of buffer sizes to test */ #if TEST_VALUE_RANGE == 0 static size_t buf_sizes[] = { - 0, 1, 7, 8, 9, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, 129, 255, - 256, 257, 320, 384, 511, 512, 513, 1023, 1024, 1025, 1518, 1522, 1600, - 2048, 3072, 4096, 5120, 6144, 7168, 8192 + 1, 2, 3, 4, 5, 6, 7, 8, 9, 12, 15, 16, 17, 31, 32, 33, 63, 64, 65, 127, 128, + 129, 191, 192, 193, 255, 256, 257, 319, 320, 321, 383, 384, 385, 447, 448, + 449, 511, 512, 513, 767, 768, 769, 1023, 1024, 1025, 1518, 1522, 1536, 1600, + 2048, 2560, 3072, 3584, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192 }; /* MUST be as large as largest packet size above */ #define SMALL_BUFFER_SIZE 8192 @@ -78,7 +79,7 @@ static size_t buf_sizes[TEST_VALUE_RANGE]; #define TEST_BATCH_SIZE 100 /* Data is aligned on this many bytes (power of 2) */ -#define ALIGNMENT_UNIT 16 +#define ALIGNMENT_UNIT 32 /* * Pointers used in performance tests. The two large buffers are for uncached @@ -94,19 +95,19 @@ init_buffers(void) { unsigned i; - large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE, ALIGNMENT_UNIT); + large_buf_read = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT); if (large_buf_read == NULL) goto error_large_buf_read; - large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE, ALIGNMENT_UNIT); + large_buf_write = rte_malloc("memcpy", LARGE_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT); if (large_buf_write == NULL) goto error_large_buf_write; - small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE, ALIGNMENT_UNIT); + small_buf_read = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT); if (small_buf_read == NULL) goto error_small_buf_read; - small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE, ALIGNMENT_UNIT); + small_buf_write = rte_malloc("memcpy", SMALL_BUFFER_SIZE + ALIGNMENT_UNIT, ALIGNMENT_UNIT); if (small_buf_write == NULL) goto error_small_buf_write; @@ -140,25 +141,25 @@ free_buffers(void) /* * Get a random offset into large array, with enough space needed to perform - * max copy size. Offset is aligned. + * max copy size. Offset is aligned, uoffset is used for unalignment setting. 
*/ static inline size_t -get_rand_offset(void) +get_rand_offset(size_t uoffset) { - return ((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) & - ~(ALIGNMENT_UNIT - 1)); + return (((rte_rand() % (LARGE_BUFFER_SIZE - SMALL_BUFFER_SIZE)) & + ~(ALIGNMENT_UNIT - 1)) + uoffset); } /* Fill in source and destination addresses. */ static inline void -fill_addr_arrays(size_t *dst_addr, int is_dst_cached, - size_t *src_addr, int is_src_cached) +fill_addr_arrays(size_t *dst_addr, int is_dst_cached, size_t dst_uoffset, + size_t *src_addr, int is_src_cached, size_t src_uoffset) { unsigned int i; for (i = 0; i < TEST_BATCH_SIZE; i++) { - dst_addr[i] = (is_dst_cached) ? 0 : get_rand_offset(); - src_addr[i] = (is_src_cached) ? 0 : get_rand_offset(); + dst_addr[i] = (is_dst_cached) ? dst_uoffset : get_rand_offset(dst_uoffset); + src_addr[i] = (is_src_cached) ? src_uoffset : get_rand_offset(src_uoffset); } } @@ -169,16 +170,17 @@ fill_addr_arrays(size_t *dst_addr, int is_dst_cached, */ static void do_uncached_write(uint8_t *dst, int is_dst_cached, - const uint8_t *src, int is_src_cached, size_t size) + const uint8_t *src, int is_src_cached, size_t size) { unsigned i, j; size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE]; for (i = 0; i < (TEST_ITERATIONS / TEST_BATCH_SIZE); i++) { - fill_addr_arrays(dst_addrs, is_dst_cached, - src_addrs, is_src_cached); - for (j = 0; j < TEST_BATCH_SIZE; j++) + fill_addr_arrays(dst_addrs, is_dst_cached, 0, + src_addrs, is_src_cached, 0); + for (j = 0; j < TEST_BATCH_SIZE; j++) { rte_memcpy(dst+dst_addrs[j], src+src_addrs[j], size); + } } } @@ -186,52 +188,111 @@ do_uncached_write(uint8_t *dst, int is_dst_cached, * Run a single memcpy performance test. This is a macro to ensure that if * the "size" parameter is a constant it won't be converted to a variable. 
*/ -#define SINGLE_PERF_TEST(dst, is_dst_cached, src, is_src_cached, size) do { \ - unsigned int iter, t; \ - size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE]; \ - uint64_t start_time, total_time = 0; \ - uint64_t total_time2 = 0; \ - for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) { \ - fill_addr_arrays(dst_addrs, is_dst_cached, \ - src_addrs, is_src_cached); \ - start_time = rte_rdtsc(); \ - for (t = 0; t < TEST_BATCH_SIZE; t++) \ - rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \ - total_time += rte_rdtsc() - start_time; \ - } \ - for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) { \ - fill_addr_arrays(dst_addrs, is_dst_cached, \ - src_addrs, is_src_cached); \ - start_time = rte_rdtsc(); \ - for (t = 0; t < TEST_BATCH_SIZE; t++) \ - memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \ - total_time2 += rte_rdtsc() - start_time; \ - } \ - printf("%8.0f -", (double)total_time /TEST_ITERATIONS); \ - printf("%5.0f", (double)total_time2 / TEST_ITERATIONS); \ +#define SINGLE_PERF_TEST(dst, is_dst_cached, dst_uoffset, \ + src, is_src_cached, src_uoffset, size) \ +do { \ + unsigned int iter, t; \ + size_t dst_addrs[TEST_BATCH_SIZE], src_addrs[TEST_BATCH_SIZE]; \ + uint64_t start_time, total_time = 0; \ + uint64_t total_time2 = 0; \ + for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) { \ + fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset, \ + src_addrs, is_src_cached, src_uoffset); \ + start_time = rte_rdtsc(); \ + for (t = 0; t < TEST_BATCH_SIZE; t++) \ + rte_memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \ + total_time += rte_rdtsc() - start_time; \ + } \ + for (iter = 0; iter < (TEST_ITERATIONS / TEST_BATCH_SIZE); iter++) { \ + fill_addr_arrays(dst_addrs, is_dst_cached, dst_uoffset, \ + src_addrs, is_src_cached, src_uoffset); \ + start_time = rte_rdtsc(); \ + for (t = 0; t < TEST_BATCH_SIZE; t++) \ + memcpy(dst+dst_addrs[t], src+src_addrs[t], size); \ + total_time2 += rte_rdtsc() - start_time; \ + } \ + printf("%8.0f -", (double)total_time /TEST_ITERATIONS); \ + printf("%5.0f", (double)total_time2 / TEST_ITERATIONS); \ } while (0) -/* Run memcpy() tests for each cached/uncached permutation. */ -#define ALL_PERF_TESTS_FOR_SIZE(n) do { \ - if (__builtin_constant_p(n)) \ - printf("\nC%6u", (unsigned)n); \ - else \ - printf("\n%7u", (unsigned)n); \ - SINGLE_PERF_TEST(small_buf_write, 1, small_buf_read, 1, n); \ - SINGLE_PERF_TEST(large_buf_write, 0, small_buf_read, 1, n); \ - SINGLE_PERF_TEST(small_buf_write, 1, large_buf_read, 0, n); \ - SINGLE_PERF_TEST(large_buf_write, 0, large_buf_read, 0, n); \ +/* Run aligned memcpy tests for each cached/uncached permutation */ +#define ALL_PERF_TESTS_FOR_SIZE(n) \ +do { \ + if (__builtin_constant_p(n)) \ + printf("\nC%6u", (unsigned)n); \ + else \ + printf("\n%7u", (unsigned)n); \ + SINGLE_PERF_TEST(small_buf_write, 1, 0, small_buf_read, 1, 0, n); \ + SINGLE_PERF_TEST(large_buf_write, 0, 0, small_buf_read, 1, 0, n); \ + SINGLE_PERF_TEST(small_buf_write, 1, 0, large_buf_read, 0, 0, n); \ + SINGLE_PERF_TEST(large_buf_write, 0, 0, large_buf_read, 0, 0, n); \ } while (0) -/* - * Run performance tests for a number of different sizes and cached/uncached - * permutations. 
- */ +/* Run unaligned memcpy tests for each cached/uncached permutation */ +#define ALL_PERF_TESTS_FOR_SIZE_UNALIGNED(n) \ +do { \ + if (__builtin_constant_p(n)) \ + printf("\nC%6u", (unsigned)n); \ + else \ + printf("\n%7u", (unsigned)n); \ + SINGLE_PERF_TEST(small_buf_write, 1, 1, small_buf_read, 1, 5, n); \ + SINGLE_PERF_TEST(large_buf_write, 0, 1, small_buf_read, 1, 5, n); \ + SINGLE_PERF_TEST(small_buf_write, 1, 1, large_buf_read, 0, 5, n); \ + SINGLE_PERF_TEST(large_buf_write, 0, 1, large_buf_read, 0, 5, n); \ +} while (0) + +/* Run memcpy tests for constant length */ +#define ALL_PERF_TEST_FOR_CONSTANT \ +do { \ + TEST_CONSTANT(6U); TEST_CONSTANT(64U); TEST_CONSTANT(128U); \ + TEST_CONSTANT(192U); TEST_CONSTANT(256U); TEST_CONSTANT(512U); \ + TEST_CONSTANT(768U); TEST_CONSTANT(1024U); TEST_CONSTANT(1536U); \ +} while (0) + +/* Run all memcpy tests for aligned constant cases */ +static inline void +perf_test_constant_aligned(void) +{ +#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE + ALL_PERF_TEST_FOR_CONSTANT; +#undef TEST_CONSTANT +} + +/* Run all memcpy tests for unaligned constant cases */ +static inline void +perf_test_constant_unaligned(void) +{ +#define TEST_CONSTANT ALL_PERF_TESTS_FOR_SIZE_UNALIGNED + ALL_PERF_TEST_FOR_CONSTANT; +#undef TEST_CONSTANT +} + +/* Run all memcpy tests for aligned variable cases */ +static inline void +perf_test_variable_aligned(void) +{ + unsigned n = sizeof(buf_sizes) / sizeof(buf_sizes[0]); + unsigned i; + for (i = 0; i < n; i++) { + ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]); + } +} + +/* Run all memcpy tests for unaligned variable cases */ +static inline void +perf_test_variable_unaligned(void) +{ + unsigned n = sizeof(buf_sizes) / sizeof(buf_sizes[0]); + unsigned i; + for (i = 0; i < n; i++) { + ALL_PERF_TESTS_FOR_SIZE_UNALIGNED((size_t)buf_sizes[i]); + } +} + +/* Run all memcpy tests */ static int perf_test(void) { - const unsigned num_buf_sizes = sizeof(buf_sizes) / sizeof(buf_sizes[0]); - unsigned i; int ret; ret = init_buffers(); @@ -239,7 +300,8 @@ perf_test(void) return ret; #if TEST_VALUE_RANGE != 0 - /* Setup buf_sizes array, if required */ + /* Set up buf_sizes array, if required */ + unsigned i; for (i = 0; i < TEST_VALUE_RANGE; i++) buf_sizes[i] = i; #endif @@ -248,28 +310,23 @@ perf_test(void) do_uncached_write(large_buf_write, 0, small_buf_read, 1, SMALL_BUFFER_SIZE); printf("\n** rte_memcpy() - memcpy perf. 
tests (C = compile-time constant) **\n" - "======= ============== ============== ============== ==============\n" - " Size Cache to cache Cache to mem Mem to cache Mem to mem\n" - "(bytes) (ticks) (ticks) (ticks) (ticks)\n" - "------- -------------- -------------- -------------- --------------"); - - /* Do tests where size is a variable */ - for (i = 0; i < num_buf_sizes; i++) { - ALL_PERF_TESTS_FOR_SIZE((size_t)buf_sizes[i]); - } + "======= ============== ============== ============== ==============\n" + " Size Cache to cache Cache to mem Mem to cache Mem to mem\n" + "(bytes) (ticks) (ticks) (ticks) (ticks)\n" + "------- -------------- -------------- -------------- --------------"); + + printf("\n========================== %2dB aligned ============================", ALIGNMENT_UNIT); + /* Do aligned tests where size is a variable */ + perf_test_variable_aligned(); printf("\n------- -------------- -------------- -------------- --------------"); - /* Do tests where size is a compile-time constant */ - ALL_PERF_TESTS_FOR_SIZE(63U); - ALL_PERF_TESTS_FOR_SIZE(64U); - ALL_PERF_TESTS_FOR_SIZE(65U); - ALL_PERF_TESTS_FOR_SIZE(255U); - ALL_PERF_TESTS_FOR_SIZE(256U); - ALL_PERF_TESTS_FOR_SIZE(257U); - ALL_PERF_TESTS_FOR_SIZE(1023U); - ALL_PERF_TESTS_FOR_SIZE(1024U); - ALL_PERF_TESTS_FOR_SIZE(1025U); - ALL_PERF_TESTS_FOR_SIZE(1518U); - + /* Do aligned tests where size is a compile-time constant */ + perf_test_constant_aligned(); + printf("\n=========================== Unaligned ============================="); + /* Do unaligned tests where size is a variable */ + perf_test_variable_unaligned(); + printf("\n------- -------------- -------------- -------------- --------------"); + /* Do unaligned tests where size is a compile-time constant */ + perf_test_constant_unaligned(); printf("\n======= ============== ============== ============== ==============\n\n"); free_buffers(); @@ -277,7 +334,6 @@ perf_test(void) return 0; } - static int test_memcpy_perf(void) { -- 1.9.3 ^ permalink raw reply [flat|nested] 12+ messages in thread
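One detail worth noting in this patch: the constant-size cases are listed once in ALL_PERF_TEST_FOR_CONSTANT, and TEST_CONSTANT is defined and undefined around each invocation so the same size list drives both the aligned and the unaligned runs. A self-contained sketch of that preprocessor pattern follows; the names (RUN, RUN_ALL_SIZES, RUN_ALIGNED, RUN_UNALIGNED) are invented for illustration and printf() stands in for the actual measurements.

#include <stdio.h>

/* The size list is written once; RUN is supplied around each use. */
#define RUN_ALL_SIZES \
do { \
	RUN(64U); RUN(256U); RUN(1024U); \
} while (0)

#define RUN_ALIGNED(n)   printf("aligned   copy of %u bytes\n", (unsigned)(n))
#define RUN_UNALIGNED(n) printf("unaligned copy of %u bytes\n", (unsigned)(n))

int
main(void)
{
	/* Replay the same list with different per-size macros, as the patch
	 * does with TEST_CONSTANT / ALL_PERF_TEST_FOR_CONSTANT. */
#define RUN RUN_ALIGNED
	RUN_ALL_SIZES;
#undef RUN
#define RUN RUN_UNALIGNED
	RUN_ALL_SIZES;
#undef RUN
	return 0;
}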
* [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms 2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang ` (2 preceding siblings ...) 2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c Zhihong Wang @ 2015-01-29 2:38 ` Zhihong Wang 2015-01-29 15:17 ` Ananyev, Konstantin 2015-01-29 6:16 ` [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Fu, JingguoX ` (2 subsequent siblings) 6 siblings, 1 reply; 12+ messages in thread From: Zhihong Wang @ 2015-01-29 2:38 UTC (permalink / raw) To: dev Main code changes: 1. Differentiate architectural features based on CPU flags a. Implement separated move functions for SSE/AVX/AVX2 to make full utilization of cache bandwidth b. Implement separated copy flow specifically optimized for target architecture 2. Rewrite the memcpy function "rte_memcpy" a. Add store aligning b. Add load aligning based on architectural features c. Put block copy loop into inline move functions for better control of instruction order d. Eliminate unnecessary MOVs 3. Rewrite the inline move functions a. Add move functions for unaligned load cases b. Change instruction order in copy loops for better pipeline utilization c. Use intrinsics instead of assembly code 4. Remove slow glibc call for constant copies Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> --- .../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++------ 1 file changed, 509 insertions(+), 171 deletions(-) diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h index fb9eba8..7b2d382 100644 --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h @@ -34,166 +34,189 @@ #ifndef _RTE_MEMCPY_X86_64_H_ #define _RTE_MEMCPY_X86_64_H_ +/** + * @file + * + * Functions for SSE/AVX/AVX2 implementation of memcpy(). + */ + +#include <stdio.h> #include <stdint.h> #include <string.h> -#include <emmintrin.h> +#include <x86intrin.h> #ifdef __cplusplus extern "C" { #endif -#include "generic/rte_memcpy.h" +/** + * Copy bytes from one location to another. The locations must not overlap. + * + * @note This is implemented as a macro, so it's address should not be taken + * and care is needed as parameter expressions may be evaluated multiple times. + * + * @param dst + * Pointer to the destination of the data. + * @param src + * Pointer to the source data. + * @param n + * Number of bytes to copy. + * @return + * Pointer to the destination data. + */ +static inline void * +rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline)); -#ifdef __INTEL_COMPILER -#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */ -#endif +#ifdef RTE_MACHINE_CPUFLAG_AVX2 +/** + * AVX2 implementation below + */ + +/** + * Copy 16 bytes from one location to another, + * locations should not overlap. + */ static inline void rte_mov16(uint8_t *dst, const uint8_t *src) { - __m128i reg_a; - asm volatile ( - "movdqu (%[src]), %[reg_a]\n\t" - "movdqu %[reg_a], (%[dst])\n\t" - : [reg_a] "=x" (reg_a) - : [src] "r" (src), - [dst] "r"(dst) - : "memory" - ); + __m128i xmm0; + + xmm0 = _mm_loadu_si128((const __m128i *)src); + _mm_storeu_si128((__m128i *)dst, xmm0); } +/** + * Copy 32 bytes from one location to another, + * locations should not overlap. 
+ */ static inline void rte_mov32(uint8_t *dst, const uint8_t *src) { - __m128i reg_a, reg_b; - asm volatile ( - "movdqu (%[src]), %[reg_a]\n\t" - "movdqu 16(%[src]), %[reg_b]\n\t" - "movdqu %[reg_a], (%[dst])\n\t" - "movdqu %[reg_b], 16(%[dst])\n\t" - : [reg_a] "=x" (reg_a), - [reg_b] "=x" (reg_b) - : [src] "r" (src), - [dst] "r"(dst) - : "memory" - ); -} + __m256i ymm0; -static inline void -rte_mov48(uint8_t *dst, const uint8_t *src) -{ - __m128i reg_a, reg_b, reg_c; - asm volatile ( - "movdqu (%[src]), %[reg_a]\n\t" - "movdqu 16(%[src]), %[reg_b]\n\t" - "movdqu 32(%[src]), %[reg_c]\n\t" - "movdqu %[reg_a], (%[dst])\n\t" - "movdqu %[reg_b], 16(%[dst])\n\t" - "movdqu %[reg_c], 32(%[dst])\n\t" - : [reg_a] "=x" (reg_a), - [reg_b] "=x" (reg_b), - [reg_c] "=x" (reg_c) - : [src] "r" (src), - [dst] "r"(dst) - : "memory" - ); + ymm0 = _mm256_loadu_si256((const __m256i *)src); + _mm256_storeu_si256((__m256i *)dst, ymm0); } +/** + * Copy 64 bytes from one location to another, + * locations should not overlap. + */ static inline void rte_mov64(uint8_t *dst, const uint8_t *src) { - __m128i reg_a, reg_b, reg_c, reg_d; - asm volatile ( - "movdqu (%[src]), %[reg_a]\n\t" - "movdqu 16(%[src]), %[reg_b]\n\t" - "movdqu 32(%[src]), %[reg_c]\n\t" - "movdqu 48(%[src]), %[reg_d]\n\t" - "movdqu %[reg_a], (%[dst])\n\t" - "movdqu %[reg_b], 16(%[dst])\n\t" - "movdqu %[reg_c], 32(%[dst])\n\t" - "movdqu %[reg_d], 48(%[dst])\n\t" - : [reg_a] "=x" (reg_a), - [reg_b] "=x" (reg_b), - [reg_c] "=x" (reg_c), - [reg_d] "=x" (reg_d) - : [src] "r" (src), - [dst] "r"(dst) - : "memory" - ); + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); } +/** + * Copy 128 bytes from one location to another, + * locations should not overlap. + */ static inline void rte_mov128(uint8_t *dst, const uint8_t *src) { - __m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h; - asm volatile ( - "movdqu (%[src]), %[reg_a]\n\t" - "movdqu 16(%[src]), %[reg_b]\n\t" - "movdqu 32(%[src]), %[reg_c]\n\t" - "movdqu 48(%[src]), %[reg_d]\n\t" - "movdqu 64(%[src]), %[reg_e]\n\t" - "movdqu 80(%[src]), %[reg_f]\n\t" - "movdqu 96(%[src]), %[reg_g]\n\t" - "movdqu 112(%[src]), %[reg_h]\n\t" - "movdqu %[reg_a], (%[dst])\n\t" - "movdqu %[reg_b], 16(%[dst])\n\t" - "movdqu %[reg_c], 32(%[dst])\n\t" - "movdqu %[reg_d], 48(%[dst])\n\t" - "movdqu %[reg_e], 64(%[dst])\n\t" - "movdqu %[reg_f], 80(%[dst])\n\t" - "movdqu %[reg_g], 96(%[dst])\n\t" - "movdqu %[reg_h], 112(%[dst])\n\t" - : [reg_a] "=x" (reg_a), - [reg_b] "=x" (reg_b), - [reg_c] "=x" (reg_c), - [reg_d] "=x" (reg_d), - [reg_e] "=x" (reg_e), - [reg_f] "=x" (reg_f), - [reg_g] "=x" (reg_g), - [reg_h] "=x" (reg_h) - : [src] "r" (src), - [dst] "r"(dst) - : "memory" - ); + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32); + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32); } -#ifdef __INTEL_COMPILER -#pragma warning(enable:593) -#endif - +/** + * Copy 256 bytes from one location to another, + * locations should not overlap. 
+ */ static inline void rte_mov256(uint8_t *dst, const uint8_t *src) { - rte_mov128(dst, src); - rte_mov128(dst + 128, src + 128); + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32); + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32); + rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32); + rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32); + rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32); + rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32); } -#define rte_memcpy(dst, src, n) \ - ({ (__builtin_constant_p(n)) ? \ - memcpy((dst), (src), (n)) : \ - rte_memcpy_func((dst), (src), (n)); }) +/** + * Copy 64-byte blocks from one location to another, + * locations should not overlap. + */ +static inline void +rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n) +{ + __m256i ymm0, ymm1; + + while (n >= 64) { + ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32)); + n -= 64; + ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32)); + src = (const uint8_t *)src + 64; + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0); + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1); + dst = (uint8_t *)dst + 64; + } +} + +/** + * Copy 256-byte blocks from one location to another, + * locations should not overlap. + */ +static inline void +rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n) +{ + __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7; + + while (n >= 256) { + ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32)); + n -= 256; + ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32)); + ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32)); + ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32)); + ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32)); + ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32)); + ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32)); + ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32)); + src = (const uint8_t *)src + 256; + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0); + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1); + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2); + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3); + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4); + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5); + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6); + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7); + dst = (uint8_t *)dst + 256; + } +} static inline void * -rte_memcpy_func(void *dst, const void *src, size_t n) +rte_memcpy(void *dst, const void *src, size_t n) { void *ret = dst; + int dstofss; + int bits; - /* We can't copy < 16 bytes using XMM registers so do it manually. 
*/ + /** + * Copy less than 16 bytes + */ if (n < 16) { if (n & 0x01) { *(uint8_t *)dst = *(const uint8_t *)src; - dst = (uint8_t *)dst + 1; src = (const uint8_t *)src + 1; + dst = (uint8_t *)dst + 1; } if (n & 0x02) { *(uint16_t *)dst = *(const uint16_t *)src; - dst = (uint16_t *)dst + 1; src = (const uint16_t *)src + 1; + dst = (uint16_t *)dst + 1; } if (n & 0x04) { *(uint32_t *)dst = *(const uint32_t *)src; - dst = (uint32_t *)dst + 1; src = (const uint32_t *)src + 1; + dst = (uint32_t *)dst + 1; } if (n & 0x08) { *(uint64_t *)dst = *(const uint64_t *)src; @@ -201,95 +224,410 @@ rte_memcpy_func(void *dst, const void *src, size_t n) return ret; } - /* Special fast cases for <= 128 bytes */ + /** + * Fast way when copy size doesn't exceed 512 bytes + */ if (n <= 32) { rte_mov16((uint8_t *)dst, (const uint8_t *)src); rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); return ret; } - if (n <= 64) { rte_mov32((uint8_t *)dst, (const uint8_t *)src); rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n); return ret; } - - if (n <= 128) { - rte_mov64((uint8_t *)dst, (const uint8_t *)src); - rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n); + if (n <= 512) { + if (n >= 256) { + n -= 256; + rte_mov256((uint8_t *)dst, (const uint8_t *)src); + src = (const uint8_t *)src + 256; + dst = (uint8_t *)dst + 256; + } + if (n >= 128) { + n -= 128; + rte_mov128((uint8_t *)dst, (const uint8_t *)src); + src = (const uint8_t *)src + 128; + dst = (uint8_t *)dst + 128; + } + if (n >= 64) { + n -= 64; + rte_mov64((uint8_t *)dst, (const uint8_t *)src); + src = (const uint8_t *)src + 64; + dst = (uint8_t *)dst + 64; + } +COPY_BLOCK_64_BACK31: + if (n > 32) { + rte_mov32((uint8_t *)dst, (const uint8_t *)src); + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n); + return ret; + } + if (n > 0) { + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n); + } return ret; } - /* - * For large copies > 128 bytes. This combination of 256, 64 and 16 byte - * copies was found to be faster than doing 128 and 32 byte copies as - * well. + /** + * Make store aligned when copy size exceeds 512 bytes */ - for ( ; n >= 256; n -= 256) { - rte_mov256((uint8_t *)dst, (const uint8_t *)src); - dst = (uint8_t *)dst + 256; - src = (const uint8_t *)src + 256; + dstofss = 32 - (int)((long long)(void *)dst & 0x1F); + n -= dstofss; + rte_mov32((uint8_t *)dst, (const uint8_t *)src); + src = (const uint8_t *)src + dstofss; + dst = (uint8_t *)dst + dstofss; + + /** + * Copy 256-byte blocks. + * Use copy block function for better instruction order control, + * which is important when load is unaligned. + */ + rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n); + bits = n; + n = n & 255; + bits -= n; + src = (const uint8_t *)src + bits; + dst = (uint8_t *)dst + bits; + + /** + * Copy 64-byte blocks. + * Use copy block function for better instruction order control, + * which is important when load is unaligned. + */ + if (n >= 64) { + rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n); + bits = n; + n = n & 63; + bits -= n; + src = (const uint8_t *)src + bits; + dst = (uint8_t *)dst + bits; } - /* - * We split the remaining bytes (which will be less than 256) into - * 64byte (2^6) chunks. - * Using incrementing integers in the case labels of a switch statement - * enourages the compiler to use a jump table. To get incrementing - * integers, we shift the 2 relevant bits to the LSB position to first - * get decrementing integers, and then subtract. 
+ /** + * Copy whatever left */ - switch (3 - (n >> 6)) { - case 0x00: - rte_mov64((uint8_t *)dst, (const uint8_t *)src); - n -= 64; - dst = (uint8_t *)dst + 64; - src = (const uint8_t *)src + 64; /* fallthrough */ - case 0x01: - rte_mov64((uint8_t *)dst, (const uint8_t *)src); - n -= 64; - dst = (uint8_t *)dst + 64; - src = (const uint8_t *)src + 64; /* fallthrough */ - case 0x02: - rte_mov64((uint8_t *)dst, (const uint8_t *)src); - n -= 64; - dst = (uint8_t *)dst + 64; - src = (const uint8_t *)src + 64; /* fallthrough */ - default: - ; + goto COPY_BLOCK_64_BACK31; +} + +#else /* RTE_MACHINE_CPUFLAG_AVX2 */ + +/** + * SSE & AVX implementation below + */ + +/** + * Copy 16 bytes from one location to another, + * locations should not overlap. + */ +static inline void +rte_mov16(uint8_t *dst, const uint8_t *src) +{ + __m128i xmm0; + + xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src); + _mm_storeu_si128((__m128i *)dst, xmm0); +} + +/** + * Copy 32 bytes from one location to another, + * locations should not overlap. + */ +static inline void +rte_mov32(uint8_t *dst, const uint8_t *src) +{ + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); +} + +/** + * Copy 64 bytes from one location to another, + * locations should not overlap. + */ +static inline void +rte_mov64(uint8_t *dst, const uint8_t *src) +{ + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); +} + +/** + * Copy 128 bytes from one location to another, + * locations should not overlap. + */ +static inline void +rte_mov128(uint8_t *dst, const uint8_t *src) +{ + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16); + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16); + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16); + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); +} + +/** + * Copy 256 bytes from one location to another, + * locations should not overlap. 
+ */ +static inline void +rte_mov256(uint8_t *dst, const uint8_t *src) +{ + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16); + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16); + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16); + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); + rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16); + rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16); + rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16); + rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16); + rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16); + rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16); + rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16); + rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16); +} + +/** + * Macro for copying unaligned block from one location to another with constant load offset, + * 47 bytes leftover maximum, + * locations should not overlap. + * Requirements: + * - Store is aligned + * - Load offset is <offset>, which must be immediate value within [1, 15] + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading + * - <dst>, <src>, <len> must be variables + * - __m128i <xmm0> ~ <xmm8> must be pre-defined + */ +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset) \ +({ \ + int tmp; \ + while (len >= 128 + 16 - offset) { \ + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \ + len -= 128; \ + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \ + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \ + xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16)); \ + xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16)); \ + xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16)); \ + xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16)); \ + xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \ + xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \ + src = (const uint8_t *)src + 128; \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \ + dst = (uint8_t *)dst + 128; \ + } \ + tmp = len; \ + len = ((len - 16 + offset) & 127) + 16 - offset; \ + tmp -= len; \ + src = (const uint8_t 
*)src + tmp; \ + dst = (uint8_t *)dst + tmp; \ + if (len >= 32 + 16 - offset) { \ + while (len >= 32 + 16 - offset) { \ + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \ + len -= 32; \ + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \ + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \ + src = (const uint8_t *)src + 32; \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \ + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \ + dst = (uint8_t *)dst + 32; \ + } \ + tmp = len; \ + len = ((len - 16 + offset) & 31) + 16 - offset; \ + tmp -= len; \ + src = (const uint8_t *)src + tmp; \ + dst = (uint8_t *)dst + tmp; \ + } \ +}) + +/** + * Macro for copying unaligned block from one location to another, + * 47 bytes leftover maximum, + * locations should not overlap. + * Use switch here because the aligning instruction requires immediate value for shift count. + * Requirements: + * - Store is aligned + * - Load offset is <offset>, which must be within [1, 15] + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading + * - <dst>, <src>, <len> must be variables + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined + */ +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \ +({ \ + switch (offset) { \ + case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break; \ + case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break; \ + case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break; \ + case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break; \ + case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break; \ + case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break; \ + case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break; \ + case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break; \ + case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break; \ + case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break; \ + case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break; \ + case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break; \ + case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break; \ + case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break; \ + case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break; \ + default:; \ + } \ +}) + +static inline void * +rte_memcpy(void *dst, const void *src, size_t n) +{ + __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8; + void *ret = dst; + int dstofss; + int srcofs; + + /** + * Copy less than 16 bytes + */ + if (n < 16) { + if (n & 0x01) { + *(uint8_t *)dst = *(const uint8_t *)src; + src = (const uint8_t *)src + 1; + dst = (uint8_t *)dst + 1; + } + if (n & 0x02) { + *(uint16_t *)dst = *(const uint16_t *)src; + src = (const uint16_t *)src + 1; + dst = (uint16_t *)dst + 1; + } + if (n & 0x04) { + *(uint32_t *)dst = *(const uint32_t *)src; + src = (const uint32_t *)src + 1; + dst = (uint32_t *)dst + 1; + } + if (n & 0x08) { + *(uint64_t *)dst = *(const uint64_t *)src; + } + return ret; } - /* - * We split the remaining bytes (which will be less than 64) into - * 16byte (2^4) chunks, using the same switch structure as above. 
+ /** + * Fast way when copy size doesn't exceed 512 bytes */ - switch (3 - (n >> 4)) { - case 0x00: - rte_mov16((uint8_t *)dst, (const uint8_t *)src); - n -= 16; - dst = (uint8_t *)dst + 16; - src = (const uint8_t *)src + 16; /* fallthrough */ - case 0x01: - rte_mov16((uint8_t *)dst, (const uint8_t *)src); - n -= 16; - dst = (uint8_t *)dst + 16; - src = (const uint8_t *)src + 16; /* fallthrough */ - case 0x02: + if (n <= 32) { rte_mov16((uint8_t *)dst, (const uint8_t *)src); - n -= 16; - dst = (uint8_t *)dst + 16; - src = (const uint8_t *)src + 16; /* fallthrough */ - default: - ; + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); + return ret; } - - /* Copy any remaining bytes, without going beyond end of buffers */ - if (n != 0) { + if (n <= 48) { + rte_mov32((uint8_t *)dst, (const uint8_t *)src); + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); + return ret; + } + if (n <= 64) { + rte_mov32((uint8_t *)dst, (const uint8_t *)src); + rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32); rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); + return ret; + } + if (n <= 128) { + goto COPY_BLOCK_128_BACK15; } - return ret; + if (n <= 512) { + if (n >= 256) { + n -= 256; + rte_mov128((uint8_t *)dst, (const uint8_t *)src); + rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128); + src = (const uint8_t *)src + 256; + dst = (uint8_t *)dst + 256; + } +COPY_BLOCK_255_BACK15: + if (n >= 128) { + n -= 128; + rte_mov128((uint8_t *)dst, (const uint8_t *)src); + src = (const uint8_t *)src + 128; + dst = (uint8_t *)dst + 128; + } +COPY_BLOCK_128_BACK15: + if (n >= 64) { + n -= 64; + rte_mov64((uint8_t *)dst, (const uint8_t *)src); + src = (const uint8_t *)src + 64; + dst = (uint8_t *)dst + 64; + } +COPY_BLOCK_64_BACK15: + if (n >= 32) { + n -= 32; + rte_mov32((uint8_t *)dst, (const uint8_t *)src); + src = (const uint8_t *)src + 32; + dst = (uint8_t *)dst + 32; + } + if (n > 16) { + rte_mov16((uint8_t *)dst, (const uint8_t *)src); + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); + return ret; + } + if (n > 0) { + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); + } + return ret; + } + + /** + * Make store aligned when copy size exceeds 512 bytes, + * and make sure the first 15 bytes are copied, because + * unaligned copy functions require up to 15 bytes + * backwards access. + */ + dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16; + n -= dstofss; + rte_mov32((uint8_t *)dst, (const uint8_t *)src); + src = (const uint8_t *)src + dstofss; + dst = (uint8_t *)dst + dstofss; + srcofs = (int)((long long)(const void *)src & 0x0F); + + /** + * For aligned copy + */ + if (srcofs == 0) { + /** + * Copy 256-byte blocks + */ + for (; n >= 256; n -= 256) { + rte_mov256((uint8_t *)dst, (const uint8_t *)src); + dst = (uint8_t *)dst + 256; + src = (const uint8_t *)src + 256; + } + + /** + * Copy whatever left + */ + goto COPY_BLOCK_255_BACK15; + } + + /** + * For copy with unaligned load + */ + MOVEUNALIGNED_LEFT47(dst, src, n, srcofs); + + /** + * Copy whatever left + */ + goto COPY_BLOCK_64_BACK15; } +#endif /* RTE_MACHINE_CPUFLAG_AVX2 */ + #ifdef __cplusplus } #endif -- 1.9.3 ^ permalink raw reply [flat|nested] 12+ messages in thread
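A note on the SSE path above: the MOVEUNALIGNED_LEFT47 switch exists because the aligning instruction PALIGNR (_mm_alignr_epi8) takes its byte shift count as an immediate, so each possible source misalignment needs its own expansion of MOVEUNALIGNED_LEFT47_IMM. The standalone sketch below shows only the underlying aligned-load / shift / aligned-store idea; it is not taken from the patch, OFFSET is an arbitrary example value, and it must be built with SSSE3 enabled (e.g. gcc -mssse3).

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <x86intrin.h>

#define OFFSET 3	/* hypothetical src misalignment, an immediate in [1, 15] */

int
main(void)
{
	uint8_t src_buf[64] __attribute__((aligned(16)));
	uint8_t dst[32] __attribute__((aligned(16)));
	const uint8_t *src = src_buf + OFFSET;	/* misaligned source pointer */
	int i;

	for (i = 0; i < 64; i++)
		src_buf[i] = (uint8_t)i;

	/* Loads come from 16-byte aligned addresses around the misaligned src. */
	__m128i xmm0 = _mm_load_si128((const __m128i *)(src - OFFSET) + 0);
	__m128i xmm1 = _mm_load_si128((const __m128i *)(src - OFFSET) + 1);
	__m128i xmm2 = _mm_load_si128((const __m128i *)(src - OFFSET) + 2);

	/* PALIGNR shifts each register pair right by OFFSET bytes so that the
	 * stores land on 16-byte aligned destination addresses. */
	_mm_store_si128((__m128i *)dst + 0, _mm_alignr_epi8(xmm1, xmm0, OFFSET));
	_mm_store_si128((__m128i *)dst + 1, _mm_alignr_epi8(xmm2, xmm1, OFFSET));

	printf("%s\n", memcmp(dst, src, 32) == 0 ? "copy matches" : "copy differs");
	return 0;
}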
* Re: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms 2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang @ 2015-01-29 15:17 ` Ananyev, Konstantin 2015-01-30 5:57 ` Wang, Zhihong 0 siblings, 1 reply; 12+ messages in thread From: Ananyev, Konstantin @ 2015-01-29 15:17 UTC (permalink / raw) To: Wang, Zhihong, dev Hi Zhihong, > -----Original Message----- > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang > Sent: Thursday, January 29, 2015 2:39 AM > To: dev@dpdk.org > Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms > > Main code changes: > > 1. Differentiate architectural features based on CPU flags > > a. Implement separated move functions for SSE/AVX/AVX2 to make full utilization of cache bandwidth > > b. Implement separated copy flow specifically optimized for target architecture > > 2. Rewrite the memcpy function "rte_memcpy" > > a. Add store aligning > > b. Add load aligning based on architectural features > > c. Put block copy loop into inline move functions for better control of instruction order > > d. Eliminate unnecessary MOVs > > 3. Rewrite the inline move functions > > a. Add move functions for unaligned load cases > > b. Change instruction order in copy loops for better pipeline utilization > > c. Use intrinsics instead of assembly code > > 4. Remove slow glibc call for constant copies > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> > --- > .../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++------ > 1 file changed, 509 insertions(+), 171 deletions(-) > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > index fb9eba8..7b2d382 100644 > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > @@ -34,166 +34,189 @@ > #ifndef _RTE_MEMCPY_X86_64_H_ > #define _RTE_MEMCPY_X86_64_H_ > > +/** > + * @file > + * > + * Functions for SSE/AVX/AVX2 implementation of memcpy(). > + */ > + > +#include <stdio.h> > #include <stdint.h> > #include <string.h> > -#include <emmintrin.h> > +#include <x86intrin.h> > > #ifdef __cplusplus > extern "C" { > #endif > > -#include "generic/rte_memcpy.h" > +/** > + * Copy bytes from one location to another. The locations must not overlap. > + * > + * @note This is implemented as a macro, so it's address should not be taken > + * and care is needed as parameter expressions may be evaluated multiple times. > + * > + * @param dst > + * Pointer to the destination of the data. > + * @param src > + * Pointer to the source data. > + * @param n > + * Number of bytes to copy. > + * @return > + * Pointer to the destination data. > + */ > +static inline void * > +rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline)); > > -#ifdef __INTEL_COMPILER > -#pragma warning(disable:593) /* Stop unused variable warning (reg_a etc). */ > -#endif > +#ifdef RTE_MACHINE_CPUFLAG_AVX2 > > +/** > + * AVX2 implementation below > + */ > + > +/** > + * Copy 16 bytes from one location to another, > + * locations should not overlap. 
> + */ > static inline void > rte_mov16(uint8_t *dst, const uint8_t *src) > { > - __m128i reg_a; > - asm volatile ( > - "movdqu (%[src]), %[reg_a]\n\t" > - "movdqu %[reg_a], (%[dst])\n\t" > - : [reg_a] "=x" (reg_a) > - : [src] "r" (src), > - [dst] "r"(dst) > - : "memory" > - ); > + __m128i xmm0; > + > + xmm0 = _mm_loadu_si128((const __m128i *)src); > + _mm_storeu_si128((__m128i *)dst, xmm0); > } > > +/** > + * Copy 32 bytes from one location to another, > + * locations should not overlap. > + */ > static inline void > rte_mov32(uint8_t *dst, const uint8_t *src) > { > - __m128i reg_a, reg_b; > - asm volatile ( > - "movdqu (%[src]), %[reg_a]\n\t" > - "movdqu 16(%[src]), %[reg_b]\n\t" > - "movdqu %[reg_a], (%[dst])\n\t" > - "movdqu %[reg_b], 16(%[dst])\n\t" > - : [reg_a] "=x" (reg_a), > - [reg_b] "=x" (reg_b) > - : [src] "r" (src), > - [dst] "r"(dst) > - : "memory" > - ); > -} > + __m256i ymm0; > > -static inline void > -rte_mov48(uint8_t *dst, const uint8_t *src) > -{ > - __m128i reg_a, reg_b, reg_c; > - asm volatile ( > - "movdqu (%[src]), %[reg_a]\n\t" > - "movdqu 16(%[src]), %[reg_b]\n\t" > - "movdqu 32(%[src]), %[reg_c]\n\t" > - "movdqu %[reg_a], (%[dst])\n\t" > - "movdqu %[reg_b], 16(%[dst])\n\t" > - "movdqu %[reg_c], 32(%[dst])\n\t" > - : [reg_a] "=x" (reg_a), > - [reg_b] "=x" (reg_b), > - [reg_c] "=x" (reg_c) > - : [src] "r" (src), > - [dst] "r"(dst) > - : "memory" > - ); > + ymm0 = _mm256_loadu_si256((const __m256i *)src); > + _mm256_storeu_si256((__m256i *)dst, ymm0); > } > > +/** > + * Copy 64 bytes from one location to another, > + * locations should not overlap. > + */ > static inline void > rte_mov64(uint8_t *dst, const uint8_t *src) > { > - __m128i reg_a, reg_b, reg_c, reg_d; > - asm volatile ( > - "movdqu (%[src]), %[reg_a]\n\t" > - "movdqu 16(%[src]), %[reg_b]\n\t" > - "movdqu 32(%[src]), %[reg_c]\n\t" > - "movdqu 48(%[src]), %[reg_d]\n\t" > - "movdqu %[reg_a], (%[dst])\n\t" > - "movdqu %[reg_b], 16(%[dst])\n\t" > - "movdqu %[reg_c], 32(%[dst])\n\t" > - "movdqu %[reg_d], 48(%[dst])\n\t" > - : [reg_a] "=x" (reg_a), > - [reg_b] "=x" (reg_b), > - [reg_c] "=x" (reg_c), > - [reg_d] "=x" (reg_d) > - : [src] "r" (src), > - [dst] "r"(dst) > - : "memory" > - ); > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); > } > > +/** > + * Copy 128 bytes from one location to another, > + * locations should not overlap. 
> + */ > static inline void > rte_mov128(uint8_t *dst, const uint8_t *src) > { > - __m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h; > - asm volatile ( > - "movdqu (%[src]), %[reg_a]\n\t" > - "movdqu 16(%[src]), %[reg_b]\n\t" > - "movdqu 32(%[src]), %[reg_c]\n\t" > - "movdqu 48(%[src]), %[reg_d]\n\t" > - "movdqu 64(%[src]), %[reg_e]\n\t" > - "movdqu 80(%[src]), %[reg_f]\n\t" > - "movdqu 96(%[src]), %[reg_g]\n\t" > - "movdqu 112(%[src]), %[reg_h]\n\t" > - "movdqu %[reg_a], (%[dst])\n\t" > - "movdqu %[reg_b], 16(%[dst])\n\t" > - "movdqu %[reg_c], 32(%[dst])\n\t" > - "movdqu %[reg_d], 48(%[dst])\n\t" > - "movdqu %[reg_e], 64(%[dst])\n\t" > - "movdqu %[reg_f], 80(%[dst])\n\t" > - "movdqu %[reg_g], 96(%[dst])\n\t" > - "movdqu %[reg_h], 112(%[dst])\n\t" > - : [reg_a] "=x" (reg_a), > - [reg_b] "=x" (reg_b), > - [reg_c] "=x" (reg_c), > - [reg_d] "=x" (reg_d), > - [reg_e] "=x" (reg_e), > - [reg_f] "=x" (reg_f), > - [reg_g] "=x" (reg_g), > - [reg_h] "=x" (reg_h) > - : [src] "r" (src), > - [dst] "r"(dst) > - : "memory" > - ); > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); > + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32); > + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32); > } > > -#ifdef __INTEL_COMPILER > -#pragma warning(enable:593) > -#endif > - > +/** > + * Copy 256 bytes from one location to another, > + * locations should not overlap. > + */ > static inline void > rte_mov256(uint8_t *dst, const uint8_t *src) > { > - rte_mov128(dst, src); > - rte_mov128(dst + 128, src + 128); > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); > + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32); > + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32); > + rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32); > + rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32); > + rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32); > + rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32); > } > > -#define rte_memcpy(dst, src, n) \ > - ({ (__builtin_constant_p(n)) ? \ > - memcpy((dst), (src), (n)) : \ > - rte_memcpy_func((dst), (src), (n)); }) > +/** > + * Copy 64-byte blocks from one location to another, > + * locations should not overlap. > + */ > +static inline void > +rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n) > +{ > + __m256i ymm0, ymm1; > + > + while (n >= 64) { > + ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32)); > + n -= 64; > + ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32)); > + src = (const uint8_t *)src + 64; > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0); > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1); > + dst = (uint8_t *)dst + 64; > + } > +} > + > +/** > + * Copy 256-byte blocks from one location to another, > + * locations should not overlap. 
> + */ > +static inline void > +rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n) > +{ > + __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7; > + > + while (n >= 256) { > + ymm0 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 0 * 32)); > + n -= 256; > + ymm1 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 1 * 32)); > + ymm2 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 2 * 32)); > + ymm3 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 3 * 32)); > + ymm4 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 4 * 32)); > + ymm5 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 5 * 32)); > + ymm6 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 6 * 32)); > + ymm7 = _mm256_loadu_si256((const __m256i *)((const uint8_t *)src + 7 * 32)); > + src = (const uint8_t *)src + 256; > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), ymm0); > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), ymm1); > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), ymm2); > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), ymm3); > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), ymm4); > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), ymm5); > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), ymm6); > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), ymm7); > + dst = (uint8_t *)dst + 256; > + } > +} > > static inline void * > -rte_memcpy_func(void *dst, const void *src, size_t n) > +rte_memcpy(void *dst, const void *src, size_t n) > { > void *ret = dst; > + int dstofss; > + int bits; > > - /* We can't copy < 16 bytes using XMM registers so do it manually. */ > + /** > + * Copy less than 16 bytes > + */ > if (n < 16) { > if (n & 0x01) { > *(uint8_t *)dst = *(const uint8_t *)src; > - dst = (uint8_t *)dst + 1; > src = (const uint8_t *)src + 1; > + dst = (uint8_t *)dst + 1; > } > if (n & 0x02) { > *(uint16_t *)dst = *(const uint16_t *)src; > - dst = (uint16_t *)dst + 1; > src = (const uint16_t *)src + 1; > + dst = (uint16_t *)dst + 1; > } > if (n & 0x04) { > *(uint32_t *)dst = *(const uint32_t *)src; > - dst = (uint32_t *)dst + 1; > src = (const uint32_t *)src + 1; > + dst = (uint32_t *)dst + 1; > } > if (n & 0x08) { > *(uint64_t *)dst = *(const uint64_t *)src; > @@ -201,95 +224,410 @@ rte_memcpy_func(void *dst, const void *src, size_t n) > return ret; > } > > - /* Special fast cases for <= 128 bytes */ > + /** > + * Fast way when copy size doesn't exceed 512 bytes > + */ > if (n <= 32) { > rte_mov16((uint8_t *)dst, (const uint8_t *)src); > rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); > return ret; > } > - > if (n <= 64) { > rte_mov32((uint8_t *)dst, (const uint8_t *)src); > rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n); > return ret; > } > - > - if (n <= 128) { > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > - rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + n); > + if (n <= 512) { > + if (n >= 256) { > + n -= 256; > + rte_mov256((uint8_t *)dst, (const uint8_t *)src); > + src = (const uint8_t *)src + 256; > + dst = (uint8_t *)dst + 256; > + } > + if (n >= 128) { > + n -= 128; > + rte_mov128((uint8_t *)dst, (const uint8_t *)src); > + src = (const uint8_t *)src + 128; > + dst = (uint8_t *)dst + 128; > + } > + if (n >= 64) { > + n -= 64; > + rte_mov64((uint8_t *)dst, (const uint8_t *)src); > + src = (const uint8_t *)src + 64; > + dst = (uint8_t *)dst + 64; > + } > 
+COPY_BLOCK_64_BACK31: > + if (n > 32) { > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n); > + return ret; > + } > + if (n > 0) { > + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + n); > + } > return ret; > } > > - /* > - * For large copies > 128 bytes. This combination of 256, 64 and 16 byte > - * copies was found to be faster than doing 128 and 32 byte copies as > - * well. > + /** > + * Make store aligned when copy size exceeds 512 bytes > */ > - for ( ; n >= 256; n -= 256) { > - rte_mov256((uint8_t *)dst, (const uint8_t *)src); > - dst = (uint8_t *)dst + 256; > - src = (const uint8_t *)src + 256; > + dstofss = 32 - (int)((long long)(void *)dst & 0x1F); > + n -= dstofss; > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > + src = (const uint8_t *)src + dstofss; > + dst = (uint8_t *)dst + dstofss; > + > + /** > + * Copy 256-byte blocks. > + * Use copy block function for better instruction order control, > + * which is important when load is unaligned. > + */ > + rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n); > + bits = n; > + n = n & 255; > + bits -= n; > + src = (const uint8_t *)src + bits; > + dst = (uint8_t *)dst + bits; > + > + /** > + * Copy 64-byte blocks. > + * Use copy block function for better instruction order control, > + * which is important when load is unaligned. > + */ > + if (n >= 64) { > + rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n); > + bits = n; > + n = n & 63; > + bits -= n; > + src = (const uint8_t *)src + bits; > + dst = (uint8_t *)dst + bits; > } > > - /* > - * We split the remaining bytes (which will be less than 256) into > - * 64byte (2^6) chunks. > - * Using incrementing integers in the case labels of a switch statement > - * enourages the compiler to use a jump table. To get incrementing > - * integers, we shift the 2 relevant bits to the LSB position to first > - * get decrementing integers, and then subtract. > + /** > + * Copy whatever left > */ > - switch (3 - (n >> 6)) { > - case 0x00: > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > - n -= 64; > - dst = (uint8_t *)dst + 64; > - src = (const uint8_t *)src + 64; /* fallthrough */ > - case 0x01: > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > - n -= 64; > - dst = (uint8_t *)dst + 64; > - src = (const uint8_t *)src + 64; /* fallthrough */ > - case 0x02: > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > - n -= 64; > - dst = (uint8_t *)dst + 64; > - src = (const uint8_t *)src + 64; /* fallthrough */ > - default: > - ; > + goto COPY_BLOCK_64_BACK31; > +} > + > +#else /* RTE_MACHINE_CPUFLAG_AVX2 */ > + > +/** > + * SSE & AVX implementation below > + */ > + > +/** > + * Copy 16 bytes from one location to another, > + * locations should not overlap. > + */ > +static inline void > +rte_mov16(uint8_t *dst, const uint8_t *src) > +{ > + __m128i xmm0; > + > + xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src); > + _mm_storeu_si128((__m128i *)dst, xmm0); > +} > + > +/** > + * Copy 32 bytes from one location to another, > + * locations should not overlap. > + */ > +static inline void > +rte_mov32(uint8_t *dst, const uint8_t *src) > +{ > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > +} > + > +/** > + * Copy 64 bytes from one location to another, > + * locations should not overlap. 
> + */ > +static inline void > +rte_mov64(uint8_t *dst, const uint8_t *src) > +{ > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); > +} > + > +/** > + * Copy 128 bytes from one location to another, > + * locations should not overlap. > + */ > +static inline void > +rte_mov128(uint8_t *dst, const uint8_t *src) > +{ > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); > + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16); > + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16); > + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16); > + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); > +} > + > +/** > + * Copy 256 bytes from one location to another, > + * locations should not overlap. > + */ > +static inline void > +rte_mov256(uint8_t *dst, const uint8_t *src) > +{ > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); > + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16); > + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16); > + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16); > + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); > + rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16); > + rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16); > + rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16); > + rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16); > + rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16); > + rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16); > + rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16); > + rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16); > +} > + > +/** > + * Macro for copying unaligned block from one location to another with constant load offset, > + * 47 bytes leftover maximum, > + * locations should not overlap. 
> + * Requirements: > + * - Store is aligned > + * - Load offset is <offset>, which must be immediate value within [1, 15] > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading > + * - <dst>, <src>, <len> must be variables > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined > + */ > +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset) \ > +({ \ > + int tmp; \ > + while (len >= 128 + 16 - offset) { \ > + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \ > + len -= 128; \ > + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \ > + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \ > + xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 3 * 16)); \ > + xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 4 * 16)); \ > + xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 5 * 16)); \ > + xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 6 * 16)); \ > + xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \ > + xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \ > + src = (const uint8_t *)src + 128; \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \ > + dst = (uint8_t *)dst + 128; \ > + } \ > + tmp = len; \ > + len = ((len - 16 + offset) & 127) + 16 - offset; \ > + tmp -= len; \ > + src = (const uint8_t *)src + tmp; \ > + dst = (uint8_t *)dst + tmp; \ > + if (len >= 32 + 16 - offset) { \ > + while (len >= 32 + 16 - offset) { \ > + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 0 * 16)); \ > + len -= 32; \ > + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 1 * 16)); \ > + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 2 * 16)); \ > + src = (const uint8_t *)src + 32; \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \ > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \ > + dst = (uint8_t *)dst + 32; \ > + } \ > + tmp = len; \ > + len = ((len - 16 + offset) & 31) + 16 - offset; \ > + tmp -= len; \ > + src = (const uint8_t *)src + tmp; \ > + dst = (uint8_t *)dst + tmp; \ > + } \ > +}) > + > +/** > + * Macro for copying unaligned block from one location to another, > + * 47 bytes leftover maximum, > + * locations should not overlap. > + * Use switch here because the aligning instruction requires immediate value for shift count. 
> + * Requirements: > + * - Store is aligned > + * - Load offset is <offset>, which must be within [1, 15] > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit forwards are available for loading > + * - <dst>, <src>, <len> must be variables > + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM must be pre-defined > + */ > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \ > +({ \ > + switch (offset) { \ > + case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break; \ > + case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break; \ > + case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break; \ > + case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break; \ > + case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break; \ > + case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break; \ > + case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break; \ > + case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break; \ > + case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break; \ > + case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break; \ > + case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break; \ > + case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break; \ > + case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break; \ > + case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break; \ > + case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break; \ > + default:; \ > + } \ > +}) We probably didn't understand each other properly. My thought was to do something like: #define MOVEUNALIGNED_128(offset) do { \ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \ _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \ } while(0); Then at MOVEUNALIGNED_LEFT47_IMM: .... xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 * 16)); \ xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 * 16)); \ src = (const uint8_t *)src + 128; switch(offset) { case 0x01: MOVEUNALIGNED_128(1); break; ... case 0x0f: MOVEUNALIGNED_128(f); break; } ... dst = (uint8_t *)dst + 128 An then in rte_memcpy() you don't need a switch for MOVEUNALIGNED_LEFT47_IMM itself. Thought it might help to generate smaller code for rte_ememcpy() while keeping performance reasonably high. 
Konstantin > + > +static inline void * > +rte_memcpy(void *dst, const void *src, size_t n) > +{ > + __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8; > + void *ret = dst; > + int dstofss; > + int srcofs; > + > + /** > + * Copy less than 16 bytes > + */ > + if (n < 16) { > + if (n & 0x01) { > + *(uint8_t *)dst = *(const uint8_t *)src; > + src = (const uint8_t *)src + 1; > + dst = (uint8_t *)dst + 1; > + } > + if (n & 0x02) { > + *(uint16_t *)dst = *(const uint16_t *)src; > + src = (const uint16_t *)src + 1; > + dst = (uint16_t *)dst + 1; > + } > + if (n & 0x04) { > + *(uint32_t *)dst = *(const uint32_t *)src; > + src = (const uint32_t *)src + 1; > + dst = (uint32_t *)dst + 1; > + } > + if (n & 0x08) { > + *(uint64_t *)dst = *(const uint64_t *)src; > + } > + return ret; > } > > - /* > - * We split the remaining bytes (which will be less than 64) into > - * 16byte (2^4) chunks, using the same switch structure as above. > + /** > + * Fast way when copy size doesn't exceed 512 bytes > */ > - switch (3 - (n >> 4)) { > - case 0x00: > - rte_mov16((uint8_t *)dst, (const uint8_t *)src); > - n -= 16; > - dst = (uint8_t *)dst + 16; > - src = (const uint8_t *)src + 16; /* fallthrough */ > - case 0x01: > - rte_mov16((uint8_t *)dst, (const uint8_t *)src); > - n -= 16; > - dst = (uint8_t *)dst + 16; > - src = (const uint8_t *)src + 16; /* fallthrough */ > - case 0x02: > + if (n <= 32) { > rte_mov16((uint8_t *)dst, (const uint8_t *)src); > - n -= 16; > - dst = (uint8_t *)dst + 16; > - src = (const uint8_t *)src + 16; /* fallthrough */ > - default: > - ; > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); > + return ret; > } > - > - /* Copy any remaining bytes, without going beyond end of buffers */ > - if (n != 0) { > + if (n <= 48) { > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); > + return ret; > + } > + if (n <= 64) { > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > + rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32); > rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); > + return ret; > + } > + if (n <= 128) { > + goto COPY_BLOCK_128_BACK15; > } > - return ret; > + if (n <= 512) { > + if (n >= 256) { > + n -= 256; > + rte_mov128((uint8_t *)dst, (const uint8_t *)src); > + rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src + 128); > + src = (const uint8_t *)src + 256; > + dst = (uint8_t *)dst + 256; > + } > +COPY_BLOCK_255_BACK15: > + if (n >= 128) { > + n -= 128; > + rte_mov128((uint8_t *)dst, (const uint8_t *)src); > + src = (const uint8_t *)src + 128; > + dst = (uint8_t *)dst + 128; > + } > +COPY_BLOCK_128_BACK15: > + if (n >= 64) { > + n -= 64; > + rte_mov64((uint8_t *)dst, (const uint8_t *)src); > + src = (const uint8_t *)src + 64; > + dst = (uint8_t *)dst + 64; > + } > +COPY_BLOCK_64_BACK15: > + if (n >= 32) { > + n -= 32; > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > + src = (const uint8_t *)src + 32; > + dst = (uint8_t *)dst + 32; > + } > + if (n > 16) { > + rte_mov16((uint8_t *)dst, (const uint8_t *)src); > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); > + return ret; > + } > + if (n > 0) { > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + n); > + } > + return ret; > + } > + > + /** > + * Make store aligned when copy size exceeds 512 bytes, > + * and make sure the first 15 bytes are copied, because > + * unaligned copy functions require up to 15 bytes > + * backwards access. 
> + */ > + dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16; > + n -= dstofss; > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > + src = (const uint8_t *)src + dstofss; > + dst = (uint8_t *)dst + dstofss; > + srcofs = (int)((long long)(const void *)src & 0x0F); > + > + /** > + * For aligned copy > + */ > + if (srcofs == 0) { > + /** > + * Copy 256-byte blocks > + */ > + for (; n >= 256; n -= 256) { > + rte_mov256((uint8_t *)dst, (const uint8_t *)src); > + dst = (uint8_t *)dst + 256; > + src = (const uint8_t *)src + 256; > + } > + > + /** > + * Copy whatever left > + */ > + goto COPY_BLOCK_255_BACK15; > + } > + > + /** > + * For copy with unaligned load > + */ > + MOVEUNALIGNED_LEFT47(dst, src, n, srcofs); > + > + /** > + * Copy whatever left > + */ > + goto COPY_BLOCK_64_BACK15; > } > > +#endif /* RTE_MACHINE_CPUFLAG_AVX2 */ > + > #ifdef __cplusplus > } > #endif > -- > 1.9.3 ^ permalink raw reply [flat|nested] 12+ messages in thread
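For readers following the exchange above: the reason both the patch's MOVEUNALIGNED_LEFT47 and the suggested MOVEUNALIGNED_128 variant need a switch somewhere is that _mm_alignr_epi8 (PALIGNR) takes its byte-shift count as a compile-time immediate, so a runtime source offset has to be mapped onto constant cases. The sketch below shows only that dispatch pattern in isolation; the helper names (COPY16_SHIFTED, copy16_with_offset) are illustrative assumptions and are not part of the patch. Build with -mssse3 or -march=native.

/*
 * Minimal sketch, assuming nothing beyond SSSE3: copy the 16 source bytes
 * that start at src_aligned + off (off in [0, 15]) to dst, using two
 * aligned-friendly loads and PALIGNR. Because the shift count must be an
 * immediate, the runtime offset is dispatched through a switch whose
 * cases expand the operation with constants, exactly the structure the
 * patch and the review comment are debating.
 */
#include <stdint.h>
#include <tmmintrin.h>	/* SSSE3: _mm_alignr_epi8 */

/* Store 16 bytes built from two consecutive loads, shifted by the
 * immediate OFF. */
#define COPY16_SHIFTED(dst, lo, hi, OFF) \
	_mm_storeu_si128((__m128i *)(dst), _mm_alignr_epi8((hi), (lo), (OFF)))

static inline void
copy16_with_offset(uint8_t *dst, const uint8_t *src_aligned, int off)
{
	/* Two 16-byte loads straddling the bytes at src_aligned + off. */
	__m128i lo = _mm_loadu_si128((const __m128i *)src_aligned);
	__m128i hi = _mm_loadu_si128((const __m128i *)(src_aligned + 16));

	switch (off) {	/* runtime offset -> compile-time immediate */
	case 1:  COPY16_SHIFTED(dst, lo, hi, 1);  break;
	case 2:  COPY16_SHIFTED(dst, lo, hi, 2);  break;
	case 3:  COPY16_SHIFTED(dst, lo, hi, 3);  break;
	case 4:  COPY16_SHIFTED(dst, lo, hi, 4);  break;
	case 5:  COPY16_SHIFTED(dst, lo, hi, 5);  break;
	case 6:  COPY16_SHIFTED(dst, lo, hi, 6);  break;
	case 7:  COPY16_SHIFTED(dst, lo, hi, 7);  break;
	case 8:  COPY16_SHIFTED(dst, lo, hi, 8);  break;
	case 9:  COPY16_SHIFTED(dst, lo, hi, 9);  break;
	case 10: COPY16_SHIFTED(dst, lo, hi, 10); break;
	case 11: COPY16_SHIFTED(dst, lo, hi, 11); break;
	case 12: COPY16_SHIFTED(dst, lo, hi, 12); break;
	case 13: COPY16_SHIFTED(dst, lo, hi, 13); break;
	case 14: COPY16_SHIFTED(dst, lo, hi, 14); break;
	case 15: COPY16_SHIFTED(dst, lo, hi, 15); break;
	default: /* off == 0: already aligned, no shift needed */
		_mm_storeu_si128((__m128i *)dst, lo);
		break;
	}
}

The two layouts in the thread differ only in where this switch sits: the patch expands the whole copy loop once per immediate (larger code, a single dispatch before the loop), while the suggestion keeps one loop body and dispatches inside it (smaller code, one dispatch per 128-byte iteration). That trade-off is what the follow-up measurements below are about.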
* Re: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms 2015-01-29 15:17 ` Ananyev, Konstantin @ 2015-01-30 5:57 ` Wang, Zhihong 2015-01-30 10:44 ` Ananyev, Konstantin 0 siblings, 1 reply; 12+ messages in thread From: Wang, Zhihong @ 2015-01-30 5:57 UTC (permalink / raw) To: Ananyev, Konstantin, dev Hey Konstantin, This method does reduce code size but lead to significant performance drop. I think we need to keep the original code. Thanks Zhihong (John) > -----Original Message----- > From: Ananyev, Konstantin > Sent: Thursday, January 29, 2015 11:18 PM > To: Wang, Zhihong; dev@dpdk.org > Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in > arch/x86/rte_memcpy.h for both SSE and AVX platforms > > Hi Zhihong, > > > -----Original Message----- > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang > > Sent: Thursday, January 29, 2015 2:39 AM > > To: dev@dpdk.org > > Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in > > arch/x86/rte_memcpy.h for both SSE and AVX platforms > > > > Main code changes: > > > > 1. Differentiate architectural features based on CPU flags > > > > a. Implement separated move functions for SSE/AVX/AVX2 to make > > full utilization of cache bandwidth > > > > b. Implement separated copy flow specifically optimized for target > > architecture > > > > 2. Rewrite the memcpy function "rte_memcpy" > > > > a. Add store aligning > > > > b. Add load aligning based on architectural features > > > > c. Put block copy loop into inline move functions for better > > control of instruction order > > > > d. Eliminate unnecessary MOVs > > > > 3. Rewrite the inline move functions > > > > a. Add move functions for unaligned load cases > > > > b. Change instruction order in copy loops for better pipeline > > utilization > > > > c. Use intrinsics instead of assembly code > > > > 4. Remove slow glibc call for constant copies > > > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> > > --- > > .../common/include/arch/x86/rte_memcpy.h | 680 > +++++++++++++++------ > > 1 file changed, 509 insertions(+), 171 deletions(-) > > > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > > b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > > index fb9eba8..7b2d382 100644 > > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > > @@ -34,166 +34,189 @@ > > #ifndef _RTE_MEMCPY_X86_64_H_ > > #define _RTE_MEMCPY_X86_64_H_ > > > > +/** > > + * @file > > + * > > + * Functions for SSE/AVX/AVX2 implementation of memcpy(). > > + */ > > + > > +#include <stdio.h> > > #include <stdint.h> > > #include <string.h> > > -#include <emmintrin.h> > > +#include <x86intrin.h> > > > > #ifdef __cplusplus > > extern "C" { > > #endif > > > > -#include "generic/rte_memcpy.h" > > +/** > > + * Copy bytes from one location to another. The locations must not > overlap. > > + * > > + * @note This is implemented as a macro, so it's address should not > > +be taken > > + * and care is needed as parameter expressions may be evaluated > multiple times. > > + * > > + * @param dst > > + * Pointer to the destination of the data. > > + * @param src > > + * Pointer to the source data. > > + * @param n > > + * Number of bytes to copy. > > + * @return > > + * Pointer to the destination data. 
> > + */ > > +static inline void * > > +rte_memcpy(void *dst, const void *src, size_t n) > > +__attribute__((always_inline)); > > > > -#ifdef __INTEL_COMPILER > > -#pragma warning(disable:593) /* Stop unused variable warning (reg_a > > etc). */ -#endif > > +#ifdef RTE_MACHINE_CPUFLAG_AVX2 > > > > +/** > > + * AVX2 implementation below > > + */ > > + > > +/** > > + * Copy 16 bytes from one location to another, > > + * locations should not overlap. > > + */ > > static inline void > > rte_mov16(uint8_t *dst, const uint8_t *src) { > > - __m128i reg_a; > > - asm volatile ( > > - "movdqu (%[src]), %[reg_a]\n\t" > > - "movdqu %[reg_a], (%[dst])\n\t" > > - : [reg_a] "=x" (reg_a) > > - : [src] "r" (src), > > - [dst] "r"(dst) > > - : "memory" > > - ); > > + __m128i xmm0; > > + > > + xmm0 = _mm_loadu_si128((const __m128i *)src); > > + _mm_storeu_si128((__m128i *)dst, xmm0); > > } > > > > +/** > > + * Copy 32 bytes from one location to another, > > + * locations should not overlap. > > + */ > > static inline void > > rte_mov32(uint8_t *dst, const uint8_t *src) { > > - __m128i reg_a, reg_b; > > - asm volatile ( > > - "movdqu (%[src]), %[reg_a]\n\t" > > - "movdqu 16(%[src]), %[reg_b]\n\t" > > - "movdqu %[reg_a], (%[dst])\n\t" > > - "movdqu %[reg_b], 16(%[dst])\n\t" > > - : [reg_a] "=x" (reg_a), > > - [reg_b] "=x" (reg_b) > > - : [src] "r" (src), > > - [dst] "r"(dst) > > - : "memory" > > - ); > > -} > > + __m256i ymm0; > > > > -static inline void > > -rte_mov48(uint8_t *dst, const uint8_t *src) -{ > > - __m128i reg_a, reg_b, reg_c; > > - asm volatile ( > > - "movdqu (%[src]), %[reg_a]\n\t" > > - "movdqu 16(%[src]), %[reg_b]\n\t" > > - "movdqu 32(%[src]), %[reg_c]\n\t" > > - "movdqu %[reg_a], (%[dst])\n\t" > > - "movdqu %[reg_b], 16(%[dst])\n\t" > > - "movdqu %[reg_c], 32(%[dst])\n\t" > > - : [reg_a] "=x" (reg_a), > > - [reg_b] "=x" (reg_b), > > - [reg_c] "=x" (reg_c) > > - : [src] "r" (src), > > - [dst] "r"(dst) > > - : "memory" > > - ); > > + ymm0 = _mm256_loadu_si256((const __m256i *)src); > > + _mm256_storeu_si256((__m256i *)dst, ymm0); > > } > > > > +/** > > + * Copy 64 bytes from one location to another, > > + * locations should not overlap. > > + */ > > static inline void > > rte_mov64(uint8_t *dst, const uint8_t *src) { > > - __m128i reg_a, reg_b, reg_c, reg_d; > > - asm volatile ( > > - "movdqu (%[src]), %[reg_a]\n\t" > > - "movdqu 16(%[src]), %[reg_b]\n\t" > > - "movdqu 32(%[src]), %[reg_c]\n\t" > > - "movdqu 48(%[src]), %[reg_d]\n\t" > > - "movdqu %[reg_a], (%[dst])\n\t" > > - "movdqu %[reg_b], 16(%[dst])\n\t" > > - "movdqu %[reg_c], 32(%[dst])\n\t" > > - "movdqu %[reg_d], 48(%[dst])\n\t" > > - : [reg_a] "=x" (reg_a), > > - [reg_b] "=x" (reg_b), > > - [reg_c] "=x" (reg_c), > > - [reg_d] "=x" (reg_d) > > - : [src] "r" (src), > > - [dst] "r"(dst) > > - : "memory" > > - ); > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); > > } > > > > +/** > > + * Copy 128 bytes from one location to another, > > + * locations should not overlap. 
> > + */ > > static inline void > > rte_mov128(uint8_t *dst, const uint8_t *src) { > > - __m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h; > > - asm volatile ( > > - "movdqu (%[src]), %[reg_a]\n\t" > > - "movdqu 16(%[src]), %[reg_b]\n\t" > > - "movdqu 32(%[src]), %[reg_c]\n\t" > > - "movdqu 48(%[src]), %[reg_d]\n\t" > > - "movdqu 64(%[src]), %[reg_e]\n\t" > > - "movdqu 80(%[src]), %[reg_f]\n\t" > > - "movdqu 96(%[src]), %[reg_g]\n\t" > > - "movdqu 112(%[src]), %[reg_h]\n\t" > > - "movdqu %[reg_a], (%[dst])\n\t" > > - "movdqu %[reg_b], 16(%[dst])\n\t" > > - "movdqu %[reg_c], 32(%[dst])\n\t" > > - "movdqu %[reg_d], 48(%[dst])\n\t" > > - "movdqu %[reg_e], 64(%[dst])\n\t" > > - "movdqu %[reg_f], 80(%[dst])\n\t" > > - "movdqu %[reg_g], 96(%[dst])\n\t" > > - "movdqu %[reg_h], 112(%[dst])\n\t" > > - : [reg_a] "=x" (reg_a), > > - [reg_b] "=x" (reg_b), > > - [reg_c] "=x" (reg_c), > > - [reg_d] "=x" (reg_d), > > - [reg_e] "=x" (reg_e), > > - [reg_f] "=x" (reg_f), > > - [reg_g] "=x" (reg_g), > > - [reg_h] "=x" (reg_h) > > - : [src] "r" (src), > > - [dst] "r"(dst) > > - : "memory" > > - ); > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); > > + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32); > > + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32); > > } > > > > -#ifdef __INTEL_COMPILER > > -#pragma warning(enable:593) > > -#endif > > - > > +/** > > + * Copy 256 bytes from one location to another, > > + * locations should not overlap. > > + */ > > static inline void > > rte_mov256(uint8_t *dst, const uint8_t *src) { > > - rte_mov128(dst, src); > > - rte_mov128(dst + 128, src + 128); > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); > > + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32); > > + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32); > > + rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32); > > + rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32); > > + rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32); > > + rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32); > > } > > > > -#define rte_memcpy(dst, src, n) \ > > - ({ (__builtin_constant_p(n)) ? \ > > - memcpy((dst), (src), (n)) : \ > > - rte_memcpy_func((dst), (src), (n)); }) > > +/** > > + * Copy 64-byte blocks from one location to another, > > + * locations should not overlap. > > + */ > > +static inline void > > +rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n) { > > + __m256i ymm0, ymm1; > > + > > + while (n >= 64) { > > + ymm0 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 0 * 32)); > > + n -= 64; > > + ymm1 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 1 * 32)); > > + src = (const uint8_t *)src + 64; > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), > ymm0); > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), > ymm1); > > + dst = (uint8_t *)dst + 64; > > + } > > +} > > + > > +/** > > + * Copy 256-byte blocks from one location to another, > > + * locations should not overlap. 
> > + */ > > +static inline void > > +rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n) { > > + __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7; > > + > > + while (n >= 256) { > > + ymm0 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 0 * 32)); > > + n -= 256; > > + ymm1 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 1 * 32)); > > + ymm2 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 2 * 32)); > > + ymm3 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 3 * 32)); > > + ymm4 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 4 * 32)); > > + ymm5 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 5 * 32)); > > + ymm6 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 6 * 32)); > > + ymm7 = _mm256_loadu_si256((const __m256i *)((const > uint8_t *)src + 7 * 32)); > > + src = (const uint8_t *)src + 256; > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), > ymm0); > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), > ymm1); > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), > ymm2); > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), > ymm3); > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), > ymm4); > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), > ymm5); > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), > ymm6); > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), > ymm7); > > + dst = (uint8_t *)dst + 256; > > + } > > +} > > > > static inline void * > > -rte_memcpy_func(void *dst, const void *src, size_t n) > > +rte_memcpy(void *dst, const void *src, size_t n) > > { > > void *ret = dst; > > + int dstofss; > > + int bits; > > > > - /* We can't copy < 16 bytes using XMM registers so do it manually. 
*/ > > + /** > > + * Copy less than 16 bytes > > + */ > > if (n < 16) { > > if (n & 0x01) { > > *(uint8_t *)dst = *(const uint8_t *)src; > > - dst = (uint8_t *)dst + 1; > > src = (const uint8_t *)src + 1; > > + dst = (uint8_t *)dst + 1; > > } > > if (n & 0x02) { > > *(uint16_t *)dst = *(const uint16_t *)src; > > - dst = (uint16_t *)dst + 1; > > src = (const uint16_t *)src + 1; > > + dst = (uint16_t *)dst + 1; > > } > > if (n & 0x04) { > > *(uint32_t *)dst = *(const uint32_t *)src; > > - dst = (uint32_t *)dst + 1; > > src = (const uint32_t *)src + 1; > > + dst = (uint32_t *)dst + 1; > > } > > if (n & 0x08) { > > *(uint64_t *)dst = *(const uint64_t *)src; @@ -201,95 > +224,410 @@ > > rte_memcpy_func(void *dst, const void *src, size_t n) > > return ret; > > } > > > > - /* Special fast cases for <= 128 bytes */ > > + /** > > + * Fast way when copy size doesn't exceed 512 bytes > > + */ > > if (n <= 32) { > > rte_mov16((uint8_t *)dst, (const uint8_t *)src); > > rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + > n); > > return ret; > > } > > - > > if (n <= 64) { > > rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + > n); > > return ret; > > } > > - > > - if (n <= 128) { > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > - rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + > n); > > + if (n <= 512) { > > + if (n >= 256) { > > + n -= 256; > > + rte_mov256((uint8_t *)dst, (const uint8_t *)src); > > + src = (const uint8_t *)src + 256; > > + dst = (uint8_t *)dst + 256; > > + } > > + if (n >= 128) { > > + n -= 128; > > + rte_mov128((uint8_t *)dst, (const uint8_t *)src); > > + src = (const uint8_t *)src + 128; > > + dst = (uint8_t *)dst + 128; > > + } > > + if (n >= 64) { > > + n -= 64; > > + rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > + src = (const uint8_t *)src + 64; > > + dst = (uint8_t *)dst + 64; > > + } > > +COPY_BLOCK_64_BACK31: > > + if (n > 32) { > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t > *)src - 32 + n); > > + return ret; > > + } > > + if (n > 0) { > > + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t > *)src - 32 + n); > > + } > > return ret; > > } > > > > - /* > > - * For large copies > 128 bytes. This combination of 256, 64 and 16 > byte > > - * copies was found to be faster than doing 128 and 32 byte copies as > > - * well. > > + /** > > + * Make store aligned when copy size exceeds 512 bytes > > */ > > - for ( ; n >= 256; n -= 256) { > > - rte_mov256((uint8_t *)dst, (const uint8_t *)src); > > - dst = (uint8_t *)dst + 256; > > - src = (const uint8_t *)src + 256; > > + dstofss = 32 - (int)((long long)(void *)dst & 0x1F); > > + n -= dstofss; > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > + src = (const uint8_t *)src + dstofss; > > + dst = (uint8_t *)dst + dstofss; > > + > > + /** > > + * Copy 256-byte blocks. > > + * Use copy block function for better instruction order control, > > + * which is important when load is unaligned. > > + */ > > + rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n); > > + bits = n; > > + n = n & 255; > > + bits -= n; > > + src = (const uint8_t *)src + bits; > > + dst = (uint8_t *)dst + bits; > > + > > + /** > > + * Copy 64-byte blocks. > > + * Use copy block function for better instruction order control, > > + * which is important when load is unaligned. 
> > + */ > > + if (n >= 64) { > > + rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n); > > + bits = n; > > + n = n & 63; > > + bits -= n; > > + src = (const uint8_t *)src + bits; > > + dst = (uint8_t *)dst + bits; > > } > > > > - /* > > - * We split the remaining bytes (which will be less than 256) into > > - * 64byte (2^6) chunks. > > - * Using incrementing integers in the case labels of a switch > statement > > - * enourages the compiler to use a jump table. To get incrementing > > - * integers, we shift the 2 relevant bits to the LSB position to first > > - * get decrementing integers, and then subtract. > > + /** > > + * Copy whatever left > > */ > > - switch (3 - (n >> 6)) { > > - case 0x00: > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > - n -= 64; > > - dst = (uint8_t *)dst + 64; > > - src = (const uint8_t *)src + 64; /* fallthrough */ > > - case 0x01: > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > - n -= 64; > > - dst = (uint8_t *)dst + 64; > > - src = (const uint8_t *)src + 64; /* fallthrough */ > > - case 0x02: > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > - n -= 64; > > - dst = (uint8_t *)dst + 64; > > - src = (const uint8_t *)src + 64; /* fallthrough */ > > - default: > > - ; > > + goto COPY_BLOCK_64_BACK31; > > +} > > + > > +#else /* RTE_MACHINE_CPUFLAG_AVX2 */ > > + > > +/** > > + * SSE & AVX implementation below > > + */ > > + > > +/** > > + * Copy 16 bytes from one location to another, > > + * locations should not overlap. > > + */ > > +static inline void > > +rte_mov16(uint8_t *dst, const uint8_t *src) { > > + __m128i xmm0; > > + > > + xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src); > > + _mm_storeu_si128((__m128i *)dst, xmm0); } > > + > > +/** > > + * Copy 32 bytes from one location to another, > > + * locations should not overlap. > > + */ > > +static inline void > > +rte_mov32(uint8_t *dst, const uint8_t *src) { > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); } > > + > > +/** > > + * Copy 64 bytes from one location to another, > > + * locations should not overlap. > > + */ > > +static inline void > > +rte_mov64(uint8_t *dst, const uint8_t *src) { > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); > > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); } > > + > > +/** > > + * Copy 128 bytes from one location to another, > > + * locations should not overlap. > > + */ > > +static inline void > > +rte_mov128(uint8_t *dst, const uint8_t *src) { > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); > > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); > > + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16); > > + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16); > > + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16); > > + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); } > > + > > +/** > > + * Copy 256 bytes from one location to another, > > + * locations should not overlap. 
> > + */ > > +static inline void > > +rte_mov256(uint8_t *dst, const uint8_t *src) { > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); > > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); > > + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16); > > + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16); > > + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16); > > + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); > > + rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16); > > + rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16); > > + rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16); > > + rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16); > > + rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16); > > + rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16); > > + rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16); > > + rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16); > > +} > > + > > +/** > > + * Macro for copying unaligned block from one location to another > > +with constant load offset, > > + * 47 bytes leftover maximum, > > + * locations should not overlap. > > + * Requirements: > > + * - Store is aligned > > + * - Load offset is <offset>, which must be immediate value within > > +[1, 15] > > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit > > +forwards are available for loading > > + * - <dst>, <src>, <len> must be variables > > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined */ > > +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset) > \ > > +({ \ > > + int tmp; \ > > + while (len >= 128 + 16 - offset) { \ > > + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 0 * 16)); \ > > + len -= 128; \ > > + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 1 * 16)); \ > > + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 2 * 16)); \ > > + xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 3 * 16)); \ > > + xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 4 * 16)); \ > > + xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 5 * 16)); \ > > + xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 6 * 16)); \ > > + xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 7 * 16)); \ > > + xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 8 * 16)); \ > > + src = (const uint8_t *)src + 128; \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), > _mm_alignr_epi8(xmm1, xmm0, offset)); \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), > _mm_alignr_epi8(xmm2, xmm1, offset)); \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), > _mm_alignr_epi8(xmm3, xmm2, offset)); \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), > _mm_alignr_epi8(xmm4, xmm3, offset)); \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), > _mm_alignr_epi8(xmm5, xmm4, offset)); \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), > _mm_alignr_epi8(xmm6, xmm5, offset)); \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), > _mm_alignr_epi8(xmm7, 
xmm6, offset)); \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), > _mm_alignr_epi8(xmm8, xmm7, offset)); \ > > + dst = (uint8_t *)dst + 128; \ > > + } \ > > + tmp = len; \ > > + len = ((len - 16 + offset) & 127) + 16 - offset; \ > > + tmp -= len; \ > > + src = (const uint8_t *)src + tmp; \ > > + dst = (uint8_t *)dst + tmp; \ > > + if (len >= 32 + 16 - offset) { \ > > + while (len >= 32 + 16 - offset) { \ > > + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 0 * 16)); \ > > + len -= 32; \ > > + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 1 * 16)); \ > > + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > offset + 2 * 16)); \ > > + src = (const uint8_t *)src + 32; \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), > _mm_alignr_epi8(xmm1, xmm0, offset)); \ > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), > _mm_alignr_epi8(xmm2, xmm1, offset)); \ > > + dst = (uint8_t *)dst + 32; \ > > + } \ > > + tmp = len; \ > > + len = ((len - 16 + offset) & 31) + 16 - offset; \ > > + tmp -= len; \ > > + src = (const uint8_t *)src + tmp; \ > > + dst = (uint8_t *)dst + tmp; \ > > + } \ > > +}) > > + > > +/** > > + * Macro for copying unaligned block from one location to another, > > + * 47 bytes leftover maximum, > > + * locations should not overlap. > > + * Use switch here because the aligning instruction requires immediate > value for shift count. > > + * Requirements: > > + * - Store is aligned > > + * - Load offset is <offset>, which must be within [1, 15] > > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit > > +forwards are available for loading > > + * - <dst>, <src>, <len> must be variables > > + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM > must be > > +pre-defined */ > > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \ > > +({ \ > > + switch (offset) { \ > > + case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break; \ > > + case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break; \ > > + case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break; \ > > + case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break; \ > > + case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break; \ > > + case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break; \ > > + case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break; \ > > + case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break; \ > > + case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break; \ > > + case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break; \ > > + case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break; \ > > + case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break; \ > > + case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break; \ > > + case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break; \ > > + case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break; \ > > + default:; \ > > + } \ > > +}) > > We probably didn't understand each other properly. 
> My thought was to do something like: > > #define MOVEUNALIGNED_128(offset) do { \ > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), > _mm_alignr_epi8(xmm1, xmm0, offset)); \ > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), > _mm_alignr_epi8(xmm2, xmm1, offset)); \ > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), > _mm_alignr_epi8(xmm3, xmm2, offset)); \ > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), > _mm_alignr_epi8(xmm4, xmm3, offset)); \ > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), > _mm_alignr_epi8(xmm5, xmm4, offset)); \ > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), > _mm_alignr_epi8(xmm6, xmm5, offset)); \ > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), > _mm_alignr_epi8(xmm7, xmm6, offset)); \ > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), > _mm_alignr_epi8(xmm8, xmm7, offset)); \ > } while(0); > > > Then at MOVEUNALIGNED_LEFT47_IMM: > .... > xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 > * 16)); \ > xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 > * 16)); \ > src = (const uint8_t *)src + 128; > > switch(offset) { > case 0x01: MOVEUNALIGNED_128(1); break; > ... > case 0x0f: MOVEUNALIGNED_128(f); break; } ... > dst = (uint8_t *)dst + 128 > > An then in rte_memcpy() you don't need a switch for > MOVEUNALIGNED_LEFT47_IMM itself. > Thought it might help to generate smaller code for rte_ememcpy() while > keeping performance reasonably high. > > Konstantin > > > > + > > +static inline void * > > +rte_memcpy(void *dst, const void *src, size_t n) { > > + __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, > xmm8; > > + void *ret = dst; > > + int dstofss; > > + int srcofs; > > + > > + /** > > + * Copy less than 16 bytes > > + */ > > + if (n < 16) { > > + if (n & 0x01) { > > + *(uint8_t *)dst = *(const uint8_t *)src; > > + src = (const uint8_t *)src + 1; > > + dst = (uint8_t *)dst + 1; > > + } > > + if (n & 0x02) { > > + *(uint16_t *)dst = *(const uint16_t *)src; > > + src = (const uint16_t *)src + 1; > > + dst = (uint16_t *)dst + 1; > > + } > > + if (n & 0x04) { > > + *(uint32_t *)dst = *(const uint32_t *)src; > > + src = (const uint32_t *)src + 1; > > + dst = (uint32_t *)dst + 1; > > + } > > + if (n & 0x08) { > > + *(uint64_t *)dst = *(const uint64_t *)src; > > + } > > + return ret; > > } > > > > - /* > > - * We split the remaining bytes (which will be less than 64) into > > - * 16byte (2^4) chunks, using the same switch structure as above. 
> > + /** > > + * Fast way when copy size doesn't exceed 512 bytes > > */ > > - switch (3 - (n >> 4)) { > > - case 0x00: > > - rte_mov16((uint8_t *)dst, (const uint8_t *)src); > > - n -= 16; > > - dst = (uint8_t *)dst + 16; > > - src = (const uint8_t *)src + 16; /* fallthrough */ > > - case 0x01: > > - rte_mov16((uint8_t *)dst, (const uint8_t *)src); > > - n -= 16; > > - dst = (uint8_t *)dst + 16; > > - src = (const uint8_t *)src + 16; /* fallthrough */ > > - case 0x02: > > + if (n <= 32) { > > rte_mov16((uint8_t *)dst, (const uint8_t *)src); > > - n -= 16; > > - dst = (uint8_t *)dst + 16; > > - src = (const uint8_t *)src + 16; /* fallthrough */ > > - default: > > - ; > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + > n); > > + return ret; > > } > > - > > - /* Copy any remaining bytes, without going beyond end of buffers > */ > > - if (n != 0) { > > + if (n <= 48) { > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + > n); > > + return ret; > > + } > > + if (n <= 64) { > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > + rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32); > > rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + > n); > > + return ret; > > + } > > + if (n <= 128) { > > + goto COPY_BLOCK_128_BACK15; > > } > > - return ret; > > + if (n <= 512) { > > + if (n >= 256) { > > + n -= 256; > > + rte_mov128((uint8_t *)dst, (const uint8_t *)src); > > + rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src > + 128); > > + src = (const uint8_t *)src + 256; > > + dst = (uint8_t *)dst + 256; > > + } > > +COPY_BLOCK_255_BACK15: > > + if (n >= 128) { > > + n -= 128; > > + rte_mov128((uint8_t *)dst, (const uint8_t *)src); > > + src = (const uint8_t *)src + 128; > > + dst = (uint8_t *)dst + 128; > > + } > > +COPY_BLOCK_128_BACK15: > > + if (n >= 64) { > > + n -= 64; > > + rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > + src = (const uint8_t *)src + 64; > > + dst = (uint8_t *)dst + 64; > > + } > > +COPY_BLOCK_64_BACK15: > > + if (n >= 32) { > > + n -= 32; > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > + src = (const uint8_t *)src + 32; > > + dst = (uint8_t *)dst + 32; > > + } > > + if (n > 16) { > > + rte_mov16((uint8_t *)dst, (const uint8_t *)src); > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t > *)src - 16 + n); > > + return ret; > > + } > > + if (n > 0) { > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t > *)src - 16 + n); > > + } > > + return ret; > > + } > > + > > + /** > > + * Make store aligned when copy size exceeds 512 bytes, > > + * and make sure the first 15 bytes are copied, because > > + * unaligned copy functions require up to 15 bytes > > + * backwards access. 
> > + */ > > + dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16; > > + n -= dstofss; > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > + src = (const uint8_t *)src + dstofss; > > + dst = (uint8_t *)dst + dstofss; > > + srcofs = (int)((long long)(const void *)src & 0x0F); > > + > > + /** > > + * For aligned copy > > + */ > > + if (srcofs == 0) { > > + /** > > + * Copy 256-byte blocks > > + */ > > + for (; n >= 256; n -= 256) { > > + rte_mov256((uint8_t *)dst, (const uint8_t *)src); > > + dst = (uint8_t *)dst + 256; > > + src = (const uint8_t *)src + 256; > > + } > > + > > + /** > > + * Copy whatever left > > + */ > > + goto COPY_BLOCK_255_BACK15; > > + } > > + > > + /** > > + * For copy with unaligned load > > + */ > > + MOVEUNALIGNED_LEFT47(dst, src, n, srcofs); > > + > > + /** > > + * Copy whatever left > > + */ > > + goto COPY_BLOCK_64_BACK15; > > } > > > > +#endif /* RTE_MACHINE_CPUFLAG_AVX2 */ > > + > > #ifdef __cplusplus > > } > > #endif > > -- > > 1.9.3 ^ permalink raw reply [flat|nested] 12+ messages in thread
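For a rough standalone check of the code-size versus throughput trade-off discussed in this exchange, a harness of the following shape is enough. It is only a sketch: the buffer sizes, iteration count, barrier trick and __rdtsc()-based timing are illustrative assumptions, and plain memcpy() stands in for rte_memcpy() unless the patched rte_memcpy.h is included.

/*
 * Minimal cycle-count harness, not the series' own test application.
 * Compile standalone with gcc -O2; swap memcpy() for rte_memcpy() when
 * building inside a DPDK tree.
 */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>	/* __rdtsc() */

#define ITERATIONS 100000

static uint64_t
bench_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
	uint64_t start = __rdtsc();

	for (int i = 0; i < ITERATIONS; i++) {
		memcpy(dst, src, len);	/* candidate copy routine */
		/* Keep the compiler from eliminating repeated stores. */
		__asm__ __volatile__("" : : "r"(dst) : "memory");
	}
	return (__rdtsc() - start) / ITERATIONS;
}

int
main(void)
{
	/* Shifted source (src + 7) exercises the unaligned-load path that
	 * the PALIGNR discussion above is concerned with. */
	static const size_t sizes[] = { 32, 64, 128, 255, 512, 1024, 1518 };
	uint8_t *src = aligned_alloc(64, 4096);
	uint8_t *dst = aligned_alloc(64, 4096);

	if (src == NULL || dst == NULL)
		return 1;
	memset(src, 0xa5, 4096);

	for (size_t i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
		size_t len = sizes[i];
		printf("%4zu B: aligned %4lu cycles, src+7 %4lu cycles\n",
		       len,
		       (unsigned long)bench_copy(dst, src, len),
		       (unsigned long)bench_copy(dst, src + 7, len));
	}

	free(src);
	free(dst);
	return 0;
}

Running both code layouts through such a loop across aligned and shifted sources is the kind of comparison behind the "smaller code but significant performance drop" conclusion above.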
* Re: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms 2015-01-30 5:57 ` Wang, Zhihong @ 2015-01-30 10:44 ` Ananyev, Konstantin 0 siblings, 0 replies; 12+ messages in thread From: Ananyev, Konstantin @ 2015-01-30 10:44 UTC (permalink / raw) To: Wang, Zhihong, dev Hey Zhihong, > -----Original Message----- > From: Wang, Zhihong > Sent: Friday, January 30, 2015 5:57 AM > To: Ananyev, Konstantin; dev@dpdk.org > Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms > > Hey Konstantin, > > This method does reduce code size but lead to significant performance drop. > I think we need to keep the original code. Sure, no point to make it slower. Thanks for trying it anyway. Konstantin > > > Thanks > Zhihong (John) > > > > -----Original Message----- > > From: Ananyev, Konstantin > > Sent: Thursday, January 29, 2015 11:18 PM > > To: Wang, Zhihong; dev@dpdk.org > > Subject: RE: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in > > arch/x86/rte_memcpy.h for both SSE and AVX platforms > > > > Hi Zhihong, > > > > > -----Original Message----- > > > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang > > > Sent: Thursday, January 29, 2015 2:39 AM > > > To: dev@dpdk.org > > > Subject: [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in > > > arch/x86/rte_memcpy.h for both SSE and AVX platforms > > > > > > Main code changes: > > > > > > 1. Differentiate architectural features based on CPU flags > > > > > > a. Implement separated move functions for SSE/AVX/AVX2 to make > > > full utilization of cache bandwidth > > > > > > b. Implement separated copy flow specifically optimized for target > > > architecture > > > > > > 2. Rewrite the memcpy function "rte_memcpy" > > > > > > a. Add store aligning > > > > > > b. Add load aligning based on architectural features > > > > > > c. Put block copy loop into inline move functions for better > > > control of instruction order > > > > > > d. Eliminate unnecessary MOVs > > > > > > 3. Rewrite the inline move functions > > > > > > a. Add move functions for unaligned load cases > > > > > > b. Change instruction order in copy loops for better pipeline > > > utilization > > > > > > c. Use intrinsics instead of assembly code > > > > > > 4. Remove slow glibc call for constant copies > > > > > > Signed-off-by: Zhihong Wang <zhihong.wang@intel.com> > > > --- > > > .../common/include/arch/x86/rte_memcpy.h | 680 > > +++++++++++++++------ > > > 1 file changed, 509 insertions(+), 171 deletions(-) > > > > > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > > > b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > > > index fb9eba8..7b2d382 100644 > > > --- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > > > +++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h > > > @@ -34,166 +34,189 @@ > > > #ifndef _RTE_MEMCPY_X86_64_H_ > > > #define _RTE_MEMCPY_X86_64_H_ > > > > > > +/** > > > + * @file > > > + * > > > + * Functions for SSE/AVX/AVX2 implementation of memcpy(). > > > + */ > > > + > > > +#include <stdio.h> > > > #include <stdint.h> > > > #include <string.h> > > > -#include <emmintrin.h> > > > +#include <x86intrin.h> > > > > > > #ifdef __cplusplus > > > extern "C" { > > > #endif > > > > > > -#include "generic/rte_memcpy.h" > > > +/** > > > + * Copy bytes from one location to another. The locations must not > > overlap. 
> > > + * > > > + * @note This is implemented as a macro, so it's address should not > > > +be taken > > > + * and care is needed as parameter expressions may be evaluated > > multiple times. > > > + * > > > + * @param dst > > > + * Pointer to the destination of the data. > > > + * @param src > > > + * Pointer to the source data. > > > + * @param n > > > + * Number of bytes to copy. > > > + * @return > > > + * Pointer to the destination data. > > > + */ > > > +static inline void * > > > +rte_memcpy(void *dst, const void *src, size_t n) > > > +__attribute__((always_inline)); > > > > > > -#ifdef __INTEL_COMPILER > > > -#pragma warning(disable:593) /* Stop unused variable warning (reg_a > > > etc). */ -#endif > > > +#ifdef RTE_MACHINE_CPUFLAG_AVX2 > > > > > > +/** > > > + * AVX2 implementation below > > > + */ > > > + > > > +/** > > > + * Copy 16 bytes from one location to another, > > > + * locations should not overlap. > > > + */ > > > static inline void > > > rte_mov16(uint8_t *dst, const uint8_t *src) { > > > - __m128i reg_a; > > > - asm volatile ( > > > - "movdqu (%[src]), %[reg_a]\n\t" > > > - "movdqu %[reg_a], (%[dst])\n\t" > > > - : [reg_a] "=x" (reg_a) > > > - : [src] "r" (src), > > > - [dst] "r"(dst) > > > - : "memory" > > > - ); > > > + __m128i xmm0; > > > + > > > + xmm0 = _mm_loadu_si128((const __m128i *)src); > > > + _mm_storeu_si128((__m128i *)dst, xmm0); > > > } > > > > > > +/** > > > + * Copy 32 bytes from one location to another, > > > + * locations should not overlap. > > > + */ > > > static inline void > > > rte_mov32(uint8_t *dst, const uint8_t *src) { > > > - __m128i reg_a, reg_b; > > > - asm volatile ( > > > - "movdqu (%[src]), %[reg_a]\n\t" > > > - "movdqu 16(%[src]), %[reg_b]\n\t" > > > - "movdqu %[reg_a], (%[dst])\n\t" > > > - "movdqu %[reg_b], 16(%[dst])\n\t" > > > - : [reg_a] "=x" (reg_a), > > > - [reg_b] "=x" (reg_b) > > > - : [src] "r" (src), > > > - [dst] "r"(dst) > > > - : "memory" > > > - ); > > > -} > > > + __m256i ymm0; > > > > > > -static inline void > > > -rte_mov48(uint8_t *dst, const uint8_t *src) -{ > > > - __m128i reg_a, reg_b, reg_c; > > > - asm volatile ( > > > - "movdqu (%[src]), %[reg_a]\n\t" > > > - "movdqu 16(%[src]), %[reg_b]\n\t" > > > - "movdqu 32(%[src]), %[reg_c]\n\t" > > > - "movdqu %[reg_a], (%[dst])\n\t" > > > - "movdqu %[reg_b], 16(%[dst])\n\t" > > > - "movdqu %[reg_c], 32(%[dst])\n\t" > > > - : [reg_a] "=x" (reg_a), > > > - [reg_b] "=x" (reg_b), > > > - [reg_c] "=x" (reg_c) > > > - : [src] "r" (src), > > > - [dst] "r"(dst) > > > - : "memory" > > > - ); > > > + ymm0 = _mm256_loadu_si256((const __m256i *)src); > > > + _mm256_storeu_si256((__m256i *)dst, ymm0); > > > } > > > > > > +/** > > > + * Copy 64 bytes from one location to another, > > > + * locations should not overlap. 
> > > + */ > > > static inline void > > > rte_mov64(uint8_t *dst, const uint8_t *src) { > > > - __m128i reg_a, reg_b, reg_c, reg_d; > > > - asm volatile ( > > > - "movdqu (%[src]), %[reg_a]\n\t" > > > - "movdqu 16(%[src]), %[reg_b]\n\t" > > > - "movdqu 32(%[src]), %[reg_c]\n\t" > > > - "movdqu 48(%[src]), %[reg_d]\n\t" > > > - "movdqu %[reg_a], (%[dst])\n\t" > > > - "movdqu %[reg_b], 16(%[dst])\n\t" > > > - "movdqu %[reg_c], 32(%[dst])\n\t" > > > - "movdqu %[reg_d], 48(%[dst])\n\t" > > > - : [reg_a] "=x" (reg_a), > > > - [reg_b] "=x" (reg_b), > > > - [reg_c] "=x" (reg_c), > > > - [reg_d] "=x" (reg_d) > > > - : [src] "r" (src), > > > - [dst] "r"(dst) > > > - : "memory" > > > - ); > > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); > > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); > > > } > > > > > > +/** > > > + * Copy 128 bytes from one location to another, > > > + * locations should not overlap. > > > + */ > > > static inline void > > > rte_mov128(uint8_t *dst, const uint8_t *src) { > > > - __m128i reg_a, reg_b, reg_c, reg_d, reg_e, reg_f, reg_g, reg_h; > > > - asm volatile ( > > > - "movdqu (%[src]), %[reg_a]\n\t" > > > - "movdqu 16(%[src]), %[reg_b]\n\t" > > > - "movdqu 32(%[src]), %[reg_c]\n\t" > > > - "movdqu 48(%[src]), %[reg_d]\n\t" > > > - "movdqu 64(%[src]), %[reg_e]\n\t" > > > - "movdqu 80(%[src]), %[reg_f]\n\t" > > > - "movdqu 96(%[src]), %[reg_g]\n\t" > > > - "movdqu 112(%[src]), %[reg_h]\n\t" > > > - "movdqu %[reg_a], (%[dst])\n\t" > > > - "movdqu %[reg_b], 16(%[dst])\n\t" > > > - "movdqu %[reg_c], 32(%[dst])\n\t" > > > - "movdqu %[reg_d], 48(%[dst])\n\t" > > > - "movdqu %[reg_e], 64(%[dst])\n\t" > > > - "movdqu %[reg_f], 80(%[dst])\n\t" > > > - "movdqu %[reg_g], 96(%[dst])\n\t" > > > - "movdqu %[reg_h], 112(%[dst])\n\t" > > > - : [reg_a] "=x" (reg_a), > > > - [reg_b] "=x" (reg_b), > > > - [reg_c] "=x" (reg_c), > > > - [reg_d] "=x" (reg_d), > > > - [reg_e] "=x" (reg_e), > > > - [reg_f] "=x" (reg_f), > > > - [reg_g] "=x" (reg_g), > > > - [reg_h] "=x" (reg_h) > > > - : [src] "r" (src), > > > - [dst] "r"(dst) > > > - : "memory" > > > - ); > > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); > > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); > > > + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32); > > > + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32); > > > } > > > > > > -#ifdef __INTEL_COMPILER > > > -#pragma warning(enable:593) > > > -#endif > > > - > > > +/** > > > + * Copy 256 bytes from one location to another, > > > + * locations should not overlap. > > > + */ > > > static inline void > > > rte_mov256(uint8_t *dst, const uint8_t *src) { > > > - rte_mov128(dst, src); > > > - rte_mov128(dst + 128, src + 128); > > > + rte_mov32((uint8_t *)dst + 0 * 32, (const uint8_t *)src + 0 * 32); > > > + rte_mov32((uint8_t *)dst + 1 * 32, (const uint8_t *)src + 1 * 32); > > > + rte_mov32((uint8_t *)dst + 2 * 32, (const uint8_t *)src + 2 * 32); > > > + rte_mov32((uint8_t *)dst + 3 * 32, (const uint8_t *)src + 3 * 32); > > > + rte_mov32((uint8_t *)dst + 4 * 32, (const uint8_t *)src + 4 * 32); > > > + rte_mov32((uint8_t *)dst + 5 * 32, (const uint8_t *)src + 5 * 32); > > > + rte_mov32((uint8_t *)dst + 6 * 32, (const uint8_t *)src + 6 * 32); > > > + rte_mov32((uint8_t *)dst + 7 * 32, (const uint8_t *)src + 7 * 32); > > > } > > > > > > -#define rte_memcpy(dst, src, n) \ > > > - ({ (__builtin_constant_p(n)) ? 
\ > > > - memcpy((dst), (src), (n)) : \ > > > - rte_memcpy_func((dst), (src), (n)); }) > > > +/** > > > + * Copy 64-byte blocks from one location to another, > > > + * locations should not overlap. > > > + */ > > > +static inline void > > > +rte_mov64blocks(uint8_t *dst, const uint8_t *src, size_t n) { > > > + __m256i ymm0, ymm1; > > > + > > > + while (n >= 64) { > > > + ymm0 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 0 * 32)); > > > + n -= 64; > > > + ymm1 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 1 * 32)); > > > + src = (const uint8_t *)src + 64; > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), > > ymm0); > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), > > ymm1); > > > + dst = (uint8_t *)dst + 64; > > > + } > > > +} > > > + > > > +/** > > > + * Copy 256-byte blocks from one location to another, > > > + * locations should not overlap. > > > + */ > > > +static inline void > > > +rte_mov256blocks(uint8_t *dst, const uint8_t *src, size_t n) { > > > + __m256i ymm0, ymm1, ymm2, ymm3, ymm4, ymm5, ymm6, ymm7; > > > + > > > + while (n >= 256) { > > > + ymm0 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 0 * 32)); > > > + n -= 256; > > > + ymm1 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 1 * 32)); > > > + ymm2 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 2 * 32)); > > > + ymm3 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 3 * 32)); > > > + ymm4 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 4 * 32)); > > > + ymm5 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 5 * 32)); > > > + ymm6 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 6 * 32)); > > > + ymm7 = _mm256_loadu_si256((const __m256i *)((const > > uint8_t *)src + 7 * 32)); > > > + src = (const uint8_t *)src + 256; > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 0 * 32), > > ymm0); > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 1 * 32), > > ymm1); > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 2 * 32), > > ymm2); > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 3 * 32), > > ymm3); > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 4 * 32), > > ymm4); > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 5 * 32), > > ymm5); > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 6 * 32), > > ymm6); > > > + _mm256_storeu_si256((__m256i *)((uint8_t *)dst + 7 * 32), > > ymm7); > > > + dst = (uint8_t *)dst + 256; > > > + } > > > +} > > > > > > static inline void * > > > -rte_memcpy_func(void *dst, const void *src, size_t n) > > > +rte_memcpy(void *dst, const void *src, size_t n) > > > { > > > void *ret = dst; > > > + int dstofss; > > > + int bits; > > > > > > - /* We can't copy < 16 bytes using XMM registers so do it manually. 
*/ > > > + /** > > > + * Copy less than 16 bytes > > > + */ > > > if (n < 16) { > > > if (n & 0x01) { > > > *(uint8_t *)dst = *(const uint8_t *)src; > > > - dst = (uint8_t *)dst + 1; > > > src = (const uint8_t *)src + 1; > > > + dst = (uint8_t *)dst + 1; > > > } > > > if (n & 0x02) { > > > *(uint16_t *)dst = *(const uint16_t *)src; > > > - dst = (uint16_t *)dst + 1; > > > src = (const uint16_t *)src + 1; > > > + dst = (uint16_t *)dst + 1; > > > } > > > if (n & 0x04) { > > > *(uint32_t *)dst = *(const uint32_t *)src; > > > - dst = (uint32_t *)dst + 1; > > > src = (const uint32_t *)src + 1; > > > + dst = (uint32_t *)dst + 1; > > > } > > > if (n & 0x08) { > > > *(uint64_t *)dst = *(const uint64_t *)src; @@ -201,95 > > +224,410 @@ > > > rte_memcpy_func(void *dst, const void *src, size_t n) > > > return ret; > > > } > > > > > > - /* Special fast cases for <= 128 bytes */ > > > + /** > > > + * Fast way when copy size doesn't exceed 512 bytes > > > + */ > > > if (n <= 32) { > > > rte_mov16((uint8_t *)dst, (const uint8_t *)src); > > > rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + > > n); > > > return ret; > > > } > > > - > > > if (n <= 64) { > > > rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > > rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t *)src - 32 + > > n); > > > return ret; > > > } > > > - > > > - if (n <= 128) { > > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > > - rte_mov64((uint8_t *)dst - 64 + n, (const uint8_t *)src - 64 + > > n); > > > + if (n <= 512) { > > > + if (n >= 256) { > > > + n -= 256; > > > + rte_mov256((uint8_t *)dst, (const uint8_t *)src); > > > + src = (const uint8_t *)src + 256; > > > + dst = (uint8_t *)dst + 256; > > > + } > > > + if (n >= 128) { > > > + n -= 128; > > > + rte_mov128((uint8_t *)dst, (const uint8_t *)src); > > > + src = (const uint8_t *)src + 128; > > > + dst = (uint8_t *)dst + 128; > > > + } > > > + if (n >= 64) { > > > + n -= 64; > > > + rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > > + src = (const uint8_t *)src + 64; > > > + dst = (uint8_t *)dst + 64; > > > + } > > > +COPY_BLOCK_64_BACK31: > > > + if (n > 32) { > > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > > + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t > > *)src - 32 + n); > > > + return ret; > > > + } > > > + if (n > 0) { > > > + rte_mov32((uint8_t *)dst - 32 + n, (const uint8_t > > *)src - 32 + n); > > > + } > > > return ret; > > > } > > > > > > - /* > > > - * For large copies > 128 bytes. This combination of 256, 64 and 16 > > byte > > > - * copies was found to be faster than doing 128 and 32 byte copies as > > > - * well. > > > + /** > > > + * Make store aligned when copy size exceeds 512 bytes > > > */ > > > - for ( ; n >= 256; n -= 256) { > > > - rte_mov256((uint8_t *)dst, (const uint8_t *)src); > > > - dst = (uint8_t *)dst + 256; > > > - src = (const uint8_t *)src + 256; > > > + dstofss = 32 - (int)((long long)(void *)dst & 0x1F); > > > + n -= dstofss; > > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > > + src = (const uint8_t *)src + dstofss; > > > + dst = (uint8_t *)dst + dstofss; > > > + > > > + /** > > > + * Copy 256-byte blocks. > > > + * Use copy block function for better instruction order control, > > > + * which is important when load is unaligned. > > > + */ > > > + rte_mov256blocks((uint8_t *)dst, (const uint8_t *)src, n); > > > + bits = n; > > > + n = n & 255; > > > + bits -= n; > > > + src = (const uint8_t *)src + bits; > > > + dst = (uint8_t *)dst + bits; > > > + > > > + /** > > > + * Copy 64-byte blocks. 
> > > + * Use copy block function for better instruction order control, > > > + * which is important when load is unaligned. > > > + */ > > > + if (n >= 64) { > > > + rte_mov64blocks((uint8_t *)dst, (const uint8_t *)src, n); > > > + bits = n; > > > + n = n & 63; > > > + bits -= n; > > > + src = (const uint8_t *)src + bits; > > > + dst = (uint8_t *)dst + bits; > > > } > > > > > > - /* > > > - * We split the remaining bytes (which will be less than 256) into > > > - * 64byte (2^6) chunks. > > > - * Using incrementing integers in the case labels of a switch > > statement > > > - * enourages the compiler to use a jump table. To get incrementing > > > - * integers, we shift the 2 relevant bits to the LSB position to first > > > - * get decrementing integers, and then subtract. > > > + /** > > > + * Copy whatever left > > > */ > > > - switch (3 - (n >> 6)) { > > > - case 0x00: > > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > > - n -= 64; > > > - dst = (uint8_t *)dst + 64; > > > - src = (const uint8_t *)src + 64; /* fallthrough */ > > > - case 0x01: > > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > > - n -= 64; > > > - dst = (uint8_t *)dst + 64; > > > - src = (const uint8_t *)src + 64; /* fallthrough */ > > > - case 0x02: > > > - rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > > - n -= 64; > > > - dst = (uint8_t *)dst + 64; > > > - src = (const uint8_t *)src + 64; /* fallthrough */ > > > - default: > > > - ; > > > + goto COPY_BLOCK_64_BACK31; > > > +} > > > + > > > +#else /* RTE_MACHINE_CPUFLAG_AVX2 */ > > > + > > > +/** > > > + * SSE & AVX implementation below > > > + */ > > > + > > > +/** > > > + * Copy 16 bytes from one location to another, > > > + * locations should not overlap. > > > + */ > > > +static inline void > > > +rte_mov16(uint8_t *dst, const uint8_t *src) { > > > + __m128i xmm0; > > > + > > > + xmm0 = _mm_loadu_si128((const __m128i *)(const __m128i *)src); > > > + _mm_storeu_si128((__m128i *)dst, xmm0); } > > > + > > > +/** > > > + * Copy 32 bytes from one location to another, > > > + * locations should not overlap. > > > + */ > > > +static inline void > > > +rte_mov32(uint8_t *dst, const uint8_t *src) { > > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); } > > > + > > > +/** > > > + * Copy 64 bytes from one location to another, > > > + * locations should not overlap. > > > + */ > > > +static inline void > > > +rte_mov64(uint8_t *dst, const uint8_t *src) { > > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > > > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); > > > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); } > > > + > > > +/** > > > + * Copy 128 bytes from one location to another, > > > + * locations should not overlap. 
> > > + */ > > > +static inline void > > > +rte_mov128(uint8_t *dst, const uint8_t *src) { > > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > > > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); > > > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); > > > + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16); > > > + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16); > > > + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16); > > > + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); } > > > + > > > +/** > > > + * Copy 256 bytes from one location to another, > > > + * locations should not overlap. > > > + */ > > > +static inline void > > > +rte_mov256(uint8_t *dst, const uint8_t *src) { > > > + rte_mov16((uint8_t *)dst + 0 * 16, (const uint8_t *)src + 0 * 16); > > > + rte_mov16((uint8_t *)dst + 1 * 16, (const uint8_t *)src + 1 * 16); > > > + rte_mov16((uint8_t *)dst + 2 * 16, (const uint8_t *)src + 2 * 16); > > > + rte_mov16((uint8_t *)dst + 3 * 16, (const uint8_t *)src + 3 * 16); > > > + rte_mov16((uint8_t *)dst + 4 * 16, (const uint8_t *)src + 4 * 16); > > > + rte_mov16((uint8_t *)dst + 5 * 16, (const uint8_t *)src + 5 * 16); > > > + rte_mov16((uint8_t *)dst + 6 * 16, (const uint8_t *)src + 6 * 16); > > > + rte_mov16((uint8_t *)dst + 7 * 16, (const uint8_t *)src + 7 * 16); > > > + rte_mov16((uint8_t *)dst + 8 * 16, (const uint8_t *)src + 8 * 16); > > > + rte_mov16((uint8_t *)dst + 9 * 16, (const uint8_t *)src + 9 * 16); > > > + rte_mov16((uint8_t *)dst + 10 * 16, (const uint8_t *)src + 10 * 16); > > > + rte_mov16((uint8_t *)dst + 11 * 16, (const uint8_t *)src + 11 * 16); > > > + rte_mov16((uint8_t *)dst + 12 * 16, (const uint8_t *)src + 12 * 16); > > > + rte_mov16((uint8_t *)dst + 13 * 16, (const uint8_t *)src + 13 * 16); > > > + rte_mov16((uint8_t *)dst + 14 * 16, (const uint8_t *)src + 14 * 16); > > > + rte_mov16((uint8_t *)dst + 15 * 16, (const uint8_t *)src + 15 * 16); > > > +} > > > + > > > +/** > > > + * Macro for copying unaligned block from one location to another > > > +with constant load offset, > > > + * 47 bytes leftover maximum, > > > + * locations should not overlap. 
> > > + * Requirements: > > > + * - Store is aligned > > > + * - Load offset is <offset>, which must be immediate value within > > > +[1, 15] > > > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit > > > +forwards are available for loading > > > + * - <dst>, <src>, <len> must be variables > > > + * - __m128i <xmm0> ~ <xmm8> must be pre-defined */ > > > +#define MOVEUNALIGNED_LEFT47_IMM(dst, src, len, offset) > > \ > > > +({ \ > > > + int tmp; \ > > > + while (len >= 128 + 16 - offset) { \ > > > + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 0 * 16)); \ > > > + len -= 128; \ > > > + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 1 * 16)); \ > > > + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 2 * 16)); \ > > > + xmm3 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 3 * 16)); \ > > > + xmm4 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 4 * 16)); \ > > > + xmm5 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 5 * 16)); \ > > > + xmm6 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 6 * 16)); \ > > > + xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 7 * 16)); \ > > > + xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 8 * 16)); \ > > > + src = (const uint8_t *)src + 128; \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), > > _mm_alignr_epi8(xmm1, xmm0, offset)); \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), > > _mm_alignr_epi8(xmm2, xmm1, offset)); \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), > > _mm_alignr_epi8(xmm3, xmm2, offset)); \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), > > _mm_alignr_epi8(xmm4, xmm3, offset)); \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), > > _mm_alignr_epi8(xmm5, xmm4, offset)); \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), > > _mm_alignr_epi8(xmm6, xmm5, offset)); \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), > > _mm_alignr_epi8(xmm7, xmm6, offset)); \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), > > _mm_alignr_epi8(xmm8, xmm7, offset)); \ > > > + dst = (uint8_t *)dst + 128; \ > > > + } \ > > > + tmp = len; \ > > > + len = ((len - 16 + offset) & 127) + 16 - offset; \ > > > + tmp -= len; \ > > > + src = (const uint8_t *)src + tmp; \ > > > + dst = (uint8_t *)dst + tmp; \ > > > + if (len >= 32 + 16 - offset) { \ > > > + while (len >= 32 + 16 - offset) { \ > > > + xmm0 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 0 * 16)); \ > > > + len -= 32; \ > > > + xmm1 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 1 * 16)); \ > > > + xmm2 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - > > offset + 2 * 16)); \ > > > + src = (const uint8_t *)src + 32; \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), > > _mm_alignr_epi8(xmm1, xmm0, offset)); \ > > > + _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), > > _mm_alignr_epi8(xmm2, xmm1, offset)); \ > > > + dst = (uint8_t *)dst + 32; \ > > > + } \ > > > + tmp = len; \ > > > + len = ((len - 16 + offset) & 31) + 16 - offset; \ > > > + tmp -= len; \ > > > + src = (const uint8_t *)src + tmp; \ > > > + dst = (uint8_t *)dst + tmp; \ > > > + } \ > > > +}) > > > + > > > +/** > > > + * Macro for copying unaligned block from one location to another, > > > + * 
47 bytes leftover maximum, > > > + * locations should not overlap. > > > + * Use switch here because the aligning instruction requires immediate > > value for shift count. > > > + * Requirements: > > > + * - Store is aligned > > > + * - Load offset is <offset>, which must be within [1, 15] > > > + * - For <src>, make sure <offset> bit backwards & <16 - offset> bit > > > +forwards are available for loading > > > + * - <dst>, <src>, <len> must be variables > > > + * - __m128i <xmm0> ~ <xmm8> used in MOVEUNALIGNED_LEFT47_IMM > > must be > > > +pre-defined */ > > > +#define MOVEUNALIGNED_LEFT47(dst, src, len, offset) \ > > > +({ \ > > > + switch (offset) { \ > > > + case 0x01: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x01); break; \ > > > + case 0x02: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x02); break; \ > > > + case 0x03: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x03); break; \ > > > + case 0x04: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x04); break; \ > > > + case 0x05: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x05); break; \ > > > + case 0x06: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x06); break; \ > > > + case 0x07: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x07); break; \ > > > + case 0x08: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x08); break; \ > > > + case 0x09: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x09); break; \ > > > + case 0x0A: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0A); break; \ > > > + case 0x0B: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0B); break; \ > > > + case 0x0C: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0C); break; \ > > > + case 0x0D: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0D); break; \ > > > + case 0x0E: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0E); break; \ > > > + case 0x0F: MOVEUNALIGNED_LEFT47_IMM(dst, src, n, 0x0F); break; \ > > > + default:; \ > > > + } \ > > > +}) > > > > We probably didn't understand each other properly. > > My thought was to do something like: > > > > #define MOVEUNALIGNED_128(offset) do { \ > > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 0 * 16), > > _mm_alignr_epi8(xmm1, xmm0, offset)); \ > > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 1 * 16), > > _mm_alignr_epi8(xmm2, xmm1, offset)); \ > > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 2 * 16), > > _mm_alignr_epi8(xmm3, xmm2, offset)); \ > > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 3 * 16), > > _mm_alignr_epi8(xmm4, xmm3, offset)); \ > > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 4 * 16), > > _mm_alignr_epi8(xmm5, xmm4, offset)); \ > > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 5 * 16), > > _mm_alignr_epi8(xmm6, xmm5, offset)); \ > > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 6 * 16), > > _mm_alignr_epi8(xmm7, xmm6, offset)); \ > > _mm_storeu_si128((__m128i *)((uint8_t *)dst + 7 * 16), > > _mm_alignr_epi8(xmm8, xmm7, offset)); \ > > } while(0); > > > > > > Then at MOVEUNALIGNED_LEFT47_IMM: > > .... > > xmm7 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 7 > > * 16)); \ > > xmm8 = _mm_loadu_si128((const __m128i *)((const uint8_t *)src - offset + 8 > > * 16)); \ > > src = (const uint8_t *)src + 128; > > > > switch(offset) { > > case 0x01: MOVEUNALIGNED_128(1); break; > > ... > > case 0x0f: MOVEUNALIGNED_128(f); break; } ... > > dst = (uint8_t *)dst + 128 > > > > An then in rte_memcpy() you don't need a switch for > > MOVEUNALIGNED_LEFT47_IMM itself. > > Thought it might help to generate smaller code for rte_ememcpy() while > > keeping performance reasonably high. 
> > > > Konstantin > > > > > > > + > > > +static inline void * > > > +rte_memcpy(void *dst, const void *src, size_t n) { > > > + __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, > > xmm8; > > > + void *ret = dst; > > > + int dstofss; > > > + int srcofs; > > > + > > > + /** > > > + * Copy less than 16 bytes > > > + */ > > > + if (n < 16) { > > > + if (n & 0x01) { > > > + *(uint8_t *)dst = *(const uint8_t *)src; > > > + src = (const uint8_t *)src + 1; > > > + dst = (uint8_t *)dst + 1; > > > + } > > > + if (n & 0x02) { > > > + *(uint16_t *)dst = *(const uint16_t *)src; > > > + src = (const uint16_t *)src + 1; > > > + dst = (uint16_t *)dst + 1; > > > + } > > > + if (n & 0x04) { > > > + *(uint32_t *)dst = *(const uint32_t *)src; > > > + src = (const uint32_t *)src + 1; > > > + dst = (uint32_t *)dst + 1; > > > + } > > > + if (n & 0x08) { > > > + *(uint64_t *)dst = *(const uint64_t *)src; > > > + } > > > + return ret; > > > } > > > > > > - /* > > > - * We split the remaining bytes (which will be less than 64) into > > > - * 16byte (2^4) chunks, using the same switch structure as above. > > > + /** > > > + * Fast way when copy size doesn't exceed 512 bytes > > > */ > > > - switch (3 - (n >> 4)) { > > > - case 0x00: > > > - rte_mov16((uint8_t *)dst, (const uint8_t *)src); > > > - n -= 16; > > > - dst = (uint8_t *)dst + 16; > > > - src = (const uint8_t *)src + 16; /* fallthrough */ > > > - case 0x01: > > > - rte_mov16((uint8_t *)dst, (const uint8_t *)src); > > > - n -= 16; > > > - dst = (uint8_t *)dst + 16; > > > - src = (const uint8_t *)src + 16; /* fallthrough */ > > > - case 0x02: > > > + if (n <= 32) { > > > rte_mov16((uint8_t *)dst, (const uint8_t *)src); > > > - n -= 16; > > > - dst = (uint8_t *)dst + 16; > > > - src = (const uint8_t *)src + 16; /* fallthrough */ > > > - default: > > > - ; > > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + > > n); > > > + return ret; > > > } > > > - > > > - /* Copy any remaining bytes, without going beyond end of buffers > > */ > > > - if (n != 0) { > > > + if (n <= 48) { > > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + > > n); > > > + return ret; > > > + } > > > + if (n <= 64) { > > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > > + rte_mov16((uint8_t *)dst + 32, (const uint8_t *)src + 32); > > > rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t *)src - 16 + > > n); > > > + return ret; > > > + } > > > + if (n <= 128) { > > > + goto COPY_BLOCK_128_BACK15; > > > } > > > - return ret; > > > + if (n <= 512) { > > > + if (n >= 256) { > > > + n -= 256; > > > + rte_mov128((uint8_t *)dst, (const uint8_t *)src); > > > + rte_mov128((uint8_t *)dst + 128, (const uint8_t *)src > > + 128); > > > + src = (const uint8_t *)src + 256; > > > + dst = (uint8_t *)dst + 256; > > > + } > > > +COPY_BLOCK_255_BACK15: > > > + if (n >= 128) { > > > + n -= 128; > > > + rte_mov128((uint8_t *)dst, (const uint8_t *)src); > > > + src = (const uint8_t *)src + 128; > > > + dst = (uint8_t *)dst + 128; > > > + } > > > +COPY_BLOCK_128_BACK15: > > > + if (n >= 64) { > > > + n -= 64; > > > + rte_mov64((uint8_t *)dst, (const uint8_t *)src); > > > + src = (const uint8_t *)src + 64; > > > + dst = (uint8_t *)dst + 64; > > > + } > > > +COPY_BLOCK_64_BACK15: > > > + if (n >= 32) { > > > + n -= 32; > > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > > + src = (const uint8_t *)src + 32; > > > + dst = (uint8_t *)dst + 32; > > > + } > > > + if (n > 16) { > > > + rte_mov16((uint8_t 
*)dst, (const uint8_t *)src); > > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t > > *)src - 16 + n); > > > + return ret; > > > + } > > > + if (n > 0) { > > > + rte_mov16((uint8_t *)dst - 16 + n, (const uint8_t > > *)src - 16 + n); > > > + } > > > + return ret; > > > + } > > > + > > > + /** > > > + * Make store aligned when copy size exceeds 512 bytes, > > > + * and make sure the first 15 bytes are copied, because > > > + * unaligned copy functions require up to 15 bytes > > > + * backwards access. > > > + */ > > > + dstofss = 16 - (int)((long long)(void *)dst & 0x0F) + 16; > > > + n -= dstofss; > > > + rte_mov32((uint8_t *)dst, (const uint8_t *)src); > > > + src = (const uint8_t *)src + dstofss; > > > + dst = (uint8_t *)dst + dstofss; > > > + srcofs = (int)((long long)(const void *)src & 0x0F); > > > + > > > + /** > > > + * For aligned copy > > > + */ > > > + if (srcofs == 0) { > > > + /** > > > + * Copy 256-byte blocks > > > + */ > > > + for (; n >= 256; n -= 256) { > > > + rte_mov256((uint8_t *)dst, (const uint8_t *)src); > > > + dst = (uint8_t *)dst + 256; > > > + src = (const uint8_t *)src + 256; > > > + } > > > + > > > + /** > > > + * Copy whatever left > > > + */ > > > + goto COPY_BLOCK_255_BACK15; > > > + } > > > + > > > + /** > > > + * For copy with unaligned load > > > + */ > > > + MOVEUNALIGNED_LEFT47(dst, src, n, srcofs); > > > + > > > + /** > > > + * Copy whatever left > > > + */ > > > + goto COPY_BLOCK_64_BACK15; > > > } > > > > > > +#endif /* RTE_MACHINE_CPUFLAG_AVX2 */ > > > + > > > #ifdef __cplusplus > > > } > > > #endif > > > -- > > > 1.9.3 ^ permalink raw reply [flat|nested] 12+ messages in thread
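The unaligned path discussed above is dense, and the MOVEUNALIGNED_128() restructuring is only outlined in the review comments, so here is a small, self-contained illustration of the underlying idea: keep the stores 16-byte aligned, read the source from addresses rounded down by the source misalignment (up to 15 bytes backwards), and stitch adjacent loads together with _mm_alignr_epi8, whose shift count must be a compile-time immediate and is therefore dispatched through a switch over literal offsets. The helper name MOVEUNALIGNED_128 follows the outline in the review; everything else (copy128_unaligned, the test driver, buffer sizes) is an assumption made so the example stands alone, and it is a sketch of the technique, not the patch code or what was eventually merged. Build with something like "gcc -O2 -mssse3".

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <tmmintrin.h>  /* SSSE3: _mm_alignr_epi8 */

/*
 * Store-only helper in the spirit of the MOVEUNALIGNED_128() idea above:
 * write 128 bytes to dst by stitching nine pre-loaded registers together.
 * <offset> must be a compile-time immediate in [1, 15]; the switch below
 * supplies it as a literal.
 */
#define MOVEUNALIGNED_128(offset) do { \
    _mm_storeu_si128((__m128i *)(dst + 0 * 16), _mm_alignr_epi8(xmm1, xmm0, offset)); \
    _mm_storeu_si128((__m128i *)(dst + 1 * 16), _mm_alignr_epi8(xmm2, xmm1, offset)); \
    _mm_storeu_si128((__m128i *)(dst + 2 * 16), _mm_alignr_epi8(xmm3, xmm2, offset)); \
    _mm_storeu_si128((__m128i *)(dst + 3 * 16), _mm_alignr_epi8(xmm4, xmm3, offset)); \
    _mm_storeu_si128((__m128i *)(dst + 4 * 16), _mm_alignr_epi8(xmm5, xmm4, offset)); \
    _mm_storeu_si128((__m128i *)(dst + 5 * 16), _mm_alignr_epi8(xmm6, xmm5, offset)); \
    _mm_storeu_si128((__m128i *)(dst + 6 * 16), _mm_alignr_epi8(xmm7, xmm6, offset)); \
    _mm_storeu_si128((__m128i *)(dst + 7 * 16), _mm_alignr_epi8(xmm8, xmm7, offset)); \
} while (0)

/*
 * Copy exactly 128 bytes.  offset is normally src & 0x0F so the nine loads
 * hit 16-byte aligned addresses; they start up to 15 bytes *before* src,
 * which is why rte_memcpy() copies the head of the buffer separately before
 * entering its unaligned path.
 */
static void
copy128_unaligned(uint8_t *dst, const uint8_t *src, int offset)
{
    __m128i xmm0, xmm1, xmm2, xmm3, xmm4, xmm5, xmm6, xmm7, xmm8;

    xmm0 = _mm_loadu_si128((const __m128i *)(src - offset + 0 * 16));
    xmm1 = _mm_loadu_si128((const __m128i *)(src - offset + 1 * 16));
    xmm2 = _mm_loadu_si128((const __m128i *)(src - offset + 2 * 16));
    xmm3 = _mm_loadu_si128((const __m128i *)(src - offset + 3 * 16));
    xmm4 = _mm_loadu_si128((const __m128i *)(src - offset + 4 * 16));
    xmm5 = _mm_loadu_si128((const __m128i *)(src - offset + 5 * 16));
    xmm6 = _mm_loadu_si128((const __m128i *)(src - offset + 6 * 16));
    xmm7 = _mm_loadu_si128((const __m128i *)(src - offset + 7 * 16));
    xmm8 = _mm_loadu_si128((const __m128i *)(src - offset + 8 * 16));

    switch (offset) {
    case 0x01: MOVEUNALIGNED_128(0x01); break;
    case 0x02: MOVEUNALIGNED_128(0x02); break;
    case 0x03: MOVEUNALIGNED_128(0x03); break;
    case 0x04: MOVEUNALIGNED_128(0x04); break;
    case 0x05: MOVEUNALIGNED_128(0x05); break;
    case 0x06: MOVEUNALIGNED_128(0x06); break;
    case 0x07: MOVEUNALIGNED_128(0x07); break;
    case 0x08: MOVEUNALIGNED_128(0x08); break;
    case 0x09: MOVEUNALIGNED_128(0x09); break;
    case 0x0A: MOVEUNALIGNED_128(0x0A); break;
    case 0x0B: MOVEUNALIGNED_128(0x0B); break;
    case 0x0C: MOVEUNALIGNED_128(0x0C); break;
    case 0x0D: MOVEUNALIGNED_128(0x0D); break;
    case 0x0E: MOVEUNALIGNED_128(0x0E); break;
    case 0x0F: MOVEUNALIGNED_128(0x0F); break;
    default:   memcpy(dst, src, 128);   break;  /* offset 0: src already aligned */
    }
}

int
main(void)
{
    /* 16 guard bytes at each end keep the backwards/forwards loads in bounds */
    uint8_t buf[16 + 128 + 16], dst[128];
    int i, offset;

    for (i = 0; i < (int)sizeof(buf); i++)
        buf[i] = (uint8_t)(i * 7 + 3);

    for (offset = 0; offset <= 15; offset++) {
        memset(dst, 0, sizeof(dst));
        copy128_unaligned(dst, buf + 16, offset);
        printf("offset %2d: %s\n", offset,
               memcmp(dst, buf + 16, 128) ? "FAIL" : "OK");
    }
    return 0;
}

Because each case label bakes the offset in as an immediate, the compiler can emit a concrete PALIGNR sequence per offset; putting the switch next to the stores, as suggested in the review, keeps the dispatch in one place instead of expanding the whole load/store body fifteen times inside rte_memcpy().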
* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization
  2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
                  ` (3 preceding siblings ...)
  2015-01-29 2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
@ 2015-01-29 6:16 ` Fu, JingguoX
  2015-02-10 3:06 ` Liang, Cunming
  2015-02-16 15:57 ` De Lara Guarch, Pablo
  6 siblings, 0 replies; 12+ messages in thread
From: Fu, JingguoX @ 2015-01-29 6:16 UTC (permalink / raw)
To: Wang, Zhihong, dev

Basic Information
  Patch name:                            DPDK memcpy optimization v2
  Brief description about test purpose:  Verify memory copy and memory copy performance cases on a variety of OSes
  Test Flag:                             Tested-by
  Tester name:                           jingguox.fu at intel.com
  Test Tool Chain information:           N/A
  Commit ID:                             88fa98a60b34812bfed92e5b2706fcf7e1cbcbc8
  Test Result Summary:                   Total 6 cases, 6 passed, 0 failed

Test environment
- Environment 1:
  OS:  Ubuntu12.04 3.2.0-23-generic X86_64
  GCC: gcc version 4.6.3
  CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
  NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)
- Environment 2:
  OS:  Ubuntu14.04 3.13.0-24-generic
  GCC: gcc version 4.8.2
  CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
  NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)
- Environment 3:
  OS:  Fedora18 3.6.10-4.fc18.x86_64
  GCC: gcc version 4.7.2 20121109
  CPU: Intel(R) Xeon(R) CPU E5-2680 v2 @ 2.80GHz
  NIC: Intel Corporation 82599ES 10-Gigabit SFI/SFP+ [8086:10fb] (rev 01)

Detailed Testing information

Test Case - name:                    test_memcpy
Test Case - Description:             Create two buffers, and initialise one with random values. These are copied to the second buffer and then compared to see if the copy was successful. The bytes outside the copied area are also checked to make sure they were not changed.
Test Case - test sample/application: test application in app/test
Test Case - command / instruction:
  # ./app/test/test -n 1 -c ffff
  #RTE>> memcpy_autotest
Test Case - expected:                #RTE>> Test OK
Test Result:                         PASSED

Test Case - name:                    test_memcpy_perf
Test Case - Description:             Measures copy performance over a number of different sizes and cached/uncached permutations
Test Case - test sample/application: test application in app/test
Test Case - command / instruction:
  # ./app/test/test -n 1 -c ffff
  #RTE>> memcpy_perf_autotest
Test Case - expected:                #RTE>> Test OK
Test Result:                         PASSED

-----Original Message-----
From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang
Sent: Thursday, January 29, 2015 10:39
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization

This patch set optimizes memcpy for DPDK for both SSE and AVX platforms. It also extends memcpy test coverage with unaligned cases and more test points.

Optimization techniques are summarized below:

1. Utilize full cache bandwidth
2. Enforce aligned stores
3. Apply load address alignment based on architecture features
4. Make load/store address available as early as possible
5. General optimization techniques like inlining, branch reducing, prefetch pattern access

--------------
Changes in v2:

1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast build
2. Modified macro definition for better code readability & safety

Zhihong Wang (4):
  app/test: Disabled VTA for memcpy test in app/test/Makefile
  app/test: Removed unnecessary test cases in app/test/test_memcpy.c
  app/test: Extended test coverage in app/test/test_memcpy_perf.c
  lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms

 app/test/Makefile                          |   6 +
 app/test/test_memcpy.c                     |  52 +-
 app/test/test_memcpy_perf.c                | 220 ++++---
 .../common/include/arch/x86/rte_memcpy.h   | 680 +++++++++++++++------
 4 files changed, 654 insertions(+), 304 deletions(-)

--
1.9.3

^ permalink raw reply	[flat|nested] 12+ messages in thread
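For readers who want to see what the functional check above amounts to without building DPDK, the following is a stand-alone sketch of the same idea: fill a source buffer with random bytes, copy a region into a destination buffer surrounded by guard bytes, then verify both the copied region and the guards. It mirrors the description of memcpy_autotest only in spirit; the guard size, the length list, and the use of plain memcpy() instead of rte_memcpy() are assumptions made so the example compiles on its own, and it is not the DPDK test code.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define GUARD   64
#define MAXLEN  1024

static int
check_copy(size_t len)
{
    uint8_t src[GUARD + MAXLEN + GUARD];
    uint8_t dst[GUARD + MAXLEN + GUARD];
    size_t i;

    for (i = 0; i < sizeof(src); i++)
        src[i] = (uint8_t)rand();
    memset(dst, 0xA5, sizeof(dst));

    /* the DPDK test calls rte_memcpy() here; plain memcpy keeps this runnable */
    memcpy(dst + GUARD, src + GUARD, len);

    if (memcmp(dst + GUARD, src + GUARD, len) != 0)
        return -1;                               /* copied area is wrong */
    for (i = 0; i < GUARD; i++)
        if (dst[i] != 0xA5 || dst[GUARD + len + i] != 0xA5)
            return -1;                           /* bytes outside the copy changed */
    return 0;
}

int
main(void)
{
    size_t lens[] = { 0, 1, 15, 16, 17, 63, 64, 65, 255, 256, 511, 512, 1024 };
    size_t i;

    for (i = 0; i < sizeof(lens) / sizeof(lens[0]); i++)
        if (check_copy(lens[i]) != 0) {
            printf("FAIL at len %zu\n", lens[i]);
            return 1;
        }
    printf("all sizes OK\n");
    return 0;
}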
* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization 2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang ` (4 preceding siblings ...) 2015-01-29 6:16 ` [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Fu, JingguoX @ 2015-02-10 3:06 ` Liang, Cunming 2015-02-16 15:57 ` De Lara Guarch, Pablo 6 siblings, 0 replies; 12+ messages in thread From: Liang, Cunming @ 2015-02-10 3:06 UTC (permalink / raw) To: Wang, Zhihong, dev > -----Original Message----- > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang > Sent: Thursday, January 29, 2015 10:39 AM > To: dev@dpdk.org > Subject: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization > > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms. > It also extends memcpy test coverage with unaligned cases and more test points. > > Optimization techniques are summarized below: > > 1. Utilize full cache bandwidth > > 2. Enforce aligned stores > > 3. Apply load address alignment based on architecture features > > 4. Make load/store address available as early as possible > > 5. General optimization techniques like inlining, branch reducing, prefetch > pattern access > > -------------- > Changes in v2: > > 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast build > > 2. Modified macro definition for better code readability & safety > > Zhihong Wang (4): > app/test: Disabled VTA for memcpy test in app/test/Makefile > app/test: Removed unnecessary test cases in app/test/test_memcpy.c > app/test: Extended test coverage in app/test/test_memcpy_perf.c > lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE > and AVX platforms > > app/test/Makefile | 6 + > app/test/test_memcpy.c | 52 +- > app/test/test_memcpy_perf.c | 220 ++++--- > .../common/include/arch/x86/rte_memcpy.h | 680 +++++++++++++++----- > - > 4 files changed, 654 insertions(+), 304 deletions(-) > > -- > 1.9.3 Acked-by: Cunming Liang <cunming.liang@intel.com> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization 2015-01-29 2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang ` (5 preceding siblings ...) 2015-02-10 3:06 ` Liang, Cunming @ 2015-02-16 15:57 ` De Lara Guarch, Pablo 2015-02-25 10:46 ` Thomas Monjalon 6 siblings, 1 reply; 12+ messages in thread From: De Lara Guarch, Pablo @ 2015-02-16 15:57 UTC (permalink / raw) To: Wang, Zhihong, dev > -----Original Message----- > From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of Zhihong Wang > Sent: Thursday, January 29, 2015 2:39 AM > To: dev@dpdk.org > Subject: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization > > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms. > It also extends memcpy test coverage with unaligned cases and more test > points. > > Optimization techniques are summarized below: > > 1. Utilize full cache bandwidth > > 2. Enforce aligned stores > > 3. Apply load address alignment based on architecture features > > 4. Make load/store address available as early as possible > > 5. General optimization techniques like inlining, branch reducing, prefetch > pattern access > > -------------- > Changes in v2: > > 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast > build > > 2. Modified macro definition for better code readability & safety > > Zhihong Wang (4): > app/test: Disabled VTA for memcpy test in app/test/Makefile > app/test: Removed unnecessary test cases in app/test/test_memcpy.c > app/test: Extended test coverage in app/test/test_memcpy_perf.c > lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE > and AVX platforms > > app/test/Makefile | 6 + > app/test/test_memcpy.c | 52 +- > app/test/test_memcpy_perf.c | 220 ++++--- > .../common/include/arch/x86/rte_memcpy.h | 680 > +++++++++++++++------ > 4 files changed, 654 insertions(+), 304 deletions(-) > > -- > 1.9.3 Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com> ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization 2015-02-16 15:57 ` De Lara Guarch, Pablo @ 2015-02-25 10:46 ` Thomas Monjalon 0 siblings, 0 replies; 12+ messages in thread From: Thomas Monjalon @ 2015-02-25 10:46 UTC (permalink / raw) To: Wang, Zhihong, konstantin.ananyev; +Cc: dev > > This patch set optimizes memcpy for DPDK for both SSE and AVX platforms. > > It also extends memcpy test coverage with unaligned cases and more test > > points. > > > > Optimization techniques are summarized below: > > > > 1. Utilize full cache bandwidth > > > > 2. Enforce aligned stores > > > > 3. Apply load address alignment based on architecture features > > > > 4. Make load/store address available as early as possible > > > > 5. General optimization techniques like inlining, branch reducing, prefetch > > pattern access > > > > -------------- > > Changes in v2: > > > > 1. Reduced constant test cases in app/test/test_memcpy_perf.c for fast > > build > > > > 2. Modified macro definition for better code readability & safety > > > > Zhihong Wang (4): > > app/test: Disabled VTA for memcpy test in app/test/Makefile > > app/test: Removed unnecessary test cases in app/test/test_memcpy.c > > app/test: Extended test coverage in app/test/test_memcpy_perf.c > > lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE > > and AVX platforms > > Acked-by: Pablo de Lara <pablo.de.lara.guarch@intel.com> Applied, thanks for the great work! Note: we are still looking for a maintainer of x86 EAL. ^ permalink raw reply [flat|nested] 12+ messages in thread
End of thread.

Thread overview: 12+ messages
  2015-01-29  2:38 [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 1/4] app/test: Disabled VTA for memcpy test in app/test/Makefile Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 2/4] app/test: Removed unnecessary test cases in app/test/test_memcpy.c Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 3/4] app/test: Extended test coverage in app/test/test_memcpy_perf.c Zhihong Wang
  2015-01-29  2:38 ` [dpdk-dev] [PATCH v2 4/4] lib/librte_eal: Optimized memcpy in arch/x86/rte_memcpy.h for both SSE and AVX platforms Zhihong Wang
  2015-01-29 15:17   ` Ananyev, Konstantin
  2015-01-30  5:57     ` Wang, Zhihong
  2015-01-30 10:44       ` Ananyev, Konstantin
  2015-01-29  6:16 ` [dpdk-dev] [PATCH v2 0/4] DPDK memcpy optimization Fu, JingguoX
  2015-02-10  3:06 ` Liang, Cunming
  2015-02-16 15:57 ` De Lara Guarch, Pablo
  2015-02-25 10:46   ` Thomas Monjalon