* [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms
@ 2016-01-14 6:13 Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
` (5 more replies)
0 siblings, 6 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14 6:13 UTC (permalink / raw)
To: dev
This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
utilization of hardware resources and deliver high performance.
In current DPDK, memcpy holds a large proportion of execution time in
libs like Vhost, especially for large packets, and this patch can bring
considerable benefits.
The implementation is based on the current DPDK memcpy framework, some
background introduction can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html
Code changes are:
1. Read CPUID to check if AVX512 is supported by CPU
2. Predefine AVX512 macro if AVX512 is enabled by compiler
3. Implement AVX512 memcpy and choose the right implementation based on
predefined macros
4. Decide alignment unit for memcpy perf test based on predefined macros
Zhihong Wang (4):
lib/librte_eal: Identify AVX512 CPU flag
mk: Predefine AVX512 macro for compiler
lib/librte_eal: Optimize memcpy for AVX512 platforms
app/test: Adjust alignment unit for memcpy perf test
app/test/test_memcpy_perf.c | 6 +
.../common/include/arch/x86/rte_cpuflags.h | 2 +
.../common/include/arch/x86/rte_memcpy.h | 247 ++++++++++++++++++++-
mk/rte.cpuflags.mk | 4 +
4 files changed, 255 insertions(+), 4 deletions(-)
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag
2016-01-14 6:13 [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
@ 2016-01-14 6:13 ` Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 2/4] mk: Predefine AVX512 macro for compiler Zhihong Wang
` (4 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14 6:13 UTC (permalink / raw)
To: dev
Read CPUID to check if AVX512 is supported by CPU.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
lib/librte_eal/common/include/arch/x86/rte_cpuflags.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
index dd56553..89c0d9d 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
@@ -131,6 +131,7 @@ enum rte_cpu_flag_t {
RTE_CPUFLAG_ERMS, /**< ERMS */
RTE_CPUFLAG_INVPCID, /**< INVPCID */
RTE_CPUFLAG_RTM, /**< Transactional memory */
+ RTE_CPUFLAG_AVX512F, /**< AVX512F */
/* (EAX 80000001h) ECX features */
RTE_CPUFLAG_LAHF_SAHF, /**< LAHF_SAHF */
@@ -238,6 +239,7 @@ static const struct feature_entry cpu_feature_table[] = {
FEAT_DEF(ERMS, 0x00000007, 0, RTE_REG_EBX, 8)
FEAT_DEF(INVPCID, 0x00000007, 0, RTE_REG_EBX, 10)
FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11)
+ FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX, 0)
FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX, 4)
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH 2/4] mk: Predefine AVX512 macro for compiler
2016-01-14 6:13 [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
@ 2016-01-14 6:13 ` Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 3/4] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
` (3 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14 6:13 UTC (permalink / raw)
To: dev
Predefine AVX512 macro if AVX512 is enabled by compiler.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
mk/rte.cpuflags.mk | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index 28f203b..19a3e7e 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -89,6 +89,10 @@ ifneq ($(filter $(AUTO_CPUFLAGS),__AVX2__),)
CPUFLAGS += AVX2
endif
+ifneq ($(filter $(AUTO_CPUFLAGS),__AVX512F__),)
+CPUFLAGS += AVX512F
+endif
+
# IBM Power CPU flags
ifneq ($(filter $(AUTO_CPUFLAGS),__PPC64__),)
CPUFLAGS += PPC64
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH 3/4] lib/librte_eal: Optimize memcpy for AVX512 platforms
2016-01-14 6:13 [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 2/4] mk: Predefine AVX512 macro for compiler Zhihong Wang
@ 2016-01-14 6:13 ` Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 4/4] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
` (2 subsequent siblings)
5 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14 6:13 UTC (permalink / raw)
To: dev
Implement AVX512 memcpy and choose the right implementation based on
predefined macros, to make full utilization of hardware resources and
deliver high performance.
In current DPDK, memcpy holds a large proportion of execution time in
libs like Vhost, especially for large packets, and this patch can bring
considerable benefits for AVX512 platforms.
The implementation is based on the current DPDK memcpy framework, some
background introduction can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
.../common/include/arch/x86/rte_memcpy.h | 247 ++++++++++++++++++++-
1 file changed, 243 insertions(+), 4 deletions(-)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 6a57426..fee954a 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -37,7 +37,7 @@
/**
* @file
*
- * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
*/
#include <stdio.h>
@@ -67,7 +67,246 @@ extern "C" {
static inline void *
rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+ __m128i xmm0;
+
+ xmm0 = _mm_loadu_si128((const __m128i *)src);
+ _mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+ __m256i ymm0;
+
+ ymm0 = _mm256_loadu_si256((const __m256i *)src);
+ _mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+ __m512i zmm0;
+
+ zmm0 = _mm512_loadu_si512((const void *)src);
+ _mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+ rte_mov64(dst + 0 * 64, src + 0 * 64);
+ rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+ rte_mov64(dst + 0 * 64, src + 0 * 64);
+ rte_mov64(dst + 1 * 64, src + 1 * 64);
+ rte_mov64(dst + 2 * 64, src + 2 * 64);
+ rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+ __m512i zmm0, zmm1;
+
+ while (n >= 128) {
+ zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+ n -= 128;
+ zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+ src = src + 128;
+ _mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+ _mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+ dst = dst + 128;
+ }
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+ __m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+ while (n >= 512) {
+ zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+ n -= 512;
+ zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+ zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+ zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+ zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+ zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+ zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+ zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+ src = src + 512;
+ _mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+ _mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+ _mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+ _mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+ _mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+ _mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+ _mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+ _mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+ dst = dst + 512;
+ }
+}
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+ uintptr_t dstu = (uintptr_t)dst;
+ uintptr_t srcu = (uintptr_t)src;
+ void *ret = dst;
+ size_t dstofss;
+ size_t bits;
+
+ /**
+ * Copy less than 16 bytes
+ */
+ if (n < 16) {
+ if (n & 0x01) {
+ *(uint8_t *)dstu = *(const uint8_t *)srcu;
+ srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+ dstu = (uintptr_t)((uint8_t *)dstu + 1);
+ }
+ if (n & 0x02) {
+ *(uint16_t *)dstu = *(const uint16_t *)srcu;
+ srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+ dstu = (uintptr_t)((uint16_t *)dstu + 1);
+ }
+ if (n & 0x04) {
+ *(uint32_t *)dstu = *(const uint32_t *)srcu;
+ srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+ dstu = (uintptr_t)((uint32_t *)dstu + 1);
+ }
+ if (n & 0x08)
+ *(uint64_t *)dstu = *(const uint64_t *)srcu;
+ return ret;
+ }
+
+ /**
+ * Fast way when copy size doesn't exceed 512 bytes
+ */
+ if (n <= 32) {
+ rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov16((uint8_t *)dst - 16 + n,
+ (const uint8_t *)src - 16 + n);
+ return ret;
+ }
+ if (n <= 64) {
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov32((uint8_t *)dst - 32 + n,
+ (const uint8_t *)src - 32 + n);
+ return ret;
+ }
+ if (n <= 512) {
+ if (n >= 256) {
+ n -= 256;
+ rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 256;
+ dst = (uint8_t *)dst + 256;
+ }
+ if (n >= 128) {
+ n -= 128;
+ rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 128;
+ dst = (uint8_t *)dst + 128;
+ }
+COPY_BLOCK_128_BACK63:
+ if (n > 64) {
+ rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov64((uint8_t *)dst - 64 + n,
+ (const uint8_t *)src - 64 + n);
+ return ret;
+ }
+ if (n > 0)
+ rte_mov64((uint8_t *)dst - 64 + n,
+ (const uint8_t *)src - 64 + n);
+ return ret;
+ }
+
+ /**
+ * Make store aligned when copy size exceeds 512 bytes
+ */
+ dstofss = ((uintptr_t)dst & 0x3F);
+ if (dstofss > 0) {
+ dstofss = 64 - dstofss;
+ n -= dstofss;
+ rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + dstofss;
+ dst = (uint8_t *)dst + dstofss;
+ }
+
+ /**
+ * Copy 512-byte blocks.
+ * Use copy block function for better instruction order control,
+ * which is important when load is unaligned.
+ */
+ rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+ bits = n;
+ n = n & 511;
+ bits -= n;
+ src = (const uint8_t *)src + bits;
+ dst = (uint8_t *)dst + bits;
+
+ /**
+ * Copy 128-byte blocks.
+ * Use copy block function for better instruction order control,
+ * which is important when load is unaligned.
+ */
+ if (n >= 128) {
+ rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+ bits = n;
+ n = n & 127;
+ bits -= n;
+ src = (const uint8_t *)src + bits;
+ dst = (uint8_t *)dst + bits;
+ }
+
+ /**
+ * Copy whatever left
+ */
+ goto COPY_BLOCK_128_BACK63;
+}
+
+#elif RTE_MACHINE_CPUFLAG_AVX2
/**
* AVX2 implementation below
@@ -311,7 +550,7 @@ COPY_BLOCK_64_BACK31:
goto COPY_BLOCK_64_BACK31;
}
-#else /* RTE_MACHINE_CPUFLAG_AVX2 */
+#else /* RTE_MACHINE_CPUFLAG */
/**
* SSE & AVX implementation below
@@ -630,7 +869,7 @@ COPY_BLOCK_64_BACK15:
goto COPY_BLOCK_64_BACK15;
}
-#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
+#endif /* RTE_MACHINE_CPUFLAG */
#ifdef __cplusplus
}
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH 4/4] app/test: Adjust alignment unit for memcpy perf test
2016-01-14 6:13 [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
` (2 preceding siblings ...)
2016-01-14 6:13 ` [dpdk-dev] [PATCH 3/4] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
@ 2016-01-14 6:13 ` Zhihong Wang
2016-01-14 16:48 ` [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Stephen Hemminger
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
5 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-14 6:13 UTC (permalink / raw)
To: dev
Decide alignment unit for memcpy perf test based on predefined macros.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
app/test/test_memcpy_perf.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 754828e..73babec 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -79,7 +79,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
#define TEST_BATCH_SIZE 100
/* Data is aligned on this many bytes (power of 2) */
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define ALIGNMENT_UNIT 64
+#elif RTE_MACHINE_CPUFLAG_AVX2
#define ALIGNMENT_UNIT 32
+#else /* RTE_MACHINE_CPUFLAG */
+#define ALIGNMENT_UNIT 16
+#endif /* RTE_MACHINE_CPUFLAG */
/*
* Pointers used in performance tests. The two large buffers are for uncached
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms
2016-01-14 6:13 [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
` (3 preceding siblings ...)
2016-01-14 6:13 ` [dpdk-dev] [PATCH 4/4] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
@ 2016-01-14 16:48 ` Stephen Hemminger
2016-01-15 6:39 ` Wang, Zhihong
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
5 siblings, 1 reply; 23+ messages in thread
From: Stephen Hemminger @ 2016-01-14 16:48 UTC (permalink / raw)
To: Zhihong Wang; +Cc: dev
On Thu, 14 Jan 2016 01:13:18 -0500
Zhihong Wang <zhihong.wang@intel.com> wrote:
> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
>
> In current DPDK, memcpy holds a large proportion of execution time in
> libs like Vhost, especially for large packets, and this patch can bring
> considerable benefits.
>
> The implementation is based on the current DPDK memcpy framework, some
> background introduction can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
>
> Code changes are:
>
> 1. Read CPUID to check if AVX512 is supported by CPU
>
> 2. Predefine AVX512 macro if AVX512 is enabled by compiler
>
> 3. Implement AVX512 memcpy and choose the right implementation based on
> predefined macros
>
> 4. Decide alignment unit for memcpy perf test based on predefined macros
>
> Zhihong Wang (4):
> lib/librte_eal: Identify AVX512 CPU flag
> mk: Predefine AVX512 macro for compiler
> lib/librte_eal: Optimize memcpy for AVX512 platforms
> app/test: Adjust alignment unit for memcpy perf test
>
> app/test/test_memcpy_perf.c | 6 +
> .../common/include/arch/x86/rte_cpuflags.h | 2 +
> .../common/include/arch/x86/rte_memcpy.h | 247 ++++++++++++++++++++-
> mk/rte.cpuflags.mk | 4 +
> 4 files changed, 255 insertions(+), 4 deletions(-)
>
This really looks like code that could benefit from Gcc
function multiversioning. The current cpuflags model is useless/flawed
in real product deployment
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms
2016-01-14 16:48 ` [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Stephen Hemminger
@ 2016-01-15 6:39 ` Wang, Zhihong
2016-01-15 22:03 ` Vincent JARDIN
0 siblings, 1 reply; 23+ messages in thread
From: Wang, Zhihong @ 2016-01-15 6:39 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, January 15, 2016 12:49 AM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Xie, Huawei
> <huawei.xie@intel.com>
> Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
>
> On Thu, 14 Jan 2016 01:13:18 -0500
> Zhihong Wang <zhihong.wang@intel.com> wrote:
>
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy holds a large proportion of execution time in
> > libs like Vhost, especially for large packets, and this patch can bring
> > considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework, some
> > background introduction can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> > 1. Read CPUID to check if AVX512 is supported by CPU
> >
> > 2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> > 3. Implement AVX512 memcpy and choose the right implementation based
> on
> > predefined macros
> >
> > 4. Decide alignment unit for memcpy perf test based on predefined macros
> >
> > Zhihong Wang (4):
> > lib/librte_eal: Identify AVX512 CPU flag
> > mk: Predefine AVX512 macro for compiler
> > lib/librte_eal: Optimize memcpy for AVX512 platforms
> > app/test: Adjust alignment unit for memcpy perf test
> >
> > app/test/test_memcpy_perf.c | 6 +
> > .../common/include/arch/x86/rte_cpuflags.h | 2 +
> > .../common/include/arch/x86/rte_memcpy.h | 247
> ++++++++++++++++++++-
> > mk/rte.cpuflags.mk | 4 +
> > 4 files changed, 255 insertions(+), 4 deletions(-)
> >
>
> This really looks like code that could benefit from Gcc
> function multiversioning. The current cpuflags model is useless/flawed
> in real product deployment
I've tried gcc function multi versioning, with a simple add() function
which returns a + b, and a loop calling it for millions of times. Turned
out this mechanism adds 17% extra time to execute, overall it's a lot
of extra overhead.
Quote the gcc wiki: "GCC takes care of doing the dispatching to call
the right version at runtime". So it loses inlining and adds extra
dispatching overhead.
Also this mechanism works only for C++, right?
I think using predefined macros at compile time is more efficient and
suits DPDK more.
Could you please give an example when the current CPU flags model
stop working? So I can fix it.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms
2016-01-15 6:39 ` Wang, Zhihong
@ 2016-01-15 22:03 ` Vincent JARDIN
0 siblings, 0 replies; 23+ messages in thread
From: Vincent JARDIN @ 2016-01-15 22:03 UTC (permalink / raw)
To: Wang, Zhihong; +Cc: dev
Le 14 janv. 2016 22:39, "Wang, Zhihong" <zhihong.wang@intel.com> a écrit :
>
>
>
> > -----Original Message-----
> > From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> > Sent: Friday, January 15, 2016 12:49 AM
> > To: Wang, Zhihong <zhihong.wang@intel.com>
> > Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> > Richardson, Bruce <bruce.richardson@intel.com>; Xie, Huawei
> > <huawei.xie@intel.com>
> > Subject: Re: [PATCH 0/4] Optimize memcpy for AVX512 platforms
> >
> > On Thu, 14 Jan 2016 01:13:18 -0500
> > Zhihong Wang <zhihong.wang@intel.com> wrote:
> >
> > > This patch set optimizes DPDK memcpy for AVX512 platforms, to make
full
> > > utilization of hardware resources and deliver high performance.
> > >
> > > In current DPDK, memcpy holds a large proportion of execution time in
> > > libs like Vhost, especially for large packets, and this patch can
bring
> > > considerable benefits.
> > >
> > > The implementation is based on the current DPDK memcpy framework, some
> > > background introduction can be found in these threads:
> > > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> > >
> > > Code changes are:
> > >
> > > 1. Read CPUID to check if AVX512 is supported by CPU
> > >
> > > 2. Predefine AVX512 macro if AVX512 is enabled by compiler
> > >
> > > 3. Implement AVX512 memcpy and choose the right implementation based
> > on
> > > predefined macros
> > >
> > > 4. Decide alignment unit for memcpy perf test based on predefined
macros
> > >
> > > Zhihong Wang (4):
> > > lib/librte_eal: Identify AVX512 CPU flag
> > > mk: Predefine AVX512 macro for compiler
> > > lib/librte_eal: Optimize memcpy for AVX512 platforms
> > > app/test: Adjust alignment unit for memcpy perf test
> > >
> > > app/test/test_memcpy_perf.c | 6 +
> > > .../common/include/arch/x86/rte_cpuflags.h | 2 +
> > > .../common/include/arch/x86/rte_memcpy.h | 247
> > ++++++++++++++++++++-
> > > mk/rte.cpuflags.mk | 4 +
> > > 4 files changed, 255 insertions(+), 4 deletions(-)
> > >
> >
> > This really looks like code that could benefit from Gcc
> > function multiversioning. The current cpuflags model is useless/flawed
> > in real product deployment
>
>
> I've tried gcc function multi versioning, with a simple add() function
> which returns a + b, and a loop calling it for millions of times. Turned
> out this mechanism adds 17% extra time to execute, overall it's a lot
> of extra overhead.
>
> Quote the gcc wiki: "GCC takes care of doing the dispatching to call
> the right version at runtime". So it loses inlining and adds extra
> dispatching overhead.
>
> Also this mechanism works only for C++, right?
>
> I think using predefined macros at compile time is more efficient and
> suits DPDK more.
>
I agree with you: performance first.
So having a mix of runtime and compile time would work. For those who are
ok with some performance drops, they can go with runtime.
> Could you please give an example when the current CPU flags model
> stop working? So I can fix it.
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2016-01-14 6:13 [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
` (4 preceding siblings ...)
2016-01-14 16:48 ` [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Stephen Hemminger
@ 2016-01-18 3:05 ` Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
` (8 more replies)
5 siblings, 9 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18 3:05 UTC (permalink / raw)
To: dev
This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
utilization of hardware resources and deliver high performance.
In current DPDK, memcpy holds a large proportion of execution time in
libs like Vhost, especially for large packets, and this patch can bring
considerable benefits.
The implementation is based on the current DPDK memcpy framework, some
background introduction can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html
Code changes are:
1. Read CPUID to check if AVX512 is supported by CPU
2. Predefine AVX512 macro if AVX512 is enabled by compiler
3. Implement AVX512 memcpy and choose the right implementation based on
predefined macros
4. Decide alignment unit for memcpy perf test based on predefined macros
--------------
Changes in v2:
1. Tune performance for prior platforms
Zhihong Wang (5):
lib/librte_eal: Identify AVX512 CPU flag
mk: Predefine AVX512 macro for compiler
lib/librte_eal: Optimize memcpy for AVX512 platforms
app/test: Adjust alignment unit for memcpy perf test
lib/librte_eal: Tune memcpy for prior platforms
app/test/test_memcpy_perf.c | 6 +
.../common/include/arch/x86/rte_cpuflags.h | 2 +
.../common/include/arch/x86/rte_memcpy.h | 269 ++++++++++++++++++++-
mk/rte.cpuflags.mk | 4 +
4 files changed, 268 insertions(+), 13 deletions(-)
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
@ 2016-01-18 3:05 ` Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 2/5] mk: Predefine AVX512 macro for compiler Zhihong Wang
` (7 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18 3:05 UTC (permalink / raw)
To: dev
Read CPUID to check if AVX512 is supported by CPU.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
lib/librte_eal/common/include/arch/x86/rte_cpuflags.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
index dd56553..89c0d9d 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_cpuflags.h
@@ -131,6 +131,7 @@ enum rte_cpu_flag_t {
RTE_CPUFLAG_ERMS, /**< ERMS */
RTE_CPUFLAG_INVPCID, /**< INVPCID */
RTE_CPUFLAG_RTM, /**< Transactional memory */
+ RTE_CPUFLAG_AVX512F, /**< AVX512F */
/* (EAX 80000001h) ECX features */
RTE_CPUFLAG_LAHF_SAHF, /**< LAHF_SAHF */
@@ -238,6 +239,7 @@ static const struct feature_entry cpu_feature_table[] = {
FEAT_DEF(ERMS, 0x00000007, 0, RTE_REG_EBX, 8)
FEAT_DEF(INVPCID, 0x00000007, 0, RTE_REG_EBX, 10)
FEAT_DEF(RTM, 0x00000007, 0, RTE_REG_EBX, 11)
+ FEAT_DEF(AVX512F, 0x00000007, 0, RTE_REG_EBX, 16)
FEAT_DEF(LAHF_SAHF, 0x80000001, 0, RTE_REG_ECX, 0)
FEAT_DEF(LZCNT, 0x80000001, 0, RTE_REG_ECX, 4)
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH v2 2/5] mk: Predefine AVX512 macro for compiler
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
@ 2016-01-18 3:05 ` Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 3/5] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
` (6 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18 3:05 UTC (permalink / raw)
To: dev
Predefine AVX512 macro if AVX512 is enabled by compiler.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
mk/rte.cpuflags.mk | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/mk/rte.cpuflags.mk b/mk/rte.cpuflags.mk
index 28f203b..19a3e7e 100644
--- a/mk/rte.cpuflags.mk
+++ b/mk/rte.cpuflags.mk
@@ -89,6 +89,10 @@ ifneq ($(filter $(AUTO_CPUFLAGS),__AVX2__),)
CPUFLAGS += AVX2
endif
+ifneq ($(filter $(AUTO_CPUFLAGS),__AVX512F__),)
+CPUFLAGS += AVX512F
+endif
+
# IBM Power CPU flags
ifneq ($(filter $(AUTO_CPUFLAGS),__PPC64__),)
CPUFLAGS += PPC64
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH v2 3/5] lib/librte_eal: Optimize memcpy for AVX512 platforms
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 2/5] mk: Predefine AVX512 macro for compiler Zhihong Wang
@ 2016-01-18 3:05 ` Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 4/5] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
` (5 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18 3:05 UTC (permalink / raw)
To: dev
Implement AVX512 memcpy and choose the right implementation based on
predefined macros, to make full utilization of hardware resources and
deliver high performance.
In current DPDK, memcpy holds a large proportion of execution time in
libs like Vhost, especially for large packets, and this patch can bring
considerable benefits for AVX512 platforms.
The implementation is based on the current DPDK memcpy framework, some
background introduction can be found in these threads:
http://dpdk.org/ml/archives/dev/2014-November/008158.html
http://dpdk.org/ml/archives/dev/2015-January/011800.html
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
.../common/include/arch/x86/rte_memcpy.h | 247 ++++++++++++++++++++-
1 file changed, 243 insertions(+), 4 deletions(-)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index 6a57426..fee954a 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -37,7 +37,7 @@
/**
* @file
*
- * Functions for SSE/AVX/AVX2 implementation of memcpy().
+ * Functions for SSE/AVX/AVX2/AVX512 implementation of memcpy().
*/
#include <stdio.h>
@@ -67,7 +67,246 @@ extern "C" {
static inline void *
rte_memcpy(void *dst, const void *src, size_t n) __attribute__((always_inline));
-#ifdef RTE_MACHINE_CPUFLAG_AVX2
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+
+/**
+ * AVX512 implementation below
+ */
+
+/**
+ * Copy 16 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov16(uint8_t *dst, const uint8_t *src)
+{
+ __m128i xmm0;
+
+ xmm0 = _mm_loadu_si128((const __m128i *)src);
+ _mm_storeu_si128((__m128i *)dst, xmm0);
+}
+
+/**
+ * Copy 32 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov32(uint8_t *dst, const uint8_t *src)
+{
+ __m256i ymm0;
+
+ ymm0 = _mm256_loadu_si256((const __m256i *)src);
+ _mm256_storeu_si256((__m256i *)dst, ymm0);
+}
+
+/**
+ * Copy 64 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov64(uint8_t *dst, const uint8_t *src)
+{
+ __m512i zmm0;
+
+ zmm0 = _mm512_loadu_si512((const void *)src);
+ _mm512_storeu_si512((void *)dst, zmm0);
+}
+
+/**
+ * Copy 128 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128(uint8_t *dst, const uint8_t *src)
+{
+ rte_mov64(dst + 0 * 64, src + 0 * 64);
+ rte_mov64(dst + 1 * 64, src + 1 * 64);
+}
+
+/**
+ * Copy 256 bytes from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov256(uint8_t *dst, const uint8_t *src)
+{
+ rte_mov64(dst + 0 * 64, src + 0 * 64);
+ rte_mov64(dst + 1 * 64, src + 1 * 64);
+ rte_mov64(dst + 2 * 64, src + 2 * 64);
+ rte_mov64(dst + 3 * 64, src + 3 * 64);
+}
+
+/**
+ * Copy 128-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov128blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+ __m512i zmm0, zmm1;
+
+ while (n >= 128) {
+ zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+ n -= 128;
+ zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+ src = src + 128;
+ _mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+ _mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+ dst = dst + 128;
+ }
+}
+
+/**
+ * Copy 512-byte blocks from one location to another,
+ * locations should not overlap.
+ */
+static inline void
+rte_mov512blocks(uint8_t *dst, const uint8_t *src, size_t n)
+{
+ __m512i zmm0, zmm1, zmm2, zmm3, zmm4, zmm5, zmm6, zmm7;
+
+ while (n >= 512) {
+ zmm0 = _mm512_loadu_si512((const void *)(src + 0 * 64));
+ n -= 512;
+ zmm1 = _mm512_loadu_si512((const void *)(src + 1 * 64));
+ zmm2 = _mm512_loadu_si512((const void *)(src + 2 * 64));
+ zmm3 = _mm512_loadu_si512((const void *)(src + 3 * 64));
+ zmm4 = _mm512_loadu_si512((const void *)(src + 4 * 64));
+ zmm5 = _mm512_loadu_si512((const void *)(src + 5 * 64));
+ zmm6 = _mm512_loadu_si512((const void *)(src + 6 * 64));
+ zmm7 = _mm512_loadu_si512((const void *)(src + 7 * 64));
+ src = src + 512;
+ _mm512_storeu_si512((void *)(dst + 0 * 64), zmm0);
+ _mm512_storeu_si512((void *)(dst + 1 * 64), zmm1);
+ _mm512_storeu_si512((void *)(dst + 2 * 64), zmm2);
+ _mm512_storeu_si512((void *)(dst + 3 * 64), zmm3);
+ _mm512_storeu_si512((void *)(dst + 4 * 64), zmm4);
+ _mm512_storeu_si512((void *)(dst + 5 * 64), zmm5);
+ _mm512_storeu_si512((void *)(dst + 6 * 64), zmm6);
+ _mm512_storeu_si512((void *)(dst + 7 * 64), zmm7);
+ dst = dst + 512;
+ }
+}
+
+static inline void *
+rte_memcpy(void *dst, const void *src, size_t n)
+{
+ uintptr_t dstu = (uintptr_t)dst;
+ uintptr_t srcu = (uintptr_t)src;
+ void *ret = dst;
+ size_t dstofss;
+ size_t bits;
+
+ /**
+ * Copy less than 16 bytes
+ */
+ if (n < 16) {
+ if (n & 0x01) {
+ *(uint8_t *)dstu = *(const uint8_t *)srcu;
+ srcu = (uintptr_t)((const uint8_t *)srcu + 1);
+ dstu = (uintptr_t)((uint8_t *)dstu + 1);
+ }
+ if (n & 0x02) {
+ *(uint16_t *)dstu = *(const uint16_t *)srcu;
+ srcu = (uintptr_t)((const uint16_t *)srcu + 1);
+ dstu = (uintptr_t)((uint16_t *)dstu + 1);
+ }
+ if (n & 0x04) {
+ *(uint32_t *)dstu = *(const uint32_t *)srcu;
+ srcu = (uintptr_t)((const uint32_t *)srcu + 1);
+ dstu = (uintptr_t)((uint32_t *)dstu + 1);
+ }
+ if (n & 0x08)
+ *(uint64_t *)dstu = *(const uint64_t *)srcu;
+ return ret;
+ }
+
+ /**
+ * Fast way when copy size doesn't exceed 512 bytes
+ */
+ if (n <= 32) {
+ rte_mov16((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov16((uint8_t *)dst - 16 + n,
+ (const uint8_t *)src - 16 + n);
+ return ret;
+ }
+ if (n <= 64) {
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov32((uint8_t *)dst - 32 + n,
+ (const uint8_t *)src - 32 + n);
+ return ret;
+ }
+ if (n <= 512) {
+ if (n >= 256) {
+ n -= 256;
+ rte_mov256((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 256;
+ dst = (uint8_t *)dst + 256;
+ }
+ if (n >= 128) {
+ n -= 128;
+ rte_mov128((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + 128;
+ dst = (uint8_t *)dst + 128;
+ }
+COPY_BLOCK_128_BACK63:
+ if (n > 64) {
+ rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+ rte_mov64((uint8_t *)dst - 64 + n,
+ (const uint8_t *)src - 64 + n);
+ return ret;
+ }
+ if (n > 0)
+ rte_mov64((uint8_t *)dst - 64 + n,
+ (const uint8_t *)src - 64 + n);
+ return ret;
+ }
+
+ /**
+ * Make store aligned when copy size exceeds 512 bytes
+ */
+ dstofss = ((uintptr_t)dst & 0x3F);
+ if (dstofss > 0) {
+ dstofss = 64 - dstofss;
+ n -= dstofss;
+ rte_mov64((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + dstofss;
+ dst = (uint8_t *)dst + dstofss;
+ }
+
+ /**
+ * Copy 512-byte blocks.
+ * Use copy block function for better instruction order control,
+ * which is important when load is unaligned.
+ */
+ rte_mov512blocks((uint8_t *)dst, (const uint8_t *)src, n);
+ bits = n;
+ n = n & 511;
+ bits -= n;
+ src = (const uint8_t *)src + bits;
+ dst = (uint8_t *)dst + bits;
+
+ /**
+ * Copy 128-byte blocks.
+ * Use copy block function for better instruction order control,
+ * which is important when load is unaligned.
+ */
+ if (n >= 128) {
+ rte_mov128blocks((uint8_t *)dst, (const uint8_t *)src, n);
+ bits = n;
+ n = n & 127;
+ bits -= n;
+ src = (const uint8_t *)src + bits;
+ dst = (uint8_t *)dst + bits;
+ }
+
+ /**
+ * Copy whatever left
+ */
+ goto COPY_BLOCK_128_BACK63;
+}
+
+#elif RTE_MACHINE_CPUFLAG_AVX2
/**
* AVX2 implementation below
@@ -311,7 +550,7 @@ COPY_BLOCK_64_BACK31:
goto COPY_BLOCK_64_BACK31;
}
-#else /* RTE_MACHINE_CPUFLAG_AVX2 */
+#else /* RTE_MACHINE_CPUFLAG */
/**
* SSE & AVX implementation below
@@ -630,7 +869,7 @@ COPY_BLOCK_64_BACK15:
goto COPY_BLOCK_64_BACK15;
}
-#endif /* RTE_MACHINE_CPUFLAG_AVX2 */
+#endif /* RTE_MACHINE_CPUFLAG */
#ifdef __cplusplus
}
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH v2 4/5] app/test: Adjust alignment unit for memcpy perf test
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
` (2 preceding siblings ...)
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 3/5] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
@ 2016-01-18 3:05 ` Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 5/5] lib/librte_eal: Tune memcpy for prior platforms Zhihong Wang
` (4 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18 3:05 UTC (permalink / raw)
To: dev
Decide alignment unit for memcpy perf test based on predefined macros.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
app/test/test_memcpy_perf.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/app/test/test_memcpy_perf.c b/app/test/test_memcpy_perf.c
index 754828e..73babec 100644
--- a/app/test/test_memcpy_perf.c
+++ b/app/test/test_memcpy_perf.c
@@ -79,7 +79,13 @@ static size_t buf_sizes[TEST_VALUE_RANGE];
#define TEST_BATCH_SIZE 100
/* Data is aligned on this many bytes (power of 2) */
+#ifdef RTE_MACHINE_CPUFLAG_AVX512F
+#define ALIGNMENT_UNIT 64
+#elif RTE_MACHINE_CPUFLAG_AVX2
#define ALIGNMENT_UNIT 32
+#else /* RTE_MACHINE_CPUFLAG */
+#define ALIGNMENT_UNIT 16
+#endif /* RTE_MACHINE_CPUFLAG */
/*
* Pointers used in performance tests. The two large buffers are for uncached
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* [dpdk-dev] [PATCH v2 5/5] lib/librte_eal: Tune memcpy for prior platforms
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
` (3 preceding siblings ...)
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 4/5] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
@ 2016-01-18 3:05 ` Zhihong Wang
2016-01-18 20:06 ` [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms Stephen Hemminger
` (3 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: Zhihong Wang @ 2016-01-18 3:05 UTC (permalink / raw)
To: dev
For prior platforms, add condition for unalignment handling, to keep this
operation from interrupting the batch copy loop for aligned cases.
Signed-off-by: Zhihong Wang <zhihong.wang@intel.com>
---
.../common/include/arch/x86/rte_memcpy.h | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
index fee954a..d965957 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_memcpy.h
@@ -513,10 +513,12 @@ COPY_BLOCK_64_BACK31:
* Make store aligned when copy size exceeds 512 bytes
*/
dstofss = 32 - ((uintptr_t)dst & 0x1F);
- n -= dstofss;
- rte_mov32((uint8_t *)dst, (const uint8_t *)src);
- src = (const uint8_t *)src + dstofss;
- dst = (uint8_t *)dst + dstofss;
+ if (dstofss > 0) {
+ n -= dstofss;
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + dstofss;
+ dst = (uint8_t *)dst + dstofss;
+ }
/**
* Copy 256-byte blocks.
@@ -833,11 +835,13 @@ COPY_BLOCK_64_BACK15:
* backwards access.
*/
dstofss = 16 - ((uintptr_t)dst & 0x0F) + 16;
- n -= dstofss;
- rte_mov32((uint8_t *)dst, (const uint8_t *)src);
- src = (const uint8_t *)src + dstofss;
- dst = (uint8_t *)dst + dstofss;
- srcofs = ((uintptr_t)src & 0x0F);
+ if (dstofss > 0) {
+ n -= dstofss;
+ rte_mov32((uint8_t *)dst, (const uint8_t *)src);
+ src = (const uint8_t *)src + dstofss;
+ dst = (uint8_t *)dst + dstofss;
+ srcofs = ((uintptr_t)src & 0x0F);
+ }
/**
* For aligned copy
--
2.5.0
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
` (4 preceding siblings ...)
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 5/5] lib/librte_eal: Tune memcpy for prior platforms Zhihong Wang
@ 2016-01-18 20:06 ` Stephen Hemminger
2016-01-19 2:37 ` Wang, Zhihong
2016-01-27 15:23 ` Thomas Monjalon
` (2 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: Stephen Hemminger @ 2016-01-18 20:06 UTC (permalink / raw)
To: Zhihong Wang; +Cc: dev
On Sun, 17 Jan 2016 22:05:09 -0500
Zhihong Wang <zhihong.wang@intel.com> wrote:
> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
>
> In current DPDK, memcpy holds a large proportion of execution time in
> libs like Vhost, especially for large packets, and this patch can bring
> considerable benefits.
>
> The implementation is based on the current DPDK memcpy framework, some
> background introduction can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
>
> Code changes are:
>
> 1. Read CPUID to check if AVX512 is supported by CPU
>
> 2. Predefine AVX512 macro if AVX512 is enabled by compiler
>
> 3. Implement AVX512 memcpy and choose the right implementation based on
> predefined macros
>
> 4. Decide alignment unit for memcpy perf test based on predefined macros
Cool, I like it. How much impact does this have on VHOST?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2016-01-18 20:06 ` [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms Stephen Hemminger
@ 2016-01-19 2:37 ` Wang, Zhihong
0 siblings, 0 replies; 23+ messages in thread
From: Wang, Zhihong @ 2016-01-19 2:37 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: dev
> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Tuesday, January 19, 2016 4:06 AM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Ananyev, Konstantin <konstantin.ananyev@intel.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Xie, Huawei
> <huawei.xie@intel.com>
> Subject: Re: [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
>
> On Sun, 17 Jan 2016 22:05:09 -0500
> Zhihong Wang <zhihong.wang@intel.com> wrote:
>
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
> >
> > In current DPDK, memcpy holds a large proportion of execution time in
> > libs like Vhost, especially for large packets, and this patch can bring
> > considerable benefits.
> >
> > The implementation is based on the current DPDK memcpy framework, some
> > background introduction can be found in these threads:
> > http://dpdk.org/ml/archives/dev/2014-November/008158.html
> > http://dpdk.org/ml/archives/dev/2015-January/011800.html
> >
> > Code changes are:
> >
> > 1. Read CPUID to check if AVX512 is supported by CPU
> >
> > 2. Predefine AVX512 macro if AVX512 is enabled by compiler
> >
> > 3. Implement AVX512 memcpy and choose the right implementation based
> on
> > predefined macros
> >
> > 4. Decide alignment unit for memcpy perf test based on predefined macros
>
> Cool, I like it. How much impact does this have on VHOST?
The impact is significant especially for enqueue (Detailed numbers might not
be appropriate here due to policy :-), only how I test it), because VHOST actually
spends a lot of time doing memcpy. Simply measure 1024B RX/TX time cost and
compare it with 64B's and you'll get a sense of it, although not precise.
My test cases include NIC2VM2NIC and VM2VM scenarios, which are the main
use cases currently, and use both throughput and RX/TX cycles for evaluation.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
` (5 preceding siblings ...)
2016-01-18 20:06 ` [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms Stephen Hemminger
@ 2016-01-27 15:23 ` Thomas Monjalon
2016-01-28 6:09 ` Wang, Zhihong
2016-01-27 15:30 ` Thomas Monjalon
2017-08-30 9:37 ` linhaifeng
8 siblings, 1 reply; 23+ messages in thread
From: Thomas Monjalon @ 2016-01-27 15:23 UTC (permalink / raw)
To: Zhihong Wang; +Cc: dev
2016-01-17 22:05, Zhihong Wang:
> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
On a related note, your expertise would be very valuable to review
these patches please:
(memcpy) http://dpdk.org/dev/patchwork/patch/4396/
(memcmp) http://dpdk.org/dev/patchwork/patch/4788/
Thanks
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
` (6 preceding siblings ...)
2016-01-27 15:23 ` Thomas Monjalon
@ 2016-01-27 15:30 ` Thomas Monjalon
2016-01-27 18:48 ` Ananyev, Konstantin
2017-08-30 9:37 ` linhaifeng
8 siblings, 1 reply; 23+ messages in thread
From: Thomas Monjalon @ 2016-01-27 15:30 UTC (permalink / raw)
To: bruce.richardson, konstantin.ananyev; +Cc: dev
> Zhihong Wang (5):
> lib/librte_eal: Identify AVX512 CPU flag
> mk: Predefine AVX512 macro for compiler
> lib/librte_eal: Optimize memcpy for AVX512 platforms
> app/test: Adjust alignment unit for memcpy perf test
> lib/librte_eal: Tune memcpy for prior platforms
>
> app/test/test_memcpy_perf.c | 6 +
> .../common/include/arch/x86/rte_cpuflags.h | 2 +
> .../common/include/arch/x86/rte_memcpy.h | 269 ++++++++++++++++++++-
> mk/rte.cpuflags.mk | 4 +
> 4 files changed, 268 insertions(+), 13 deletions(-)
The maintainers of arch/x86 are Bruce and Konstantin.
I guess there is no comment and we can apply this cool series?
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2016-01-27 15:30 ` Thomas Monjalon
@ 2016-01-27 18:48 ` Ananyev, Konstantin
2016-01-27 20:18 ` Thomas Monjalon
0 siblings, 1 reply; 23+ messages in thread
From: Ananyev, Konstantin @ 2016-01-27 18:48 UTC (permalink / raw)
To: Thomas Monjalon, Richardson, Bruce; +Cc: dev
Hi Thomas,
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, January 27, 2016 3:31 PM
> To: Richardson, Bruce; Ananyev, Konstantin
> Cc: dev@dpdk.org; Wang, Zhihong
> Subject: Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
>
> > Zhihong Wang (5):
> > lib/librte_eal: Identify AVX512 CPU flag
> > mk: Predefine AVX512 macro for compiler
> > lib/librte_eal: Optimize memcpy for AVX512 platforms
> > app/test: Adjust alignment unit for memcpy perf test
> > lib/librte_eal: Tune memcpy for prior platforms
> >
> > app/test/test_memcpy_perf.c | 6 +
> > .../common/include/arch/x86/rte_cpuflags.h | 2 +
> > .../common/include/arch/x86/rte_memcpy.h | 269 ++++++++++++++++++++-
> > mk/rte.cpuflags.mk | 4 +
> > 4 files changed, 268 insertions(+), 13 deletions(-)
>
> The maintainers of arch/x86 are Bruce and Konstantin.
> I guess there is no comment and we can apply this cool series?
Yes, looks ok to me.
Konstantin
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2016-01-27 18:48 ` Ananyev, Konstantin
@ 2016-01-27 20:18 ` Thomas Monjalon
0 siblings, 0 replies; 23+ messages in thread
From: Thomas Monjalon @ 2016-01-27 20:18 UTC (permalink / raw)
To: Wang, Zhihong; +Cc: dev
2016-01-27 18:48, Ananyev, Konstantin:
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> >
> > > Zhihong Wang (5):
> > > lib/librte_eal: Identify AVX512 CPU flag
> > > mk: Predefine AVX512 macro for compiler
> > > lib/librte_eal: Optimize memcpy for AVX512 platforms
> > > app/test: Adjust alignment unit for memcpy perf test
> > > lib/librte_eal: Tune memcpy for prior platforms
> > >
> > > app/test/test_memcpy_perf.c | 6 +
> > > .../common/include/arch/x86/rte_cpuflags.h | 2 +
> > > .../common/include/arch/x86/rte_memcpy.h | 269 ++++++++++++++++++++-
> > > mk/rte.cpuflags.mk | 4 +
> > > 4 files changed, 268 insertions(+), 13 deletions(-)
> >
> > The maintainers of arch/x86 are Bruce and Konstantin.
> > I guess there is no comment and we can apply this cool series?
>
> Yes, looks ok to me.
Applied, thanks
Some benchmark feedbacks would be welcome.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2016-01-27 15:23 ` Thomas Monjalon
@ 2016-01-28 6:09 ` Wang, Zhihong
0 siblings, 0 replies; 23+ messages in thread
From: Wang, Zhihong @ 2016-01-28 6:09 UTC (permalink / raw)
To: Thomas Monjalon; +Cc: dev
> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Wednesday, January 27, 2016 11:24 PM
> To: Wang, Zhihong <zhihong.wang@intel.com>
> Cc: dev@dpdk.org; Ravi Kerur <rkerur@gmail.com>
> Subject: Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
>
> 2016-01-17 22:05, Zhihong Wang:
> > This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> > utilization of hardware resources and deliver high performance.
>
> On a related note, your expertise would be very valuable to review
> these patches please:
> (memcpy) http://dpdk.org/dev/patchwork/patch/4396/
> (memcmp) http://dpdk.org/dev/patchwork/patch/4788/
Will do, thanks.
>
> Thanks
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
` (7 preceding siblings ...)
2016-01-27 15:30 ` Thomas Monjalon
@ 2017-08-30 9:37 ` linhaifeng
2017-09-18 5:10 ` Wang, Zhihong
8 siblings, 1 reply; 23+ messages in thread
From: linhaifeng @ 2017-08-30 9:37 UTC (permalink / raw)
To: Zhihong Wang, dev
在 2016/1/18 11:05, Zhihong Wang 写道:
> This patch set optimizes DPDK memcpy for AVX512 platforms, to make full
> utilization of hardware resources and deliver high performance.
>
> In current DPDK, memcpy holds a large proportion of execution time in
> libs like Vhost, especially for large packets, and this patch can bring
> considerable benefits.
>
> The implementation is based on the current DPDK memcpy framework, some
> background introduction can be found in these threads:
> http://dpdk.org/ml/archives/dev/2014-November/008158.html
> http://dpdk.org/ml/archives/dev/2015-January/011800.html
>
> Code changes are:
>
> 1. Read CPUID to check if AVX512 is supported by CPU
>
> 2. Predefine AVX512 macro if AVX512 is enabled by compiler
>
> 3. Implement AVX512 memcpy and choose the right implementation based on
> predefined macros
>
> 4. Decide alignment unit for memcpy perf test based on predefined macros
>
> --------------
> Changes in v2:
>
> 1. Tune performance for prior platforms
>
> Zhihong Wang (5):
> lib/librte_eal: Identify AVX512 CPU flag
> mk: Predefine AVX512 macro for compiler
> lib/librte_eal: Optimize memcpy for AVX512 platforms
> app/test: Adjust alignment unit for memcpy perf test
> lib/librte_eal: Tune memcpy for prior platforms
>
> app/test/test_memcpy_perf.c | 6 +
> .../common/include/arch/x86/rte_cpuflags.h | 2 +
> .../common/include/arch/x86/rte_memcpy.h | 269 ++++++++++++++++++++-
> mk/rte.cpuflags.mk | 4 +
> 4 files changed, 268 insertions(+), 13 deletions(-)
>
Hi Zhihong Wang
I test avx512 rte_memcpy found the performanc for ovs dpdk is lower than avx2 rte_memcpy.
The vm loop test for ovs dpdk results:
avx512 is *15*Gbps
perf data:
0.52 │ vmovdq (%r8,%r10,1),%zmm0
95.33 │ sub $0x40,%r9
0.45 │ add $0x40,%r8
0.60 │ vmovdq %zmm0,-0x40(%r8)
1.84 │ cmp $0x3f,%r9
│ ↓ ja f20
│ lea -0x40(%rsi),%r8
0.15 │ or $0xffffffffffffffc0,%rsi
0.21 │ and $0xffffffffffffffc0,%r8
0.00 │ lea 0x40(%rsi,%r8,1),%rsi
0.00 │ vmovdq (%rcx,%rsi,1),%zmm0
0.22 │ vmovdq %zmm0,(%rdx,%rsi,1)
0.67 │ ↓ jmpq c78
│ mov -0x128(%rbp),%rdi
│ rex.R
│ .byte 0x89
│ popfq
avx2 is *18.8*Gbps
perf data:
0.96 │ add %r9,%r13
66.04 │ vmovdq (%rdx),%ymm0
1.20 │ sub $0x40,%rdi
1.53 │ add $0x40,%rdx
10.83 │ vmovdq %ymm0,-0x40(%rdx,%r15,1)
8.64 │ vmovdq -0x20(%rdx),%ymm0
7.58 │ vmovdq %ymm0,-0x40(%rdx,%r13,1)
dpdk version: v17.05
ovs version: 2.8.90
qemu version: QEMU emulator version 2.9.94 (v2.10.0-rc4-dirty)
gcc version: gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
kernal version: 3.10.0
compile dpdk:
CONFIG_RTE_ENABLE_AVX512=y
export DPDK_DIR=$PWD
export DPDK_TARGET=x86_64-native-linuxapp-gcc
export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
make install T=$DPDK_TARGET DESTDIR=install
compile ovs:
sh boot.sh
./configure CFLAGS="-g -O2" --with-dpdk=$DPDK_BUILD --prefix=/usr --localstatedir=/var --sysconfdir=/etc
make -j
make install
The test for dpdk test_memcpy_perf:
avx2:
** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
======= ============== ============== ============== ==============
Size Cache to cache Cache to mem Mem to cache Mem to mem
(bytes) (ticks) (ticks) (ticks) (ticks)
------- -------------- -------------- -------------- --------------
========================== 32B aligned ============================
64 6 - 10 27 - 52 30 - 39 56 - 97
512 24 - 44 251 - 271 145 - 217 396 - 447
1024 35 - 78 394 - 433 252 - 319 609 - 670
------- -------------- -------------- -------------- --------------
C 64 3 - 9 28 - 31 29 - 40 55 - 66
C 512 25 - 55 253 - 268 139 - 268 397 - 410
C 1024 32 - 83 394 - 416 250 - 396 612 - 687
=========================== Unaligned =============================
64 8 - 9 85 - 71 45 - 45 125 - 121
512 33 - 49 282 - 305 153 - 252 420 - 478
1024 42 - 83 409 - 491 259 - 389 640 - 748
------- -------------- -------------- -------------- --------------
C 64 4 - 9 42 - 46 39 - 46 76 - 90
C 512 33 - 55 280 - 272 153 - 281 421 - 415
C 1024 41 - 83 407 - 427 258 - 405 578 - 701
======= ============== ============== ============== ==============
avx512:
** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
======= ============== ============== ============== ==============
Size Cache to cache Cache to mem Mem to cache Mem to mem
(bytes) (ticks) (ticks) (ticks) (ticks)
------- -------------- -------------- -------------- --------------
========================== 64B aligned ============================
64 6 - 9 18 - 33 24 - 38 40 - 65
512 18 - 44 178 - 262 138 - 218 309 - 429
1024 27 - 79 338 - 430 250 - 322 560 - 674
------- -------------- -------------- -------------- --------------
C 64 3 - 9 18 - 20 23 - 41 39 - 50
C 512 15 - 54 205 - 270 134 - 268 304 - 409
C 1024 24 - 83 371 - 414 242 - 400 550 - 692
=========================== Unaligned =============================
64 8 - 9 87 - 74 45 - 48 125 - 118
512 23 - 49 298 - 311 150 - 250 437 - 482
1024 36 - 83 427 - 505 259 - 406 633 - 754
------- -------------- -------------- -------------- --------------
C 64 4 - 9 42 - 46 39 - 46 76 - 94
C 512 23 - 55 246 - 277 152 - 290 349 - 426
C 1024 38 - 83 398 - 431 258 - 416 634 - 725
======= ============== ============== ============== ==============
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms
2017-08-30 9:37 ` linhaifeng
@ 2017-09-18 5:10 ` Wang, Zhihong
0 siblings, 0 replies; 23+ messages in thread
From: Wang, Zhihong @ 2017-09-18 5:10 UTC (permalink / raw)
To: linhaifeng, dev
> Hi Zhihong Wang
>
> I test avx512 rte_memcpy found the performanc for ovs dpdk is lower than
> avx2 rte_memcpy.
Hi Haifeng,
AVX512 memcpy is marked as experimental and disabled by default, its
benefit varies from case to case. So enable it only when the case
(SW + HW setup with expected data pattern) is verified.
BTW, it's not recommended to use micro benchmarks like test_memcpy_perf
for memcpy performance report as they aren't likely able to reflect
performance of real world applications, please find more details at
https://software.intel.com/en-us/articles/performance-optimization-of-memcpy-in-dpdk
Thanks
Zhihong
>
> The vm loop test for ovs dpdk results:
> avx512 is *15*Gbps
> perf data:
> 0.52 │ vmovdq (%r8,%r10,1),%zmm0
> 95.33 │ sub $0x40,%r9
> 0.45 │ add $0x40,%r8
> 0.60 │ vmovdq %zmm0,-0x40(%r8)
> 1.84 │ cmp $0x3f,%r9
> │ ↓ ja f20
> │ lea -0x40(%rsi),%r8
> 0.15 │ or $0xffffffffffffffc0,%rsi
> 0.21 │ and $0xffffffffffffffc0,%r8
> 0.00 │ lea 0x40(%rsi,%r8,1),%rsi
> 0.00 │ vmovdq (%rcx,%rsi,1),%zmm0
> 0.22 │ vmovdq %zmm0,(%rdx,%rsi,1)
> 0.67 │ ↓ jmpq c78
> │ mov -0x128(%rbp),%rdi
> │ rex.R
> │ .byte 0x89
> │ popfq
>
> avx2 is *18.8*Gbps
> perf data:
> 0.96 │ add %r9,%r13
> 66.04 │ vmovdq (%rdx),%ymm0
> 1.20 │ sub $0x40,%rdi
> 1.53 │ add $0x40,%rdx
> 10.83 │ vmovdq %ymm0,-0x40(%rdx,%r15,1)
> 8.64 │ vmovdq -0x20(%rdx),%ymm0
> 7.58 │ vmovdq %ymm0,-0x40(%rdx,%r13,1)
>
>
> dpdk version: v17.05
> ovs version: 2.8.90
> qemu version: QEMU emulator version 2.9.94 (v2.10.0-rc4-dirty)
>
> gcc version: gcc (GCC) 4.9.2 20150212 (Red Hat 4.9.2-6)
> kernal version: 3.10.0
>
>
> compile dpdk:
> CONFIG_RTE_ENABLE_AVX512=y
> export DPDK_DIR=$PWD
> export DPDK_TARGET=x86_64-native-linuxapp-gcc
> export DPDK_BUILD=$DPDK_DIR/$DPDK_TARGET
> make install T=$DPDK_TARGET DESTDIR=install
>
> compile ovs:
> sh boot.sh
> ./configure CFLAGS="-g -O2" --with-dpdk=$DPDK_BUILD --prefix=/usr --
> localstatedir=/var --sysconfdir=/etc
> make -j
> make install
>
> The test for dpdk test_memcpy_perf:
> avx2:
> ** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
> ======= ============== ============== ==============
> ==============
> Size Cache to cache Cache to mem Mem to cache Mem to mem
> (bytes) (ticks) (ticks) (ticks) (ticks)
> ------- -------------- -------------- -------------- --------------
> ========================== 32B aligned
> ============================
> 64 6 - 10 27 - 52 30 - 39 56 - 97
> 512 24 - 44 251 - 271 145 - 217 396 - 447
> 1024 35 - 78 394 - 433 252 - 319 609 - 670
> ------- -------------- -------------- -------------- --------------
> C 64 3 - 9 28 - 31 29 - 40 55 - 66
> C 512 25 - 55 253 - 268 139 - 268 397 - 410
> C 1024 32 - 83 394 - 416 250 - 396 612 - 687
> =========================== Unaligned
> =============================
> 64 8 - 9 85 - 71 45 - 45 125 - 121
> 512 33 - 49 282 - 305 153 - 252 420 - 478
> 1024 42 - 83 409 - 491 259 - 389 640 - 748
> ------- -------------- -------------- -------------- --------------
> C 64 4 - 9 42 - 46 39 - 46 76 - 90
> C 512 33 - 55 280 - 272 153 - 281 421 - 415
> C 1024 41 - 83 407 - 427 258 - 405 578 - 701
> ======= ============== ============== ==============
> ==============
>
> avx512:
> ** rte_memcpy() - memcpy perf. tests (C = compile-time constant) **
> ======= ============== ============== ==============
> ==============
> Size Cache to cache Cache to mem Mem to cache Mem to mem
> (bytes) (ticks) (ticks) (ticks) (ticks)
> ------- -------------- -------------- -------------- --------------
> ========================== 64B aligned
> ============================
> 64 6 - 9 18 - 33 24 - 38 40 - 65
> 512 18 - 44 178 - 262 138 - 218 309 - 429
> 1024 27 - 79 338 - 430 250 - 322 560 - 674
> ------- -------------- -------------- -------------- --------------
> C 64 3 - 9 18 - 20 23 - 41 39 - 50
> C 512 15 - 54 205 - 270 134 - 268 304 - 409
> C 1024 24 - 83 371 - 414 242 - 400 550 - 692
> =========================== Unaligned
> =============================
> 64 8 - 9 87 - 74 45 - 48 125 - 118
> 512 23 - 49 298 - 311 150 - 250 437 - 482
> 1024 36 - 83 427 - 505 259 - 406 633 - 754
> ------- -------------- -------------- -------------- --------------
> C 64 4 - 9 42 - 46 39 - 46 76 - 94
> C 512 23 - 55 246 - 277 152 - 290 349 - 426
> C 1024 38 - 83 398 - 431 258 - 416 634 - 725
> ======= ============== ============== ==============
> ==============
>
>
>
>
>
>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2017-09-18 5:11 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-01-14 6:13 [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 1/4] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 2/4] mk: Predefine AVX512 macro for compiler Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 3/4] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
2016-01-14 6:13 ` [dpdk-dev] [PATCH 4/4] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
2016-01-14 16:48 ` [dpdk-dev] [PATCH 0/4] Optimize memcpy for AVX512 platforms Stephen Hemminger
2016-01-15 6:39 ` Wang, Zhihong
2016-01-15 22:03 ` Vincent JARDIN
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 0/5] " Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 1/5] lib/librte_eal: Identify AVX512 CPU flag Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 2/5] mk: Predefine AVX512 macro for compiler Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 3/5] lib/librte_eal: Optimize memcpy for AVX512 platforms Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 4/5] app/test: Adjust alignment unit for memcpy perf test Zhihong Wang
2016-01-18 3:05 ` [dpdk-dev] [PATCH v2 5/5] lib/librte_eal: Tune memcpy for prior platforms Zhihong Wang
2016-01-18 20:06 ` [dpdk-dev] [PATCH v2 0/5] Optimize memcpy for AVX512 platforms Stephen Hemminger
2016-01-19 2:37 ` Wang, Zhihong
2016-01-27 15:23 ` Thomas Monjalon
2016-01-28 6:09 ` Wang, Zhihong
2016-01-27 15:30 ` Thomas Monjalon
2016-01-27 18:48 ` Ananyev, Konstantin
2016-01-27 20:18 ` Thomas Monjalon
2017-08-30 9:37 ` linhaifeng
2017-09-18 5:10 ` Wang, Zhihong
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).