From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
To: Herbert Guan <herbert.guan@arm.com>
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH v4] arch/arm: optimization for memcpy on AArch64
Date: Wed, 3 Jan 2018 19:05:15 +0530
Message-ID: <20180103133513.GA30368@jerin>
In-Reply-To: <1513834427-12635-1-git-send-email-herbert.guan@arm.com>
-----Original Message-----
> Date: Thu, 21 Dec 2017 13:33:47 +0800
> From: Herbert Guan <herbert.guan@arm.com>
> To: dev@dpdk.org, jerin.jacob@caviumnetworks.com
> CC: Herbert Guan <herbert.guan@arm.com>
> Subject: [PATCH v4] arch/arm: optimization for memcpy on AArch64
> X-Mailer: git-send-email 1.8.3.1
>
> This patch provides an option to do rte_memcpy() using 'restrict'
> qualifier, which can induce GCC to do optimizations by using more
> efficient instructions, providing some performance gain over memcpy()
> on some AArch64 platforms/environments.
>
> The memory copy performance differs between different AArch64
> platforms. And a more recent glibc (e.g. 2.23 or later)
> can provide a better memcpy() performance compared to old glibc
> versions. It's always suggested to use a more recent glibc if
> possible, from which the entire system can get benefit. If for some
> reason an old glibc has to be used, this patch is provided for an
> alternative.
>
> This implementation can improve memory copy on some AArch64
> platforms, when an old glibc (e.g. 2.19, 2.17...) is being used.
> It is disabled by default and needs "RTE_ARCH_ARM64_MEMCPY"
> defined to activate. It does not always provide better performance
> than memcpy(), so users need to run the DPDK unit test
> "memcpy_perf_autotest" and customize the parameters in the "customization
> section" of rte_memcpy_64.h for best performance.
>
> Compiler version will also impact the rte_memcpy() performance.
> It's observed on some platforms and with the same code, GCC 7.2.0
> compiled binary can provide better performance than GCC 4.8.5. It's
> suggested to use GCC 5.4.0 or later.
>
> Signed-off-by: Herbert Guan <herbert.guan@arm.com>
Looks good. Please find inline requests for some minor changes.
Feel free to add my Acked-by with those changes.
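For anyone following the thread, here is a minimal sketch of the idea in the commit message: with 'restrict' the compiler may assume the two buffers never overlap, which leaves it free to pick wider load/store instructions. The function names are hypothetical and not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: copy n bytes between two
 * buffers that the caller promises do not overlap.
 */
static inline void
example_copy_restrict(uint8_t *restrict dst, const uint8_t *restrict src,
		      size_t n)
{
	size_t i;

	assert(dst != NULL && src != NULL);
	/* 'restrict' tells the compiler dst[] and src[] never alias. */
	for (i = 0; i < n; i++)
		dst[i] = src[i];
}

/* Small self-check: returns 1 once every byte has been verified. */
static inline int
example_copy_restrict_check(void)
{
	uint8_t src[32], dst[32] = {0};
	size_t i;

	for (i = 0; i < sizeof(src); i++)
		src[i] = (uint8_t)i;
	example_copy_restrict(dst, src, sizeof(src));
	for (i = 0; i < sizeof(dst); i++)
		assert(dst[i] == src[i]);
	return 1;
}
```

Whether this actually beats memcpy() depends on the compiler and the platform, as the commit message already says.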
> ---
> config/common_armv8a_linuxapp | 6 +
> .../common/include/arch/arm/rte_memcpy_64.h | 287 +++++++++++++++++++++
> 2 files changed, 293 insertions(+)
>
> diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
> index 6732d1e..8f0cbed 100644
> --- a/config/common_armv8a_linuxapp
> +++ b/config/common_armv8a_linuxapp
> @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y
> # to address minimum DMA alignment across all arm64 implementations.
> CONFIG_RTE_CACHE_LINE_SIZE=128
>
> +# Accelerate rte_memcpy. Be sure to run unit test to determine the
> +# best threshold in code. Refer to notes in source file
> +# (lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h) for more
> +# info.
> +CONFIG_RTE_ARCH_ARM64_MEMCPY=n
> +
> CONFIG_RTE_LIBRTE_FM10K_PMD=n
> CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
> CONFIG_RTE_LIBRTE_AVP_PMD=n
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> index b80d8ba..b269f34 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> @@ -42,6 +42,291 @@
>
> #include "generic/rte_memcpy.h"
>
> +#ifdef RTE_ARCH_ARM64_MEMCPY
> +#include <rte_common.h>
> +#include <rte_branch_prediction.h>
> +
> +/*
> + * The memory copy performance differs on different AArch64 micro-architectures.
> + * And the most recent glibc (e.g. 2.23 or later) can provide a better memcpy()
> + * performance compared to old glibc versions. It's always suggested to use a
> + * more recent glibc if possible, from which the entire system can get benefit.
> + *
> + * This implementation improves memory copy on some aarch64 micro-architectures,
> + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is disabled by
> + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate. It's not
> + * always providing better performance than memcpy() so users need to run unit
> + * test "memcpy_perf_autotest" and customize parameters in customization section
> + * below for best performance.
> + *
> + * Compiler version will also impact the rte_memcpy() performance. It's observed
> + * on some platforms and with the same code, GCC 7.2.0 compiled binaries can
> + * provide better performance than GCC 4.8.5 compiled binaries.
> + */
> +
> +/**************************************
> + * Beginning of customization section
> + **************************************/
> +#define RTE_ARM64_MEMCPY_ALIGN_MASK 0x0F
> +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> +/* Only src unalignment will be treated as unaligned copy */
> +#define IS_UNALIGNED_COPY(dst, src) \
Better to rename this to RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY, as it is
defined in a public DPDK header file.
> + ((uintptr_t)(dst) & RTE_ARM64_MEMCPY_ALIGN_MASK)
> +#else
> +/* Both dst and src unalignment will be treated as unaligned copy */
> +#define IS_UNALIGNED_COPY(dst, src) \
> + (((uintptr_t)(dst) | (uintptr_t)(src)) & RTE_ARM64_MEMCPY_ALIGN_MASK)
Same as above
> +#endif
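A rough sketch of what the renamed macro could look like, only to illustrate the prefix change. The bodies mirror the quoted hunk as-is; note that the non-strict variant tests dst there, although its comment mentions src. The self-check function is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RTE_ARM64_MEMCPY_ALIGN_MASK 0x0F

#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
/* Mirrors the quoted hunk: only the dst address is tested here. */
#define RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY(dst, src) \
	((uintptr_t)(dst) & RTE_ARM64_MEMCPY_ALIGN_MASK)
#else
/* Either pointer off a 16-byte boundary counts as unaligned. */
#define RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY(dst, src) \
	(((uintptr_t)(dst) | (uintptr_t)(src)) & RTE_ARM64_MEMCPY_ALIGN_MASK)
#endif

/* Hypothetical self-check; holds for both variants above. */
static inline int
example_alignment_check(void)
{
	_Alignas(16) uint8_t buf[48];

	/* Both pointers sit on a 16-byte boundary. */
	assert(!RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY(buf, buf + 16));
	/* dst is one byte past the boundary. */
	assert(RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY(buf + 1, buf + 16));
	return 1;
}
```

With the RTE_ARM64_MEMCPY_ prefix the macro is much less likely to collide with an application's own IS_UNALIGNED_COPY.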
> +
> +
> +/*
> + * If copy size is larger than threshold, memcpy() will be used.
> + * Run "memcpy_perf_autotest" to determine the proper threshold.
> + */
> +#define RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD ((size_t)(0xffffffff))
> +#define RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD ((size_t)(0xffffffff))
> +
> +/*
> + * The logic of USE_RTE_MEMCPY() can also be modified to best fit platform.
> + */
> +#define USE_RTE_MEMCPY(dst, src, n) \
> +((!IS_UNALIGNED_COPY(dst, src) && n <= RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD) \
> +|| (IS_UNALIGNED_COPY(dst, src) && n <= RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD))
> +
> +
> +/**************************************
> + * End of customization section
> + **************************************/
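The selection logic in the customization section can be pictured as the inline function below. This is a sketch under assumptions: the threshold values are invented for illustration (the patch defaults both to 0xffffffff and expects them to be tuned with memcpy_perf_autotest), the strict alignment variant is used, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative thresholds only; real values come from benchmarking. */
#define EXAMPLE_ALIGNED_THRESHOLD   ((size_t)2048)
#define EXAMPLE_UNALIGNED_THRESHOLD ((size_t)512)
#define EXAMPLE_ALIGN_MASK          0x0F

/* Returns true when the custom copy should be used instead of memcpy(). */
static inline bool
example_use_rte_memcpy(const void *dst, const void *src, size_t n)
{
	/* Strict variant: either pointer off a 16-byte boundary counts. */
	bool unaligned =
		(((uintptr_t)dst | (uintptr_t)src) & EXAMPLE_ALIGN_MASK) != 0;

	return n <= (unaligned ? EXAMPLE_UNALIGNED_THRESHOLD
			       : EXAMPLE_ALIGNED_THRESHOLD);
}

/* Hypothetical self-check; the buffers are never dereferenced. */
static inline int
example_threshold_check(void)
{
	_Alignas(16) uint8_t a[32], b[32];

	assert(example_use_rte_memcpy(a, b, 2048));
	assert(!example_use_rte_memcpy(a, b, 2049));
	assert(example_use_rte_memcpy(a + 1, b, 512));
	assert(!example_use_rte_memcpy(a + 1, b, 513));
	return 1;
}
```

A function form also evaluates n exactly once, whereas the quoted macro expands n unparenthesized.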
> +#if defined(RTE_TOOLCHAIN_GCC) && !defined(RTE_AARCH64_SKIP_GCC_VERSION_CHECK)
To maintain consistency:
s/RTE_AARCH64_SKIP_GCC_VERSION_CHECK/RTE_ARM64_MEMCPY_SKIP_GCC_VERSION_CHECK/
> +#if (GCC_VERSION < 50400)
> +#warning "The GCC version is quite old, which may result in sub-optimal \
> +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> +be used."
> +#endif
> +#endif
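A small sketch of the version guard with the suggested macro name. It assumes GCC_VERSION is encoded as major * 10000 + minor * 100 + patch, which is what the comparison against 50400 in the quoted hunk implies; the EXAMPLE_ names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: build a 5-digit version number. */
#define EXAMPLE_VERSION(major, minor, patch) \
	((major) * 10000 + (minor) * 100 + (patch))

#define EXAMPLE_MIN_GCC_VERSION EXAMPLE_VERSION(5, 4, 0)

/* 5.4.0 must encode to the literal used in the quoted hunk. */
static_assert(EXAMPLE_MIN_GCC_VERSION == 50400, "5.4.0 encodes as 50400");

#if defined(__GNUC__) && !defined(__clang__) && \
	!defined(RTE_ARM64_MEMCPY_SKIP_GCC_VERSION_CHECK)
#if EXAMPLE_VERSION(__GNUC__, __GNUC_MINOR__, __GNUC_PATCHLEVEL__) < \
	EXAMPLE_MIN_GCC_VERSION
#warning "GCC older than 5.4.0 may generate sub-optimal copy code."
#endif
#endif

/* Runtime form of the same comparison, handy for illustration. */
static inline int
example_gcc_is_recent_enough(int major, int minor, int patch)
{
	return EXAMPLE_VERSION(major, minor, patch) >= EXAMPLE_MIN_GCC_VERSION;
}
```

Under this encoding GCC 7.2.0 passes the check and GCC 4.8.5 does not, matching the two versions named in the commit message.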
> +
> +static __rte_always_inline void rte_mov16(uint8_t *dst, const uint8_t *src)
static __rte_always_inline
void rte_mov16(uint8_t *dst, const uint8_t *src)
> +{
> + __uint128_t *dst128 = (__uint128_t *)dst;
> + const __uint128_t *src128 = (const __uint128_t *)src;
> + *dst128 = *src128;
> +}
> +
> +static __rte_always_inline void rte_mov32(uint8_t *dst, const uint8_t *src)
See above
> +{
> + __uint128_t *dst128 = (__uint128_t *)dst;
> + const __uint128_t *src128 = (const __uint128_t *)src;
> + const __uint128_t x0 = src128[0], x1 = src128[1];
> + dst128[0] = x0;
> + dst128[1] = x1;
> +}
> +
> +static __rte_always_inline void rte_mov48(uint8_t *dst, const uint8_t *src)
> +{
See above
> + __uint128_t *dst128 = (__uint128_t *)dst;
> + const __uint128_t *src128 = (const __uint128_t *)src;
> + const __uint128_t x0 = src128[0], x1 = src128[1], x2 = src128[2];
> + dst128[0] = x0;
> + dst128[1] = x1;
> + dst128[2] = x2;
> +}
> +
> +static __rte_always_inline void rte_mov64(uint8_t *dst, const uint8_t *src)
> +{
See above
> + __uint128_t *dst128 = (__uint128_t *)dst;
> + const __uint128_t *src128 = (const __uint128_t *)src;
> + const __uint128_t
> + x0 = src128[0], x1 = src128[1], x2 = src128[2], x3 = src128[3];
> + dst128[0] = x0;
> + dst128[1] = x1;
> + dst128[2] = x2;
> + dst128[3] = x3;
> +}
> +
> +static __rte_always_inline void rte_mov128(uint8_t *dst, const uint8_t *src)
> +{
See above
> + __uint128_t *dst128 = (__uint128_t *)dst;
> + const __uint128_t *src128 = (const __uint128_t *)src;
> + /* Keep below declaration & copy sequence for optimized instructions */
> + const __uint128_t
> + x0 = src128[0], x1 = src128[1], x2 = src128[2], x3 = src128[3];
> + dst128[0] = x0;
> + __uint128_t x4 = src128[4];
> + dst128[1] = x1;
> + __uint128_t x5 = src128[5];
> + dst128[2] = x2;
> + __uint128_t x6 = src128[6];
> + dst128[3] = x3;
> + __uint128_t x7 = src128[7];
> + dst128[4] = x4;
> + dst128[5] = x5;
> + dst128[6] = x6;
> + dst128[7] = x7;
> +}
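For reference, the fixed-size helpers all follow one pattern: load every 16-byte chunk into a local first, then store. Below is a self-contained sketch of the 64-byte case that can be compiled outside DPDK. It assumes a compiler that provides __uint128_t (GCC or Clang on a 64-bit target) and buffers that are suitably aligned for that type; the names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name; follows the shape of the quoted rte_mov64(). */
static inline void
example_mov64(uint8_t *dst, const uint8_t *src)
{
	__uint128_t *dst128 = (__uint128_t *)dst;
	const __uint128_t *src128 = (const __uint128_t *)src;
	/* Issue all four 16-byte loads before the first store. */
	const __uint128_t
		x0 = src128[0], x1 = src128[1], x2 = src128[2], x3 = src128[3];

	dst128[0] = x0;
	dst128[1] = x1;
	dst128[2] = x2;
	dst128[3] = x3;
}

/* Self-check on buffers declared as __uint128_t, so alignment is met. */
static inline int
example_mov64_check(void)
{
	__uint128_t src[4], dst[4] = {0};
	uint8_t *s = (uint8_t *)src;
	const uint8_t *d = (const uint8_t *)dst;
	size_t i;

	for (i = 0; i < sizeof(src); i++)
		s[i] = (uint8_t)(i * 3 + 1);
	example_mov64((uint8_t *)dst, (const uint8_t *)src);
	for (i = 0; i < sizeof(dst); i++)
		assert(d[i] == s[i]);
	return 1;
}
```

Whether the compiler turns this into paired load/store instructions depends on the compiler version, which is why the patch comments ask to keep the declaration and copy order.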
> +
> +static __rte_always_inline void rte_mov256(uint8_t *dst, const uint8_t *src)
> +{
See above