From: Herbert Guan <Herbert.Guan@arm.com>
To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Cc: "dev@dpdk.org" <dev@dpdk.org>, nd <nd@arm.com>
Subject: Re: [dpdk-dev] [PATCH v3] arch/arm: optimization for memcpy on AArch64
Date: Tue, 19 Dec 2017 05:33:19 +0000 [thread overview]
Message-ID: <HE1PR08MB28096F9BAA9CFB42E397D18C860F0@HE1PR08MB2809.eurprd08.prod.outlook.com> (raw)
In-Reply-To: <20171218074349.GA16659@jerin>
Jerin,
Thanks for review and comments. Please find my feedbacks below inline.
> -----Original Message-----
> From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com]
> Sent: Monday, December 18, 2017 15:44
> To: Herbert Guan <Herbert.Guan@arm.com>
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v3] arch/arm: optimization for memcpy on AArch64
>
> -----Original Message-----
> > Date: Mon, 18 Dec 2017 10:54:24 +0800
> > From: Herbert Guan <herbert.guan@arm.com>
> > To: dev@dpdk.org, jerin.jacob@caviumnetworks.com
> > CC: Herbert Guan <herbert.guan@arm.com>
> > Subject: [PATCH v3] arch/arm: optimization for memcpy on AArch64
> > X-Mailer: git-send-email 1.8.3.1
> >
> > Signed-off-by: Herbert Guan <herbert.guan@arm.com>
> > ---
> > config/common_armv8a_linuxapp | 6 +
> > .../common/include/arch/arm/rte_memcpy_64.h | 292
> +++++++++++++++++++++
> > 2 files changed, 298 insertions(+)
> >
> > diff --git a/config/common_armv8a_linuxapp
> > b/config/common_armv8a_linuxapp index 6732d1e..8f0cbed 100644
> > --- a/config/common_armv8a_linuxapp
> > +++ b/config/common_armv8a_linuxapp
> > @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y # to address
> minimum
> > DMA alignment across all arm64 implementations.
> > CONFIG_RTE_CACHE_LINE_SIZE=128
> >
> > +# Accelarate rte_memcpy. Be sure to run unit test to determine the
>
> Additional space before "Be". Rather than just mentioning the unit test,
> mention the absolute test case name(memcpy_perf_autotest)
>
> > +# best threshold in code. Refer to notes in source file
>
> Additional space before "Refer"
Fixed in new version.
>
> > +# (lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h) for more
> #
> > +info.
> > +CONFIG_RTE_ARCH_ARM64_MEMCPY=n
> > +
> > CONFIG_RTE_LIBRTE_FM10K_PMD=n
> > CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
> > CONFIG_RTE_LIBRTE_AVP_PMD=n
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > index b80d8ba..1ea275d 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > @@ -42,6 +42,296 @@
> >
> > #include "generic/rte_memcpy.h"
> >
> > +#ifdef RTE_ARCH_ARM64_MEMCPY
>
> See the comment below at "(GCC_VERSION < 50400)" check
>
> > +#include <rte_common.h>
> > +#include <rte_branch_prediction.h>
> > +
> > +/*
> > + * The memory copy performance differs on different AArch64 micro-
> architectures.
> > + * And the most recent glibc (e.g. 2.23 or later) can provide a
> > +better memcpy()
> > + * performance compared to old glibc versions. It's always suggested
> > +to use a
> > + * more recent glibc if possible, from which the entire system can get
> benefit.
> > + *
> > + * This implementation improves memory copy on some aarch64
> > +micro-architectures,
> > + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is
> > +disabled by
> > + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate.
> > +It's not
> > + * always providing better performance than memcpy() so users need to
> > +run unit
> > + * test "memcpy_perf_autotest" and customize parameters in
> > +customization section
> > + * below for best performance.
> > + *
> > + * Compiler version will also impact the rte_memcpy() performance.
> > +It's observed
> > + * on some platforms and with the same code, GCC 7.2.0 compiled
> > +binaries can
> > + * provide better performance than GCC 4.8.5 compiled binaries.
> > + */
> > +
> > +/**************************************
> > + * Beginning of customization section
> > +**************************************/
> > +#define ALIGNMENT_MASK 0x0F
>
> This symbol will be included in public rte_memcpy.h version for arm64 DPDK
> build.
> Please use RTE_ prefix to avoid multi
> definition.(RTE_ARCH_ARM64_ALIGN_MASK ? or any shorter name)
>
Changed to RTE_AARCH64_ALIGN_MASK in new version.
> > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > +/* Only src unalignment will be treaed as unaligned copy */ #define
> > +IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> #else
> > +/* Both dst and src unalignment will be treated as unaligned copy */
> > +#define IS_UNALIGNED_COPY(dst, src) \
> > + (((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> #endif
> > +
> > +
> > +/*
> > + * If copy size is larger than threshold, memcpy() will be used.
> > + * Run "memcpy_perf_autotest" to determine the proper threshold.
> > + */
> > +#define ALIGNED_THRESHOLD ((size_t)(0xffffffff))
> > +#define UNALIGNED_THRESHOLD ((size_t)(0xffffffff))
>
> Same as above comment.
Added RTE_AARCH64_ prefix in new version.
>
> > +
> > +/**************************************
> > + * End of customization section
> > + **************************************/
> > +#ifdef RTE_TOOLCHAIN_GCC
> > +#if (GCC_VERSION < 50400)
> > +#warning "The GCC version is quite old, which may result in sub-optimal \
> > +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> > +be used."
>
> Even though it is warning, based on where this file get included it will
> generate error(see below)
> How about, selecting optimized memcpy when RTE_ARCH_ARM64_MEMCPY
> && if (GCC_VERSION >= 50400) ?
>
Fully understand that. While I'm not tending to make it 'silent'. GCC 4.x is just
quite old and may not provide best optimized code -- not only for DPDK app.
We can provide another option RTE_AARCH64_SKIP_GCC_VERSION_CHECK to allow
skipping the GCC version check. How do you think?
> CC eal_common_options.o
> In file included from
> /home/jerin/dpdk.org/build/include/rte_memcpy.h:37:0,from
> /home/jerin/dpdk.org/lib/librte_eal/common/eal_common_options.c:53:
> /home/jerin/dpdk.org/build/include/rte_memcpy_64.h:93:2: error:
> #warning
> ^^^^^^^^
> "The GCC version is quite old, which may result in sub-optimal
> performance of the compiled code. It is suggested that at least GCC
> 5.4.0 be used." [-Werror=cpp]
> ^^^^^^^^^^^^^^
> #warning "The GCC version is quite old, which may result in sub-optimal
> \
> ^
>
>
> > +#endif
> > +#endif
> > +
> > +
> > +#if RTE_CACHE_LINE_SIZE >= 128
>
> We can remove this conditional compilation check. ie. It can get compiled for
> both cases,
> But it will be used only when RTE_CACHE_LINE_SIZE >= 128
>
OK, it'll be removed in the new version.
> > +static __rte_always_inline void
> > +rte_memcpy_ge16_lt128
> > +(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> > +{
> > + if (n < 64) {
> > + if (n == 16) {
> > + rte_mov16(dst, src);
> > + } else if (n <= 32) {
> > + rte_mov16(dst, src);
> > + rte_mov16(dst - 16 + n, src - 16 + n);
> > + } else if (n <= 48) {
> > + rte_mov32(dst, src);
> > + rte_mov16(dst - 16 + n, src - 16 + n);
> > + } else {
> > + rte_mov48(dst, src);
> > + rte_mov16(dst - 16 + n, src - 16 + n);
> > + }
> > + } else {
> > + rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > + if (n > 48 + 64)
> > + rte_mov64(dst - 64 + n, src - 64 + n);
> > + else if (n > 32 + 64)
> > + rte_mov48(dst - 48 + n, src - 48 + n);
> > + else if (n > 16 + 64)
> > + rte_mov32(dst - 32 + n, src - 32 + n);
> > + else if (n > 64)
> > + rte_mov16(dst - 16 + n, src - 16 + n);
> > + }
> > +}
> > +
> > +
> > +#else
>
> Same as above comment.
>
> > +static __rte_always_inline void
> > +rte_memcpy_ge16_lt64
> > +(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> > +{
> > + if (n == 16) {
> > + rte_mov16(dst, src);
> > + } else if (n <= 32) {
> > + rte_mov16(dst, src);
> > + rte_mov16(dst - 16 + n, src - 16 + n);
> > + } else if (n <= 48) {
> > + rte_mov32(dst, src);
> > + rte_mov16(dst - 16 + n, src - 16 + n);
> > + } else {
> > + rte_mov48(dst, src);
> > + rte_mov16(dst - 16 + n, src - 16 + n);
> > + }
> > +}
> > +
> > +
> > +static __rte_always_inline void *
> > +rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> > +{
> > + if (n < 16) {
> > + rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
> > + return dst;
> > + }
> > +#if RTE_CACHE_LINE_SIZE >= 128
> > + if (n < 128) {
> > + rte_memcpy_ge16_lt128((uint8_t *)dst, (const uint8_t *)src,
> n);
> > + return dst;
> > + }
> > +#else
> > + if (n < 64) {
> > + rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src,
> n);
> > + return dst;
> > + }
> > +#endif
> > + __builtin_prefetch(src, 0, 0);
> > + __builtin_prefetch(dst, 1, 0);
> > + if (likely(
> > + (!IS_UNALIGNED_COPY(dst, src) && n <=
> ALIGNED_THRESHOLD)
> > + || (IS_UNALIGNED_COPY(dst, src) && n <=
> UNALIGNED_THRESHOLD)
> > + )) {
> > +#if RTE_CACHE_LINE_SIZE >= 128
> > + rte_memcpy_ge128((uint8_t *)dst, (const uint8_t *)src, n);
> > +#else
> > + rte_memcpy_ge64((uint8_t *)dst, (const uint8_t *)src, n);
> > +#endif
>
> Can we remove this #ifdef clutter(We have two of them in a same function)?
>
> I suggest to remove this clutter by having the separate routine. ie.
> 1)
> #if RTE_CACHE_LINE_SIZE >= 128
> rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> {
> }
> #else
> rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> {
> }
> #endif
>
> 2) Have separate inline function to resolve following logic and used it
> in both variants.
>
> if (likely(
> (!IS_UNALIGNED_COPY(dst, src) && n <=
> ALIGNED_THRESHOLD)
> || (IS_UNALIGNED_COPY(dst, src) && n <=
> UNALIGNED_THRESHOLD)
> )) {
>
Implemented as suggested in new version.
> With above changes:
> Acked-by: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Thanks,
Herbert Guan
next prev parent reply other threads:[~2017-12-19 5:33 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-11-27 7:49 [dpdk-dev] [PATCH] " Herbert Guan
2017-11-29 12:31 ` Jerin Jacob
2017-12-03 12:37 ` Herbert Guan
2017-12-15 4:06 ` Jerin Jacob
2017-12-18 2:51 ` Herbert Guan
2017-12-18 4:17 ` Jerin Jacob
2017-12-02 7:33 ` Pavan Nikhilesh Bhagavatula
2017-12-03 12:38 ` Herbert Guan
2017-12-03 14:20 ` Pavan Nikhilesh Bhagavatula
2017-12-04 7:14 ` Herbert Guan
2017-12-05 6:02 ` [dpdk-dev] [PATCH v2] " Herbert Guan
2017-12-18 2:54 ` [dpdk-dev] [PATCH v3] " Herbert Guan
2017-12-18 7:43 ` Jerin Jacob
2017-12-19 5:33 ` Herbert Guan [this message]
2017-12-19 7:24 ` Jerin Jacob
2017-12-21 5:33 ` [dpdk-dev] [PATCH v4] " Herbert Guan
2018-01-03 13:35 ` Jerin Jacob
2018-01-04 10:23 ` Herbert Guan
2018-01-04 10:20 ` [dpdk-dev] [PATCH v5] " Herbert Guan
2018-01-12 17:03 ` Thomas Monjalon
2018-01-15 10:57 ` Herbert Guan
2018-01-15 11:37 ` Thomas Monjalon
2018-01-18 23:54 ` Thomas Monjalon
2018-01-19 6:16 ` [dpdk-dev] 答复: " Herbert Guan
2018-01-19 6:10 ` [dpdk-dev] [PATCH v6] arch/arm: optimization for memcpy on ARM64 Herbert Guan
2018-01-20 16:21 ` Thomas Monjalon
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=HE1PR08MB28096F9BAA9CFB42E397D18C860F0@HE1PR08MB2809.eurprd08.prod.outlook.com \
--to=herbert.guan@arm.com \
--cc=dev@dpdk.org \
--cc=jerin.jacob@caviumnetworks.com \
--cc=nd@arm.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).