From: Herbert Guan <Herbert.Guan@arm.com>
To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Cc: dev@dpdk.org, nd
Subject: Re: [dpdk-dev] [PATCH v3] arch/arm: optimization for memcpy on AArch64
Date: Tue, 19 Dec 2017 05:33:19 +0000
In-Reply-To: <20171218074349.GA16659@jerin>
Jerin,

Thanks for the review and comments.  Please find my feedback below inline.

> -----Original Message-----
> From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com]
> Sent: Monday, December 18, 2017 15:44
> To: Herbert Guan
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v3] arch/arm: optimization for memcpy on AArch64
>
> -----Original Message-----
> > Date: Mon, 18 Dec 2017 10:54:24 +0800
> > From: Herbert Guan
> > To: dev@dpdk.org, jerin.jacob@caviumnetworks.com
> > CC: Herbert Guan
> > Subject: [PATCH v3] arch/arm: optimization for memcpy on AArch64
> > X-Mailer: git-send-email 1.8.3.1
> >
> > Signed-off-by: Herbert Guan
> > ---
> >  config/common_armv8a_linuxapp                    |   6 +
> >  .../common/include/arch/arm/rte_memcpy_64.h      | 292 +++++++++++++++++++++
> >  2 files changed, 298 insertions(+)
> >
> > diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
> > index 6732d1e..8f0cbed 100644
> > --- a/config/common_armv8a_linuxapp
> > +++ b/config/common_armv8a_linuxapp
> > @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y
> >  # to address minimum DMA alignment across all arm64 implementations.
> >  CONFIG_RTE_CACHE_LINE_SIZE=128
> >
> > +# Accelarate rte_memcpy.  Be sure to run unit test to determine the
>
> Additional space before "Be".  Rather than just mentioning the unit test,
> mention the absolute test case name (memcpy_perf_autotest).
>
> > +# best threshold in code.  Refer to notes in source file
>
> Additional space before "Refer".

Fixed in new version.

> > +# (lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h) for more
> > +# info.
> > +CONFIG_RTE_ARCH_ARM64_MEMCPY=n
> > +
> >  CONFIG_RTE_LIBRTE_FM10K_PMD=n
> >  CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
> >  CONFIG_RTE_LIBRTE_AVP_PMD=n
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > index b80d8ba..1ea275d 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > @@ -42,6 +42,296 @@
> >
> >  #include "generic/rte_memcpy.h"
> >
> > +#ifdef RTE_ARCH_ARM64_MEMCPY
>
> See the comment below at the "(GCC_VERSION < 50400)" check.
>
> > +#include
> > +#include
> > +
> > +/*
> > + * The memory copy performance differs on different AArch64 micro-architectures.
> > + * And the most recent glibc (e.g. 2.23 or later) can provide a better memcpy()
> > + * performance compared to old glibc versions. It's always suggested to use a
> > + * more recent glibc if possible, from which the entire system can get benefit.
> > + *
> > + * This implementation improves memory copy on some aarch64 micro-architectures,
> > + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is disabled by
> > + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate. It's not
> > + * always providing better performance than memcpy() so users need to run unit
> > + * test "memcpy_perf_autotest" and customize parameters in customization section
> > + * below for best performance.
> > + *
> > + * Compiler version will also impact the rte_memcpy() performance. It's observed
> > + * on some platforms and with the same code, GCC 7.2.0 compiled binaries can
> > + * provide better performance than GCC 4.8.5 compiled binaries.
> > + */
> > +
> > +/**************************************
> > + * Beginning of customization section
> > +**************************************/
> > +#define ALIGNMENT_MASK 0x0F
>
> This symbol will be included in the public rte_memcpy.h version of the arm64
> DPDK build.  Please use the RTE_ prefix to avoid multiple definitions
> (RTE_ARCH_ARM64_ALIGN_MASK? or any shorter name).

Changed to RTE_AARCH64_ALIGN_MASK in new version.

> > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > +/* Only src unalignment will be treaed as unaligned copy */
> > +#define IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> > +#else
> > +/* Both dst and src unalignment will be treated as unaligned copy */
> > +#define IS_UNALIGNED_COPY(dst, src) \
> > +	(((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> > +#endif
> > +
> > +
> > +/*
> > + * If copy size is larger than threshold, memcpy() will be used.
> > + * Run "memcpy_perf_autotest" to determine the proper threshold.
> > + */
> > +#define ALIGNED_THRESHOLD       ((size_t)(0xffffffff))
> > +#define UNALIGNED_THRESHOLD     ((size_t)(0xffffffff))
>
> Same as above comment.

Added RTE_AARCH64_ prefix in new version.
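For illustration, the renamed constants would look roughly like the sketch
below.  This is a sketch only: apart from RTE_AARCH64_ALIGN_MASK, the names
(RTE_AARCH64_ALIGNED_THRESHOLD, RTE_AARCH64_UNALIGNED_THRESHOLD,
RTE_AARCH64_IS_UNALIGNED_COPY, rte_aarch64_use_inline_copy) are tentative,
and the small helper only shows how the thresholds are meant to be applied,
in line with the separate inline function you suggest further below.

/* Illustrative sketch only, not the final code. */
#define RTE_AARCH64_ALIGN_MASK          0x0F
#define RTE_AARCH64_ALIGNED_THRESHOLD   ((size_t)(0xffffffff))
#define RTE_AARCH64_UNALIGNED_THRESHOLD ((size_t)(0xffffffff))

/* Both dst and src unalignment are treated as an unaligned copy here
 * (the non-strict variant would check dst only). */
#define RTE_AARCH64_IS_UNALIGNED_COPY(dst, src) \
	(((uintptr_t)(dst) | (uintptr_t)(src)) & RTE_AARCH64_ALIGN_MASK)

/* Return non-zero when the inline copy should be used; larger copies
 * fall back to the library memcpy() at the call site. */
static __rte_always_inline int
rte_aarch64_use_inline_copy(const void *dst, const void *src, size_t n)
{
	if (RTE_AARCH64_IS_UNALIGNED_COPY(dst, src))
		return n <= RTE_AARCH64_UNALIGNED_THRESHOLD;
	return n <= RTE_AARCH64_ALIGNED_THRESHOLD;
}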
> > +
> > +/**************************************
> > + * End of customization section
> > + **************************************/
> > +#ifdef RTE_TOOLCHAIN_GCC
> > +#if (GCC_VERSION < 50400)
> > +#warning "The GCC version is quite old, which may result in sub-optimal \
> > +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> > +be used."
>
> Even though it is a warning, based on where this file gets included it will
> generate an error (see below).  How about selecting the optimized memcpy only
> when RTE_ARCH_ARM64_MEMCPY && (GCC_VERSION >= 50400)?

Fully understood.  However, I'd prefer not to make it silent.  GCC 4.x is just
quite old and may not generate the best optimized code -- and not only for DPDK
applications.  We can provide another option, RTE_AARCH64_SKIP_GCC_VERSION_CHECK,
to allow skipping the GCC version check.  What do you think?
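Roughly, the check would then become something like the sketch below (the
option name is the one proposed above and is still tentative; this is only an
illustration, not the final code):

/*
 * Sketch only: keep the warning by default, but let users who knowingly
 * build with an old GCC define RTE_AARCH64_SKIP_GCC_VERSION_CHECK to
 * skip the version check.
 */
#if defined(RTE_TOOLCHAIN_GCC) && !defined(RTE_AARCH64_SKIP_GCC_VERSION_CHECK)
#if (GCC_VERSION < 50400)
#warning "The GCC version is quite old, which may result in sub-optimal \
performance of the compiled code. It is suggested that at least GCC 5.4.0 \
be used."
#endif
#endif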
> CC eal_common_options.o
> In file included from /home/jerin/dpdk.org/build/include/rte_memcpy.h:37:0,
>                  from /home/jerin/dpdk.org/lib/librte_eal/common/eal_common_options.c:53:
> /home/jerin/dpdk.org/build/include/rte_memcpy_64.h:93:2: error: #warning
>                                                                 ^^^^^^^^
> "The GCC version is quite old, which may result in sub-optimal
> performance of the compiled code. It is suggested that at least GCC
> 5.4.0 be used." [-Werror=cpp]
>                 ^^^^^^^^^^^^^^
>  #warning "The GCC version is quite old, which may result in sub-optimal \
>  ^
>
> > +#endif
> > +#endif
> > +
> > +
> > +#if RTE_CACHE_LINE_SIZE >= 128
>
> We can remove this conditional compilation check, i.e. it can get compiled for
> both cases, but it will be used only when RTE_CACHE_LINE_SIZE >= 128.

OK, it'll be removed in the new version.

> > +static __rte_always_inline void
> > +rte_memcpy_ge16_lt128
> > +(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> > +{
> > +	if (n < 64) {
> > +		if (n == 16) {
> > +			rte_mov16(dst, src);
> > +		} else if (n <= 32) {
> > +			rte_mov16(dst, src);
> > +			rte_mov16(dst - 16 + n, src - 16 + n);
> > +		} else if (n <= 48) {
> > +			rte_mov32(dst, src);
> > +			rte_mov16(dst - 16 + n, src - 16 + n);
> > +		} else {
> > +			rte_mov48(dst, src);
> > +			rte_mov16(dst - 16 + n, src - 16 + n);
> > +		}
> > +	} else {
> > +		rte_mov64((uint8_t *)dst, (const uint8_t *)src);
> > +		if (n > 48 + 64)
> > +			rte_mov64(dst - 64 + n, src - 64 + n);
> > +		else if (n > 32 + 64)
> > +			rte_mov48(dst - 48 + n, src - 48 + n);
> > +		else if (n > 16 + 64)
> > +			rte_mov32(dst - 32 + n, src - 32 + n);
> > +		else if (n > 64)
> > +			rte_mov16(dst - 16 + n, src - 16 + n);
> > +	}
> > +}
> > +
> > +
> > +#else
>
> Same as above comment.
>
> > +static __rte_always_inline void
> > +rte_memcpy_ge16_lt64
> > +(uint8_t *restrict dst, const uint8_t *restrict src, size_t n)
> > +{
> > +	if (n == 16) {
> > +		rte_mov16(dst, src);
> > +	} else if (n <= 32) {
> > +		rte_mov16(dst, src);
> > +		rte_mov16(dst - 16 + n, src - 16 + n);
> > +	} else if (n <= 48) {
> > +		rte_mov32(dst, src);
> > +		rte_mov16(dst - 16 + n, src - 16 + n);
> > +	} else {
> > +		rte_mov48(dst, src);
> > +		rte_mov16(dst - 16 + n, src - 16 + n);
> > +	}
> > +}
> > +
> > +
> > +static __rte_always_inline void *
> > +rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> > +{
> > +	if (n < 16) {
> > +		rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
> > +		return dst;
> > +	}
> > +#if RTE_CACHE_LINE_SIZE >= 128
> > +	if (n < 128) {
> > +		rte_memcpy_ge16_lt128((uint8_t *)dst, (const uint8_t *)src, n);
> > +		return dst;
> > +	}
> > +#else
> > +	if (n < 64) {
> > +		rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
> > +		return dst;
> > +	}
> > +#endif
> > +	__builtin_prefetch(src, 0, 0);
> > +	__builtin_prefetch(dst, 1, 0);
> > +	if (likely(
> > +		(!IS_UNALIGNED_COPY(dst, src) && n <= ALIGNED_THRESHOLD)
> > +		|| (IS_UNALIGNED_COPY(dst, src) && n <= UNALIGNED_THRESHOLD)
> > +		)) {
> > +#if RTE_CACHE_LINE_SIZE >= 128
> > +		rte_memcpy_ge128((uint8_t *)dst, (const uint8_t *)src, n);
> > +#else
> > +		rte_memcpy_ge64((uint8_t *)dst, (const uint8_t *)src, n);
> > +#endif
>
> Can we remove this #ifdef clutter (we have two of them in the same function)?
>
> I suggest removing the clutter by having separate routines, i.e.:
>
> 1)
> #if RTE_CACHE_LINE_SIZE >= 128
> rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> {
> }
> #else
> rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> {
> }
> #endif
>
> 2) Have a separate inline function to resolve the following logic and use it
> in both variants:
>
> 	if (likely(
> 		(!IS_UNALIGNED_COPY(dst, src) && n <= ALIGNED_THRESHOLD)
> 		|| (IS_UNALIGNED_COPY(dst, src) && n <= UNALIGNED_THRESHOLD)
> 		)) {

Implemented as suggested in new version.

> With above changes:
> Acked-by: Jerin Jacob


Thanks,
Herbert Guan