From: Herbert Guan <Herbert.Guan@arm.com>
To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
Cc: dev@dpdk.org, nd
Date: Thu, 4 Jan 2018 10:23:49 +0000
In-Reply-To: <20180103133513.GA30368@jerin>
Thread-Topic: [PATCH v4] arch/arm: optimization for memcpy on AArch64
Subject: Re: [dpdk-dev] [PATCH v4] arch/arm: optimization for memcpy on AArch64
List-Id: DPDK patches and discussions <dev@dpdk.org>

Thanks for the review and comments, Jerin. A new version has been sent out for review with your comments applied and your Acked-by added.
Best regards,
Herbert

> -----Original Message-----
> From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com]
> Sent: Wednesday, January 3, 2018 21:35
> To: Herbert Guan
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v4] arch/arm: optimization for memcpy on AArch64
>
> -----Original Message-----
> > Date: Thu, 21 Dec 2017 13:33:47 +0800
> > From: Herbert Guan
> > To: dev@dpdk.org, jerin.jacob@caviumnetworks.com
> > CC: Herbert Guan
> > Subject: [PATCH v4] arch/arm: optimization for memcpy on AArch64
> > X-Mailer: git-send-email 1.8.3.1
> >
> > This patch provides an option to do rte_memcpy() using the 'restrict'
> > qualifier, which can induce GCC to do optimizations by using more
> > efficient instructions, providing some performance gain over memcpy()
> > on some AArch64 platforms/environments.
> >
> > The memory copy performance differs between different AArch64
> > platforms. And a more recent glibc (e.g. 2.23 or later)
> > can provide a better memcpy() performance compared to old glibc
> > versions. It's always suggested to use a more recent glibc if
> > possible, from which the entire system can benefit. If for some
> > reason an old glibc has to be used, this patch is provided as an
> > alternative.
> >
> > This implementation can improve memory copy on some AArch64
> > platforms, when an old glibc (e.g. 2.19, 2.17...) is being used.
> > It is disabled by default and needs "RTE_ARCH_ARM64_MEMCPY"
> > defined to activate. It does not always provide better performance
> > than memcpy(), so users need to run the DPDK unit test
> > "memcpy_perf_autotest" and customize parameters in the "customization
> > section" in rte_memcpy_64.h for best performance.
> >
> > Compiler version will also impact the rte_memcpy() performance.
> > It's observed on some platforms that, with the same code, a GCC 7.2.0
> > compiled binary can provide better performance than GCC 4.8.5. It's
> > suggested to use GCC 5.4.0 or later.
> >
> > Signed-off-by: Herbert Guan
>
> Looks good. Find inline request for some minor changes.
> Feel free to add my Acked-by with those changes.
>
> > ---
> >  config/common_armv8a_linuxapp                      |   6 +
> >  .../common/include/arch/arm/rte_memcpy_64.h        | 287 +++++++++++++++++++++
> >  2 files changed, 293 insertions(+)
> >
> > diff --git a/config/common_armv8a_linuxapp b/config/common_armv8a_linuxapp
> > index 6732d1e..8f0cbed 100644
> > --- a/config/common_armv8a_linuxapp
> > +++ b/config/common_armv8a_linuxapp
> > @@ -44,6 +44,12 @@ CONFIG_RTE_FORCE_INTRINSICS=y
> >  # to address minimum DMA alignment across all arm64 implementations.
> >  CONFIG_RTE_CACHE_LINE_SIZE=128
> >
> > +# Accelerate rte_memcpy. Be sure to run the unit test to determine the
> > +# best threshold in code. Refer to notes in the source file
> > +# (lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h) for more
> > +# info.
> > +CONFIG_RTE_ARCH_ARM64_MEMCPY=n
> > +
> >  CONFIG_RTE_LIBRTE_FM10K_PMD=n
> >  CONFIG_RTE_LIBRTE_SFC_EFX_PMD=n
> >  CONFIG_RTE_LIBRTE_AVP_PMD=n
> > diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > index b80d8ba..b269f34 100644
> > --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> > @@ -42,6 +42,291 @@
> >
> >  #include "generic/rte_memcpy.h"
> >
> > +#ifdef RTE_ARCH_ARM64_MEMCPY
> > +#include
> > +#include
> > +
> > +/*
> > + * The memory copy performance differs on different AArch64
> > + * micro-architectures. And the most recent glibc (e.g. 2.23 or later)
> > + * can provide a better memcpy() performance compared to old glibc
> > + * versions. It's always suggested to use a more recent glibc if
> > + * possible, from which the entire system can benefit.
> > + *
> > + * This implementation improves memory copy on some aarch64
> > + * micro-architectures, when an old glibc (e.g. 2.19, 2.17...) is being
> > + * used. It is disabled by default and needs "RTE_ARCH_ARM64_MEMCPY"
> > + * defined to activate. It does not always provide better performance
> > + * than memcpy(), so users need to run the unit test
> > + * "memcpy_perf_autotest" and customize parameters in the customization
> > + * section below for best performance.
> > + *
> > + * Compiler version will also impact the rte_memcpy() performance.
> > + * It's observed on some platforms that, with the same code, GCC 7.2.0
> > + * compiled binaries can provide better performance than GCC 4.8.5
> > + * compiled binaries.
> > + */
> > +
> > +/**************************************
> > + * Beginning of customization section
> > + **************************************/
> > +#define RTE_ARM64_MEMCPY_ALIGN_MASK 0x0F
> > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > +/* Only src unalignment will be treated as unaligned copy */
> > +#define IS_UNALIGNED_COPY(dst, src) \
>
> Better to change to RTE_ARM64_MEMCPY_IS_UNALIGNED_COPY, as it is
> defined in a public DPDK header file.
>
> > +	((uintptr_t)(dst) & RTE_ARM64_MEMCPY_ALIGN_MASK)
> > +#else
> > +/* Both dst and src unalignment will be treated as unaligned copy */
> > +#define IS_UNALIGNED_COPY(dst, src) \
> > +	(((uintptr_t)(dst) | (uintptr_t)(src)) & RTE_ARM64_MEMCPY_ALIGN_MASK)
>
> Same as above
>
> > +#endif
> > +
> > +
> > +/*
> > + * If copy size is larger than threshold, memcpy() will be used.
> > + * Run "memcpy_perf_autotest" to determine the proper threshold.
> > + */
> > +#define RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD	((size_t)(0xffffffff))
> > +#define RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD	((size_t)(0xffffffff))
> > +
> > +/*
> > + * The logic of USE_RTE_MEMCPY() can also be modified to best fit platform.
> > + */
> > +#define USE_RTE_MEMCPY(dst, src, n) \
> > +((!IS_UNALIGNED_COPY(dst, src) && n <= RTE_ARM64_MEMCPY_ALIGNED_THRESHOLD) \
> > +|| (IS_UNALIGNED_COPY(dst, src) && n <= RTE_ARM64_MEMCPY_UNALIGNED_THRESHOLD))
> > +
> > +
> > +/**************************************
> > + * End of customization section
> > + **************************************/
> > +#if defined(RTE_TOOLCHAIN_GCC) && !defined(RTE_AARCH64_SKIP_GCC_VERSION_CHECK)
>
> To maintain consistency,
> s/RTE_AARCH64_SKIP_GCC_VERSION_CHECK/RTE_ARM64_MEMCPY_SKIP_GCC_VERSION_CHECK
>
> > +#if (GCC_VERSION < 50400)
> > +#warning "The GCC version is quite old, which may result in sub-optimal \
> > +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> > +be used."
> > +#endif
> > +#endif
> > +
> > +static __rte_always_inline void rte_mov16(uint8_t *dst, const uint8_t *src)
>
> static __rte_always_inline
> void rte_mov16(uint8_t *dst, const uint8_t *src)
>
> > +{
> > +	__uint128_t *dst128 = (__uint128_t *)dst;
> > +	const __uint128_t *src128 = (const __uint128_t *)src;
> > +	*dst128 = *src128;
> > +}
> > +
> > +static __rte_always_inline void rte_mov32(uint8_t *dst, const uint8_t *src)
>
> See above
>
> > +{
> > +	__uint128_t *dst128 = (__uint128_t *)dst;
> > +	const __uint128_t *src128 = (const __uint128_t *)src;
> > +	const __uint128_t x0 = src128[0], x1 = src128[1];
> > +	dst128[0] = x0;
> > +	dst128[1] = x1;
> > +}
> > +
> > +static __rte_always_inline void rte_mov48(uint8_t *dst, const uint8_t *src)
> > +{
>
> See above
>
> > +	__uint128_t *dst128 = (__uint128_t *)dst;
> > +	const __uint128_t *src128 = (const __uint128_t *)src;
> > +	const __uint128_t x0 = src128[0], x1 = src128[1], x2 = src128[2];
> > +	dst128[0] = x0;
> > +	dst128[1] = x1;
> > +	dst128[2] = x2;
> > +}
> > +
> > +static __rte_always_inline void rte_mov64(uint8_t *dst, const uint8_t *src)
> > +{
>
> See above
>
> > +	__uint128_t *dst128 = (__uint128_t *)dst;
> > +	const __uint128_t *src128 = (const __uint128_t *)src;
> > +	const __uint128_t
> > +	x0 = src128[0], x1 = src128[1], x2 = src128[2], x3 = src128[3];
> > +	dst128[0] = x0;
> > +	dst128[1] = x1;
> > +	dst128[2] = x2;
> > +	dst128[3] = x3;
> > +}
> > +
> > +static __rte_always_inline void rte_mov128(uint8_t *dst, const uint8_t *src)
> > +{
>
> See above
>
> > +	__uint128_t *dst128 = (__uint128_t *)dst;
> > +	const __uint128_t *src128 = (const __uint128_t *)src;
> > +	/* Keep below declaration & copy sequence for optimized instructions */
> > +	const __uint128_t
> > +	x0 = src128[0], x1 = src128[1], x2 = src128[2], x3 = src128[3];
> > +	dst128[0] = x0;
> > +	__uint128_t x4 = src128[4];
> > +	dst128[1] = x1;
> > +	__uint128_t x5 = src128[5];
> > +	dst128[2] = x2;
> > +	__uint128_t x6 = src128[6];
> > +	dst128[3] = x3;
> > +	__uint128_t x7 = src128[7];
> > +	dst128[4] = x4;
> > +	dst128[5] = x5;
> > +	dst128[6] = x6;
> > +	dst128[7] = x7;
> > +}
> > +
> > +static __rte_always_inline void rte_mov256(uint8_t *dst, const uint8_t *src)
> > +{
>
> See above