From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Sat, 2 Dec 2017 13:03:02 +0530
From: Pavan Nikhilesh Bhagavatula
To: Herbert Guan, jianbo.liu@arm.com
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH] arch/arm: optimization for memcpy on AArch64
Message-ID: <20171202073300.yozet72nnvlwrkgj@Pavan-LT>
References: <1511768985-21639-1-git-send-email-herbert.guan@arm.com>
In-Reply-To: <1511768985-21639-1-git-send-email-herbert.guan@arm.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
User-Agent: NeoMutt/20170609 (1.8.3)
List-Id: DPDK patches and discussions

On Mon, Nov 27, 2017 at 03:49:45PM +0800, Herbert Guan wrote:
> This patch provides an option to do rte_memcpy() using 'restrict'
> qualifier, which can induce GCC to do optimizations by using more
> efficient instructions, providing some performance gain over memcpy()
> on some AArch64 platforms/enviroments.
>
> The memory copy performance differs between different AArch64
> platforms. And a more recent glibc (e.g. 2.23 or later)
> can provide a better memcpy() performance compared to old glibc
> versions. It's always suggested to use a more recent glibc if
> possible, from which the entire system can get benefit. If for some
> reason an old glibc has to be used, this patch is provided for an
> alternative.
>
> This implementation can improve memory copy on some AArch64
> platforms, when an old glibc (e.g. 2.19, 2.17...) is being used.
> It is disabled by default and needs "RTE_ARCH_ARM64_MEMCPY"
> defined to activate. It's not always proving better performance
> than memcpy() so users need to run DPDK unit test
> "memcpy_perf_autotest" and customize parameters in "customization
> section" in rte_memcpy_64.h for best performance.
>
> Compiler version will also impact the rte_memcpy() performance.
> It's observed on some platforms and with the same code, GCC 7.2.0
> compiled binary can provide better performance than GCC 4.8.5. It's
> suggested to use GCC 5.4.0 or later.
>
> Signed-off-by: Herbert Guan
> ---
>  .../common/include/arch/arm/rte_memcpy_64.h | 193 +++++++++++++++++++++
>  1 file changed, 193 insertions(+)
>
> diff --git a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> index b80d8ba..1f42b3c 100644
> --- a/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> +++ b/lib/librte_eal/common/include/arch/arm/rte_memcpy_64.h
> @@ -42,6 +42,197 @@
>
>  #include "generic/rte_memcpy.h"
>
> +#ifdef RTE_ARCH_ARM64_MEMCPY

There is an existing flag for arm32 that enables the NEON-based memcpy,
RTE_ARCH_ARM_NEON_MEMCPY. We could reuse it here, as restrict does the same.

> +#include
> +#include
> +
> +/*******************************************************************************
> + * The memory copy performance differs on different AArch64 micro-architectures.
> + * And the most recent glibc (e.g. 2.23 or later) can provide a better memcpy()
> + * performance compared to old glibc versions. It's always suggested to use a
> + * more recent glibc if possible, from which the entire system can get benefit.
> + *
> + * This implementation improves memory copy on some aarch64 micro-architectures,
> + * when an old glibc (e.g. 2.19, 2.17...) is being used. It is disabled by
> + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate. It's not
> + * always providing better performance than memcpy() so users need to run unit
> + * test "memcpy_perf_autotest" and customize parameters in customization section
> + * below for best performance.
> + *
> + * Compiler version will also impact the rte_memcpy() performance. It's observed
> + * on some platforms and with the same code, GCC 7.2.0 compiled binaries can
> + * provide better performance than GCC 4.8.5 compiled binaries.
> + ******************************************************************************/
> +
> +/**************************************
> + * Beginning of customization section
> + **************************************/
> +#define ALIGNMENT_MASK 0x0F
> +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> +// Only src unalignment will be treaed as unaligned copy
> +#define IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)

We can use the existing `rte_is_aligned` function instead.

> +#else
> +// Both dst and src unalignment will be treated as unaligned copy
> +#define IS_UNALIGNED_COPY(dst, src) \
> +	(((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> +#endif
> +
> +
> +// If copy size is larger than threshold, memcpy() will be used.
> +// Run "memcpy_perf_autotest" to determine the proper threshold.
> +#define ALIGNED_THRESHOLD ((size_t)(0xffffffff))
> +#define UNALIGNED_THRESHOLD ((size_t)(0xffffffff))
> +
> +
> +/**************************************
> + * End of customization section
> + **************************************/
> +#ifdef RTE_TOOLCHAIN_GCC
> +#if (GCC_VERSION < 50400)
> +#warning "The GCC version is quite old, which may result in sub-optimal \
> +performance of the compiled code. It is suggested that at least GCC 5.4.0 \
> +be used."
> +#endif
> +#endif
> +
> +static inline void __attribute__ ((__always_inline__))

Use __rte_always_inline instead.

> +rte_mov16(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__int128 * restrict dst128 = (__int128 * restrict)dst;
> +	const __int128 * restrict src128 = (const __int128 * restrict)src;
> +	*dst128 = *src128;
> +}
> +
> +static inline void __attribute__ ((__always_inline__))
> +rte_mov64(uint8_t *restrict dst, const uint8_t *restrict src)
> +{
> +	__int128 * restrict dst128 = (__int128 * restrict)dst;

ISO C does not support ‘__int128’; please use '__int128_t' or '__uint128_t'.

> +	const __int128 * restrict src128 = (const __int128 * restrict)src;
> +	dst128[0] = src128[0];
> +	dst128[1] = src128[1];
> +	dst128[2] = src128[2];
> +	dst128[3] = src128[3];
> +}
> +

Would doing this still be a benefit if the size is a compile-time constant,
i.e. when __builtin_constant_p(n) is true?
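
To make the question concrete, here is a minimal sketch of a compile-time
dispatch, assuming GCC or Clang. The names copy_runtime and copy_dispatch are
hypothetical and are not part of the patch or of DPDK; copy_runtime only
stands in for whatever run-time path the patch would provide.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a hand-written run-time copy path. */
static inline void *
copy_runtime(void *restrict dst, const void *restrict src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* Copy 16 bytes per iteration; a fixed-size memcpy() lets the
	 * compiler pick wide loads/stores without alignment assumptions. */
	while (n >= 16) {
		memcpy(d, s, 16);
		d += 16;
		s += 16;
		n -= 16;
	}
	/* Copy the remaining 0..15 bytes. */
	if (n > 0)
		memcpy(d, s, n);
	return dst;
}

/*
 * If n is known at compile time, hand the copy to memcpy() so the
 * compiler can expand it inline; otherwise take the run-time path.
 * Only the selected branch evaluates its arguments.
 */
#define copy_dispatch(dst, src, n) \
	(__builtin_constant_p(n) ? memcpy((dst), (src), (n)) \
				 : copy_runtime((dst), (src), (n)))
```

With this shape, copy_dispatch(dst, src, 64) would resolve to a plain memcpy()
call that the compiler is free to inline, while a variable length goes through
copy_runtime(). Whether that helps on a given platform still has to be
measured.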
> +
> +static inline void *__attribute__ ((__always_inline__))
> +rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> +{
> +	if (n < 16) {
> +		rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
> +		return dst;
> +	}
> +	if (n < 64) {
> +		rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
> +		return dst;
> +	}
> +	__builtin_prefetch(src, 0, 0);
> +	__builtin_prefetch(dst, 1, 0);
> +	if (likely(
> +		(!IS_UNALIGNED_COPY(dst, src) && n <= ALIGNED_THRESHOLD)
> +		|| (IS_UNALIGNED_COPY(dst, src) && n <= UNALIGNED_THRESHOLD)
> +		)) {
> +		rte_memcpy_ge64((uint8_t *)dst, (const uint8_t *)src, n);
> +		return dst;
> +	} else
> +		return memcpy(dst, src, n);
> +}
> +
> +
> +#else
>  static inline void
>  rte_mov16(uint8_t *dst, const uint8_t *src)
>  {
> @@ -80,6 +271,8 @@
>
>  #define rte_memcpy(d, s, n) memcpy((d), (s), (n))
>
> +#endif
> +
>  #ifdef __cplusplus
>  }
>  #endif
> --
> 1.8.3.1
>

Regards,
Pavan.