From mboxrd@z Thu Jan 1 00:00:00 1970
From: Herbert Guan
To: Pavan Nikhilesh Bhagavatula, Jianbo Liu
CC: "dev@dpdk.org"
Date: Mon, 4 Dec 2017 07:14:09 +0000
Subject: Re: [dpdk-dev] [PATCH] arch/arm: optimization for memcpy on AArch64
References: <1511768985-21639-1-git-send-email-herbert.guan@arm.com> <20171202073300.yozet72nnvlwrkgj@Pavan-LT> <20171203142049.m7moapt7msxrwacs@Pavan-LT>
In-Reply-To: <20171203142049.m7moapt7msxrwacs@Pavan-LT>
List-Id: DPDK patches and discussions

> -----Original Message-----
> From: Pavan Nikhilesh Bhagavatula
> [mailto:pbhagavatula@caviumnetworks.com]
> Sent: Sunday, December 3, 2017 22:21
> To: Herbert Guan; Jianbo Liu
> Cc: dev@dpdk.org
> Subject: Re: [dpdk-dev] [PATCH] arch/arm: optimization for memcpy on
> AArch64
>
> On Sun,
> Dec 03, 2017 at 12:38:35PM +0000, Herbert Guan wrote:
> > Pavan,
> >
> > Thanks for review and comments. Please find my comments inline below.
> >
> > Best regards,
> > Herbert
> >
> > > There is an existing flag for arm32 to enable neon based memcpy
> > > RTE_ARCH_ARM_NEON_MEMCPY we could reuse that here as restrict does
> > > the same.
> > >
> > This implementation is actually not using ARM NEON instructions so the
> > existing flag is not describing the option exactly. It'll be good if the
> > existing flag is "RTE_ARCH_ARM_MEMCPY" but unfortunately it might be too
> > late now to get the flags aligned.
>
> Correct me if I'm wrong but doesn't restrict tell the compiler to do SIMD
> optimization?
> Anyway can we put RTE_ARCH_ARM64_MEMCPY into config/common_base
> as CONFIG_RTE_ARCH_ARM64_MEMCPY=n so that it would be easier to
> enable/disable.
>

Using 'restrict' results in code generated with ldp/stp instructions.
These belong to the "data transfer instructions", even though they
load/store a pair of registers. 'ld1/st1' are the SIMD (NEON)
instructions.

I can add CONFIG_RTE_ARCH_ARM64_MEMCPY=n into common_armv8a_linuxapp in
the new version as you've suggested.

> > > > +#include
> > > > +#include
> > > > +
> > > >
> > > > +/******************************************************************************
> > > > + * The memory copy performance differs on different AArch64 micro-architectures.
> > > > + * And the most recent glibc (e.g. 2.23 or later) can provide a better memcpy()
> > > > + * performance compared to old glibc versions. It's always suggested to use a
> > > > + * more recent glibc if possible, from which the entire system can get benefit.
> > > > + *
> > > > + * This implementation improves memory copy on some aarch64 micro-architectures,
> > > > + * when an old glibc (e.g. 2.19, 2.17...) is being used.
> > > > + * It is disabled by
> > > > + * default and needs "RTE_ARCH_ARM64_MEMCPY" defined to activate. It's not
> > > > + * always providing better performance than memcpy() so users need to run unit
> > > > + * test "memcpy_perf_autotest" and customize parameters in customization section
> > > > + * below for best performance.
> > > > + *
> > > > + * Compiler version will also impact the rte_memcpy() performance. It's observed
> > > > + * on some platforms and with the same code, GCC 7.2.0 compiled binaries can
> > > > + * provide better performance than GCC 4.8.5 compiled binaries.
> > > > + ******************************************************************************/
> > > > +
> > > > +/**************************************
> > > > + * Beginning of customization section
> > > > + **************************************/
> > > > +#define ALIGNMENT_MASK 0x0F
> > > > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > > > +// Only src unalignment will be treated as unaligned copy
> > > > +#define IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> > >
> > > We can use existing `rte_is_aligned` function instead.
> >
> > The existing 'rte_is_aligned()' inline function is defined in a relatively
> > complex way, and there will be more instructions generated (using GCC
> > 7.2.0):
> >
> > 0000000000000000 : // using rte_is_aligned()
> >    0:   91003c01    add   x1, x0, #0xf
> >    4:   927cec21    and   x1, x1, #0xfffffffffffffff0
> >    8:   eb01001f    cmp   x0, x1
> >    c:   1a9f07e0    cset  w0, ne  // ne = any
> >   10:   d65f03c0    ret
> >   14:   d503201f    nop
> >
> > 0000000000000018 : // using above expression
> >   18:   12000c00    and   w0, w0, #0xf
> >   1c:   d65f03c0    ret
> >
> > So to get better performance, it's better to use the simple logic.
>
> Agreed, I have noticed that too. Maybe we could change rte_is_aligned to be
> simpler (not in this patch).
>
> > > Would doing this still benefit if size is compile time constant? i.e.
> > > when __builtin_constant_p(n) is true.
> > >
> > Yes, performance margin is observed if size is compile time constant on
> > some tested platforms.
>
> Sorry, I didn't get you, but which is better? If size is a compile-time
> constant, is using libc memcpy better, or is going with the restrict
> implementation better?
>
> If the former then we could do what 32bit rte_memcpy is using i.e.
>
> #define rte_memcpy(dst, src, n)              \
>         __extension__ ({                     \
>         (__builtin_constant_p(n)) ?          \
>         memcpy((dst), (src), (n)) :          \
>         rte_memcpy_func((dst), (src), (n)); })
>

Per my tests, the two cases usually move in the same direction: if the
variable-size case gets improved performance, then the compile-time
constant case will hopefully improve as well, and vice versa. The
percentage might be different. So in this patch, the property of the size
parameter (variable or compile-time constant) is not checked.

> Regards,
> Pavan.

Thanks,
Herbert
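
For reference, the two alignment checks compared in the disassembly
quoted above can be sketched in C as follows. This is only an
illustrative sketch, not code from the patch: the helper names
is_unaligned_roundup(), is_unaligned_mask() and
check_alignment_helpers() are hypothetical, and the round-up form is an
assumption inferred from the quoted add/and/cmp/cset sequence rather
than from the actual rte_is_aligned() source.

```c
#include <assert.h>
#include <stdint.h>

#define ALIGNMENT_MASK 0x0F /* value from the quoted patch: 16-byte alignment */

/* Hypothetical helper (assumption): round the address up to the next
 * 16-byte boundary and compare it with the original address. This mirrors
 * the add/and/cmp/cset sequence in the first quoted disassembly. */
static inline int is_unaligned_roundup(const void *p)
{
	uintptr_t v = (uintptr_t)p;
	uintptr_t rounded = (v + ALIGNMENT_MASK) & ~(uintptr_t)ALIGNMENT_MASK;

	return rounded != v;
}

/* Hypothetical helper: test the low four address bits directly, as the
 * IS_UNALIGNED_COPY() expression in the quoted patch does. */
static inline int is_unaligned_mask(const void *p)
{
	return ((uintptr_t)p & ALIGNMENT_MASK) != 0;
}

/* Hypothetical self-check: both helpers must agree for every address in
 * four consecutive 16-byte blocks. The addresses are never dereferenced. */
static inline void check_alignment_helpers(void)
{
	for (uintptr_t a = 0x1000; a < 0x1040; a++) {
		const void *p = (const void *)a;

		assert(is_unaligned_roundup(p) == is_unaligned_mask(p));
	}
}
```

Both helpers give the same answer; the mask form simply needs no add or
compare, which is consistent with the shorter instruction sequence in
the second disassembly. When only a truth value is needed, the "!= 0"
can be dropped as well, as in the patch's IS_UNALIGNED_COPY().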