From: Herbert Guan
To: Jerin Jacob
CC: Jianbo Liu, "dev@dpdk.org"
Date: Mon, 18 Dec 2017 02:51:19 +0000
References: <1511768985-21639-1-git-send-email-herbert.guan@arm.com> <20171129123154.GA22644@jerin> <20171215040623.GB5874@jerin>
In-Reply-To: <20171215040623.GB5874@jerin>
Subject: Re: [dpdk-dev] [PATCH] arch/arm: optimization for memcpy on AArch64
List-Id: DPDK patches and discussions

Hi Jerin,

> -----Original Message-----
> From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com]
> Sent: Friday, December 15, 2017 12:06
> To: Herbert Guan
> Cc: Jianbo Liu; dev@dpdk.org
> Subject: Re: [PATCH] arch/arm: optimization for memcpy on AArch64
>
> -----Original Message-----
> > Date: Sun, 3 Dec 2017 12:37:30 +0000
> > From: Herbert Guan
> > To: Jerin Jacob
> > CC: Jianbo Liu, "dev@dpdk.org"
> > Subject: RE: [PATCH] arch/arm: optimization for memcpy on AArch64
> >
> > Jerin,
>
> Hi Herbert,
>
> >
> > Thanks a lot for your review and comments. Please find my comments
> > below inline.
> >
> > Best regards,
> > Herbert
> >
> > > -----Original Message-----
> > > From: Jerin Jacob [mailto:jerin.jacob@caviumnetworks.com]
> > > Sent: Wednesday, November 29, 2017 20:32
> > > To: Herbert Guan
> > > Cc: Jianbo Liu; dev@dpdk.org
> > > Subject: Re: [PATCH] arch/arm: optimization for memcpy on AArch64
> > >
> > > -----Original Message-----
> > > > Date: Mon, 27 Nov 2017 15:49:45 +0800
> > > > From: Herbert Guan
> > > > To: jerin.jacob@caviumnetworks.com, jianbo.liu@arm.com, dev@dpdk.org
> > > > CC: Herbert Guan
> > > > Subject: [PATCH] arch/arm: optimization for memcpy on AArch64
> > > > X-Mailer: git-send-email 1.8.3.1
> > > > +
> > > > +/**************************************
> > > > + * Beginning of customization section
> > > > +**************************************/
> > > > +#define ALIGNMENT_MASK 0x0F
> > > > +#ifndef RTE_ARCH_ARM64_MEMCPY_STRICT_ALIGN
> > > > +// Only src unalignment will be treated as unaligned copy
> > >
> > > C++ style comments. It may generate check patch errors.
> >
> > I'll change it to use C style comments in version 2.
> >
> > > > +#define IS_UNALIGNED_COPY(dst, src) ((uintptr_t)(dst) & ALIGNMENT_MASK)
> > > > +#else
> > > > +// Both dst and src unalignment will be treated as unaligned copy
> > > > +#define IS_UNALIGNED_COPY(dst, src) \
> > > > +(((uintptr_t)(dst) | (uintptr_t)(src)) & ALIGNMENT_MASK)
> > > > +#endif
> > > > +
> > > > +
> > > > +// If copy size is larger than threshold, memcpy() will be used.
> > > > +// Run "memcpy_perf_autotest" to determine the proper threshold.
> > > > +#define ALIGNED_THRESHOLD ((size_t)(0xffffffff))
> > > > +#define UNALIGNED_THRESHOLD ((size_t)(0xffffffff))
> > >
> > > Do you see any case where this threshold is useful.
> >
> > Yes, on some platforms, and/or with some glibc versions, the glibc memcpy
> > has better performance for larger sizes (e.g., >512, >4096...). So developers
> > should run the unit test to find the best threshold. The default value of
> > 0xffffffff should be replaced with the evaluated values.
>
> OK
>
> > > > +
> > > > +static inline void *__attribute__ ((__always_inline__))i
>
> use __rte_always_inline

Applied in the V3 patch.

> > > > +rte_memcpy(void *restrict dst, const void *restrict src, size_t n)
> > > > +{
> > > > +if (n < 16) {
> > > > +rte_memcpy_lt16((uint8_t *)dst, (const uint8_t *)src, n);
> > > > +return dst;
> > > > +}
> > > > +if (n < 64) {
> > > > +rte_memcpy_ge16_lt64((uint8_t *)dst, (const uint8_t *)src, n);
> > > > +return dst;
> > > > +}
> > >
> > > Unfortunately we have 128B cache arm64 implementation too. Could you
> > > please take care that based on RTE_CACHE_LINE_SIZE
> > >
> >
> > Here the value of '64' is not the cache line size. But because the
> > prefetch itself costs some cycles, it's not worthwhile to prefetch for
> > small copies (e.g. < 64 bytes). Per my test, prefetching for small copies
> > actually lowers the performance.
>
> But
> I think, '64' is a function of cache size. ie. Any reason why we haven't used
> rte_memcpy_ge16_lt128()/rte_memcpy_ge128 pair instead of
> rte_memcpy_ge16_lt64//rte_memcpy_ge64 pair?
> I think, if you can add one more conditional compilation to choose between
> rte_memcpy_ge16_lt128()/rte_memcpy_ge128 vs
> rte_memcpy_ge16_lt64//rte_memcpy_ge64,
> will address the all arm64 variants supported in current DPDK.
>

The logic for 128B cache lines is implemented as you suggested and has been added in the V3 patch.
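
For readers following the thread, the size dispatch discussed above can be sketched roughly as below. This is not the V3 patch code: the SKETCH_* names and sketch_select_path() are hypothetical, the fallback definition of RTE_CACHE_LINE_SIZE is an assumption so that the snippet builds on its own, and the 128-byte cutoff simply mirrors the lt128/ge128 pairing suggested in the review.

```c
#include <stddef.h>

/* Assumption: the DPDK build normally provides RTE_CACHE_LINE_SIZE.
 * Default to 64 here only so this sketch compiles standalone. */
#ifndef RTE_CACHE_LINE_SIZE
#define RTE_CACHE_LINE_SIZE 64
#endif

/* Hypothetical name: copies below this size take the small-copy path. */
#if RTE_CACHE_LINE_SIZE >= 128
#define SKETCH_SMALL_COPY_LIMIT 128
#else
#define SKETCH_SMALL_COPY_LIMIT 64
#endif

/* Hypothetical stand-in for the patch's tunable threshold: copies larger
 * than this would be handed to the libc memcpy(). */
#define SKETCH_MEMCPY_THRESHOLD ((size_t)0xffffffff)

enum sketch_path {
	SKETCH_LT16,   /* n < 16 */
	SKETCH_SMALL,  /* 16 <= n < small-copy limit */
	SKETCH_LARGE,  /* small-copy limit <= n <= threshold */
	SKETCH_LIBC    /* n > threshold */
};

/* Pick a copy path from the size alone, in the same order as the
 * if-chain quoted from the patch. */
static inline enum sketch_path
sketch_select_path(size_t n)
{
	if (n < 16)
		return SKETCH_LT16;
	if (n < SKETCH_SMALL_COPY_LIMIT)
		return SKETCH_SMALL;
	if (n <= SKETCH_MEMCPY_THRESHOLD)
		return SKETCH_LARGE;
	return SKETCH_LIBC;
}
```

Keeping the cutoff in one macro means a 128B cache line build only changes the boundary between the small and large paths; the ordering of the checks stays the same.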
> >
> > On the other hand, I can only find one 128B cache line aarch64 machine here,
> > and there are some optimizations specific to this machine. I am not sure
> > whether they will be beneficial for other 128B cache machines. I prefer not
> > to put them in this patch but in a later standalone patch for 128B cache
> > machines.
> >
> > > > +__builtin_prefetch(src, 0, 0); // rte_prefetch_non_temporal(src);
> > > > +__builtin_prefetch(dst, 1, 0); // * unchanged *
>
> # Why only once __builtin_prefetch used? Why not invoke in rte_memcpy_ge64 loop
> # Does it make sense to prefetch src + 64/128 * n

Prefetching is only necessary once, at the beginning. The CPU prefetches ahead automatically once the continuous memory access starts, so it is not necessary to prefetch inside the loop. In fact, prefetching in the loop will actually break the CPU's HW prefetch and degrade performance.
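
A minimal sketch of the "prefetch once, then copy" pattern described above, assuming a GCC or Clang toolchain (for __builtin_prefetch). The function sketch_copy_prefetch_once() is a hypothetical illustration, not the patch's rte_memcpy_ge64(), and the 64-byte block size is only an example value.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: issue one software prefetch per buffer up front,
 * then copy sequentially with no further prefetch hints in the loop. */
static inline void *
sketch_copy_prefetch_once(void *dst, const void *src, size_t n)
{
	uint8_t *d = dst;
	const uint8_t *s = src;

	/* Same hints as the quoted patch lines: read/no locality for src,
	 * write/no locality for dst. */
	__builtin_prefetch(s, 0, 0);
	__builtin_prefetch(d, 1, 0);

	/* Sequential 64-byte blocks; no prefetch inside the loop. */
	while (n >= 64) {
		memcpy(d, s, 64);
		d += 64;
		s += 64;
		n -= 64;
	}

	/* Copy the remaining tail, if any. */
	if (n > 0)
		memcpy(d, s, n);

	return dst;
}
```

Whether the in-loop prefetch helps or hurts depends on the core, so the behavior is best confirmed with memcpy_perf_autotest on the target machine.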