From: Aaron Conole
To: Pavan Nikhilesh Bhagavatula
Cc: Jerin Jacob Kollanukkaran, "dev@dpdk.org", Nithin Kumar Dabilpuram, Vamsi Krishna Attunuru, Olivier Matz
Subject: Re: [dpdk-dev] [EXT] Re: [PATCH v3 25/27] mempool/octeontx2: add optimized dequeue operation for arm64
Date: Fri, 21 Jun 2019 15:26:56 -0400
References: <20190601014905.45531-1-jerinj@marvell.com> <20190617155537.36144-1-jerinj@marvell.com> <20190617155537.36144-26-jerinj@marvell.com>
In-Reply-To: (Pavan Nikhilesh Bhagavatula's message of "Tue, 18 Jun 2019 07:39:23 +0000")

Pavan Nikhilesh Bhagavatula writes:

> Hi Aaron,
>
>>-----Original Message-----
>>From: Aaron Conole
>>Sent: Tuesday, June 18, 2019 2:55 AM
>>To: Jerin Jacob Kollanukkaran
>>Cc: dev@dpdk.org; Nithin Kumar Dabilpuram; Vamsi Krishna Attunuru;
>>Pavan Nikhilesh Bhagavatula; Olivier Matz
>>Subject: [EXT] Re: [dpdk-dev] [PATCH v3 25/27] mempool/octeontx2:
>>add optimized dequeue operation for arm64
>>
>>> From: Pavan Nikhilesh
>>>
>>> This patch adds an optimized arm64 instruction based routine to leverage
>>> CPU pipeline characteristics of octeontx2. The theme is to fill the
>>> pipeline with CASP operations as much HW can do so that HW can do alloc()
>>> HW ops in full throttle.
>>>
>>> Cc: Olivier Matz
>>> Cc: Aaron Conole
>>>
>>> Signed-off-by: Pavan Nikhilesh
>>> Signed-off-by: Jerin Jacob
>>> Signed-off-by: Vamsi Attunuru
>>> ---
>>>  drivers/mempool/octeontx2/otx2_mempool_ops.c | 291 +++++++++++++++++++
>>>  1 file changed, 291 insertions(+)
>>>
>>> diff --git a/drivers/mempool/octeontx2/otx2_mempool_ops.c b/drivers/mempool/octeontx2/otx2_mempool_ops.c
>>> index c59bd73c0..e6737abda 100644
>>> --- a/drivers/mempool/octeontx2/otx2_mempool_ops.c
>>> +++ b/drivers/mempool/octeontx2/otx2_mempool_ops.c
>>> @@ -37,6 +37,293 @@ npa_lf_aura_op_alloc_one(const int64_t wdata, int64_t * const addr,
>>>      return -ENOENT;
>>>  }
>>>
>>> +#if defined(RTE_ARCH_ARM64)
>>> +static __rte_noinline int
>>> +npa_lf_aura_op_search_alloc(const int64_t wdata, int64_t * const addr,
>>> +        void **obj_table, unsigned int n)
>>> +{
>>> +    uint8_t i;
>>> +
>>> +    for (i = 0; i < n; i++) {
>>> +        if (obj_table[i] != NULL)
>>> +            continue;
>>> +        if (npa_lf_aura_op_alloc_one(wdata, addr, obj_table, i))
>>> +            return -ENOENT;
>>> +    }
>>> +
>>> +    return 0;
>>> +}
>>> +
>>> +static __attribute__((optimize("-O3"))) __rte_noinline int __hot
>>
>>Sorry if I missed this before.
>>
>>Is there a good reason to hard-code this optimization, rather than let
>>the build system provide it?
>
> Some versions of the compiler don't have support for __int128_t for CASP inline asm,
> i.e. if the optimization level is reduced to -O0 the CASP restrictions aren't followed and
> the compiler might end up violating the CASP rules, for example:
>
> /tmp/ccSPMGzq.s:1648: Error: reg pair must start from even reg at operand 1 - `casp x21,x22,x0,x1,[x19]'
> /tmp/ccSPMGzq.s:1706: Error: reg pair must start from even reg at operand 1 - `casp x13,x14,x0,x1,[x11]'
> /tmp/ccSPMGzq.s:1745: Error: reg pair must start from even reg at operand 1 - `casp x9,x10,x0,x1,[x7]'
> /tmp/ccSPMGzq.s:1775: Error: reg pair must start from even reg at operand 1 - `casp x7,x8,x0,x1,[x5]'
>
> Forcing -O3 with __rte_noinline in place fixes it, as the alignment fits in.

It makes sense to document this, since it isn't apparent why it is needed.
It would be good to put a comment just before the function that explains it,
preferably naming the compilers that aren't behaving. This would help in the
future to determine when it would be safe to drop the flag.

> Regards,
> Pavan.
>
>>
>>> +npa_lf_aura_op_alloc_bulk(const int64_t wdata, int64_t * const addr,
>>> +        unsigned int n, void **obj_table)
>>> +{
>>> +    const __uint128_t wdata128 = ((__uint128_t)wdata << 64) | wdata;
>>> +    uint64x2_t failed = vdupq_n_u64(~0);
>>> +
>>> +    switch (n) {
>>> +    case 32:
>>> +    {
>>> +    __uint128_t t0, t1, t2, t3, t4, t5, t6, t7, t8, t9;
>>> +    __uint128_t t10, t11;
>>> +
>>> +    asm volatile (
>>> +        ".cpu generic+lse\n"
>>> +        "casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"