To: Jerin Jacob, Nithin Dabilpuram, Vamsi Attunuru
CC: Pavan Nikhilesh, Olivier Matz, Aaron Conole
Date: Sat, 22 Jun 2019 18:54:15 +0530
Message-ID: <20190622132417.32694-26-jerinj@marvell.com>
In-Reply-To: <20190622132417.32694-1-jerinj@marvell.com>
References: <20190617155537.36144-1-jerinj@marvell.com> <20190622132417.32694-1-jerinj@marvell.com>
Subject: [dpdk-dev] [PATCH v4 25/27] mempool/octeontx2: add optimized dequeue operation for arm64
List-Id: DPDK patches and discussions

From: Pavan Nikhilesh

This patch adds an optimized routine, based on arm64 instructions, that
exploits the CPU pipeline characteristics of octeontx2. The idea is to
fill the pipeline with as many CASP operations as the HW can handle, so
that the HW alloc() operations run at full throttle.
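For readers less familiar with CASP: it is the Armv8.1 LSE compare-and-swap-pair instruction, which works on two 64-bit registers at once, so in this patch each CASP issued against the NPA allocation address returns two object pointers. The sketch below only models the data layout (how the 64-bit operand is duplicated into a 128-bit pair and how a 128-bit result splits into two 64-bit values); it does not perform any atomic operation. The helper names `pack_pair` and `unpack_pair` are hypothetical, and `__uint128_t` is assumed to be available (GCC/Clang on a 64-bit target).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Duplicate a 64-bit operand into both halves of a 128-bit value, in the
 * spirit of wdata128 in the patch. Going through uint64_t first keeps a
 * negative input from sign-extending into the upper half when widened.
 * (pack_pair is a hypothetical helper, not part of the driver.)
 */
static __uint128_t pack_pair(int64_t wdata)
{
	uint64_t w = (uint64_t)wdata;

	return ((__uint128_t)w << 64) | w;
}

/*
 * Split a 128-bit result into its two 64-bit halves: out[0] is the low
 * half, out[1] the high half. The patch does the equivalent with
 * "fmov d16, %[t0]" and "fmov v16.D[1], %H[t0]" before storing both lanes.
 * (unpack_pair is a hypothetical helper, not part of the driver.)
 */
static void unpack_pair(__uint128_t v, uint64_t out[2])
{
	out[0] = (uint64_t)v;
	out[1] = (uint64_t)(v >> 64);
}
```

Calling `unpack_pair(pack_pair(wdata), out)` leaves the same 64-bit value in both `out[0]` and `out[1]`, which is the operand shape each CASP in the patch is given.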
Cc: Olivier Matz
Cc: Aaron Conole

Signed-off-by: Pavan Nikhilesh
Signed-off-by: Jerin Jacob
Signed-off-by: Vamsi Attunuru
---
 drivers/mempool/octeontx2/otx2_mempool_ops.c | 301 +++++++++++++++++++
 1 file changed, 301 insertions(+)

diff --git a/drivers/mempool/octeontx2/otx2_mempool_ops.c b/drivers/mempool/octeontx2/otx2_mempool_ops.c
index c59bd73c0..25170015a 100644
--- a/drivers/mempool/octeontx2/otx2_mempool_ops.c
+++ b/drivers/mempool/octeontx2/otx2_mempool_ops.c
@@ -37,6 +37,303 @@ npa_lf_aura_op_alloc_one(const int64_t wdata, int64_t * const addr,
 	return -ENOENT;
 }
 
+#if defined(RTE_ARCH_ARM64)
+static __rte_noinline int
+npa_lf_aura_op_search_alloc(const int64_t wdata, int64_t * const addr,
+		void **obj_table, unsigned int n)
+{
+	uint8_t i;
+
+	for (i = 0; i < n; i++) {
+		if (obj_table[i] != NULL)
+			continue;
+		if (npa_lf_aura_op_alloc_one(wdata, addr, obj_table, i))
+			return -ENOENT;
+	}
+
+	return 0;
+}
+
+/*
+ * Some compiler versions do not fully support __int128_t operands in CASP
+ * inline asm: if the optimization level is reduced to -O0, the CASP register
+ * pairing restrictions are not followed and the compiler may end up violating
+ * the CASP rules. Work around it by explicitly requesting optimize("-O3").
+ *
+ * Example:
+ * ccSPMGzq.s:1648: Error: reg pair must start from even reg at
+ * operand 1 - `casp x21,x22,x0,x1,[x19]'
+ */
+static __attribute__((optimize("-O3"))) __rte_noinline int __hot
+npa_lf_aura_op_alloc_bulk(const int64_t wdata, int64_t * const addr,
+			  unsigned int n, void **obj_table)
+{
+	const __uint128_t wdata128 = ((__uint128_t)wdata << 64) | wdata;
+	uint64x2_t failed = vdupq_n_u64(~0);
+
+	switch (n) {
+	case 32:
+	{
+		__uint128_t t0, t1, t2, t3, t4, t5, t6, t7, t8, t9;
+		__uint128_t t10, t11;
+
+		asm volatile (
+		".cpu generic+lse\n"
+		"casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t1], %H[t1], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t2], %H[t2], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t3], %H[t3], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t4], %H[t4], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t5], %H[t5], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t6], %H[t6], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t7], %H[t7], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t8], %H[t8], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t9], %H[t9], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t10], %H[t10], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t11], %H[t11], %[wdata], %H[wdata], [%[loc]]\n"
+		"fmov d16, %[t0]\n"
+		"fmov v16.D[1], %H[t0]\n"
+		"casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"
+		"fmov d17, %[t1]\n"
+		"fmov v17.D[1], %H[t1]\n"
+		"casp %[t1], %H[t1], %[wdata], %H[wdata], [%[loc]]\n"
+		"fmov d18, %[t2]\n"
+		"fmov v18.D[1], %H[t2]\n"
+		"casp %[t2], %H[t2], %[wdata], %H[wdata], [%[loc]]\n"
+		"fmov d19, %[t3]\n"
+		"fmov v19.D[1], %H[t3]\n"
+		"casp %[t3], %H[t3], %[wdata], %H[wdata], [%[loc]]\n"
+		"and %[failed].16B, %[failed].16B, v16.16B\n"
+		"and %[failed].16B, %[failed].16B, v17.16B\n"
+		"and %[failed].16B, %[failed].16B, v18.16B\n"
+		"and %[failed].16B, %[failed].16B, v19.16B\n"
+		"fmov d20, %[t4]\n"
+		"fmov v20.D[1], %H[t4]\n"
+		"fmov d21, %[t5]\n"
+		"fmov v21.D[1], %H[t5]\n"
+		"fmov d22, %[t6]\n"
+		"fmov v22.D[1], %H[t6]\n"
+		"fmov d23, %[t7]\n"
+		"fmov v23.D[1], %H[t7]\n"
+		"and %[failed].16B, %[failed].16B, v20.16B\n"
+		"and %[failed].16B, %[failed].16B, v21.16B\n"
+		"and %[failed].16B, %[failed].16B, v22.16B\n"
+		"and %[failed].16B, %[failed].16B, v23.16B\n"
+		"st1 { v16.2d, v17.2d, v18.2d, v19.2d}, [%[dst]], 64\n"
+		"st1 { v20.2d, v21.2d, v22.2d, v23.2d}, [%[dst]], 64\n"
+		"fmov d16, %[t8]\n"
+		"fmov v16.D[1], %H[t8]\n"
+		"fmov d17, %[t9]\n"
+		"fmov v17.D[1], %H[t9]\n"
+		"fmov d18, %[t10]\n"
+		"fmov v18.D[1], %H[t10]\n"
+		"fmov d19, %[t11]\n"
+		"fmov v19.D[1], %H[t11]\n"
+		"and %[failed].16B, %[failed].16B, v16.16B\n"
+		"and %[failed].16B, %[failed].16B, v17.16B\n"
+		"and %[failed].16B, %[failed].16B, v18.16B\n"
+		"and %[failed].16B, %[failed].16B, v19.16B\n"
+		"fmov d20, %[t0]\n"
+		"fmov v20.D[1], %H[t0]\n"
+		"fmov d21, %[t1]\n"
+		"fmov v21.D[1], %H[t1]\n"
+		"fmov d22, %[t2]\n"
+		"fmov v22.D[1], %H[t2]\n"
+		"fmov d23, %[t3]\n"
+		"fmov v23.D[1], %H[t3]\n"
+		"and %[failed].16B, %[failed].16B, v20.16B\n"
+		"and %[failed].16B, %[failed].16B, v21.16B\n"
+		"and %[failed].16B, %[failed].16B, v22.16B\n"
+		"and %[failed].16B, %[failed].16B, v23.16B\n"
+		"st1 { v16.2d, v17.2d, v18.2d, v19.2d}, [%[dst]], 64\n"
+		"st1 { v20.2d, v21.2d, v22.2d, v23.2d}, [%[dst]], 64\n"
+		: "+Q" (*addr), [failed] "=&w" (failed),
+		[t0] "=&r" (t0), [t1] "=&r" (t1), [t2] "=&r" (t2),
+		[t3] "=&r" (t3), [t4] "=&r" (t4), [t5] "=&r" (t5),
+		[t6] "=&r" (t6), [t7] "=&r" (t7), [t8] "=&r" (t8),
+		[t9] "=&r" (t9), [t10] "=&r" (t10), [t11] "=&r" (t11)
+		: [wdata] "r" (wdata128), [dst] "r" (obj_table),
+		[loc] "r" (addr)
+		: "memory", "v16", "v17", "v18",
+		"v19", "v20", "v21", "v22", "v23"
+		);
+		break;
+	}
+	case 16:
+	{
+		__uint128_t t0, t1, t2, t3, t4, t5, t6, t7;
+
+		asm volatile (
+		".cpu generic+lse\n"
+		"casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t1], %H[t1], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t2], %H[t2], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t3], %H[t3], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t4], %H[t4], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t5], %H[t5], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t6], %H[t6], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t7], %H[t7], %[wdata], %H[wdata], [%[loc]]\n"
+		"fmov d16, %[t0]\n"
+		"fmov v16.D[1], %H[t0]\n"
+		"fmov d17, %[t1]\n"
+		"fmov v17.D[1], %H[t1]\n"
+		"fmov d18, %[t2]\n"
+		"fmov v18.D[1], %H[t2]\n"
+		"fmov d19, %[t3]\n"
+		"fmov v19.D[1], %H[t3]\n"
+		"and %[failed].16B, %[failed].16B, v16.16B\n"
+		"and %[failed].16B, %[failed].16B, v17.16B\n"
+		"and %[failed].16B, %[failed].16B, v18.16B\n"
+		"and %[failed].16B, %[failed].16B, v19.16B\n"
+		"fmov d20, %[t4]\n"
+		"fmov v20.D[1], %H[t4]\n"
+		"fmov d21, %[t5]\n"
+		"fmov v21.D[1], %H[t5]\n"
+		"fmov d22, %[t6]\n"
+		"fmov v22.D[1], %H[t6]\n"
+		"fmov d23, %[t7]\n"
+		"fmov v23.D[1], %H[t7]\n"
+		"and %[failed].16B, %[failed].16B, v20.16B\n"
+		"and %[failed].16B, %[failed].16B, v21.16B\n"
+		"and %[failed].16B, %[failed].16B, v22.16B\n"
+		"and %[failed].16B, %[failed].16B, v23.16B\n"
+		"st1 { v16.2d, v17.2d, v18.2d, v19.2d}, [%[dst]], 64\n"
+		"st1 { v20.2d, v21.2d, v22.2d, v23.2d}, [%[dst]], 64\n"
+		: "+Q" (*addr), [failed] "=&w" (failed),
+		[t0] "=&r" (t0), [t1] "=&r" (t1), [t2] "=&r" (t2),
+		[t3] "=&r" (t3), [t4] "=&r" (t4), [t5] "=&r" (t5),
+		[t6] "=&r" (t6), [t7] "=&r" (t7)
+		: [wdata] "r" (wdata128), [dst] "r" (obj_table),
+		[loc] "r" (addr)
+		: "memory", "v16", "v17", "v18", "v19",
+		"v20", "v21", "v22", "v23"
+		);
+		break;
+	}
+	case 8:
+	{
+		__uint128_t t0, t1, t2, t3;
+
+		asm volatile (
+		".cpu generic+lse\n"
+		"casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t1], %H[t1], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t2], %H[t2], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t3], %H[t3], %[wdata], %H[wdata], [%[loc]]\n"
+		"fmov d16, %[t0]\n"
+		"fmov v16.D[1], %H[t0]\n"
+		"fmov d17, %[t1]\n"
+		"fmov v17.D[1], %H[t1]\n"
+		"fmov d18, %[t2]\n"
+		"fmov v18.D[1], %H[t2]\n"
+		"fmov d19, %[t3]\n"
+		"fmov v19.D[1], %H[t3]\n"
+		"and %[failed].16B, %[failed].16B, v16.16B\n"
+		"and %[failed].16B, %[failed].16B, v17.16B\n"
+		"and %[failed].16B, %[failed].16B, v18.16B\n"
+		"and %[failed].16B, %[failed].16B, v19.16B\n"
+		"st1 { v16.2d, v17.2d, v18.2d, v19.2d}, [%[dst]], 64\n"
+		: "+Q" (*addr), [failed] "=&w" (failed),
+		[t0] "=&r" (t0), [t1] "=&r" (t1), [t2] "=&r" (t2),
+		[t3] "=&r" (t3)
+		: [wdata] "r" (wdata128), [dst] "r" (obj_table),
+		[loc] "r" (addr)
+		: "memory", "v16", "v17", "v18", "v19"
+		);
+		break;
+	}
+	case 4:
+	{
+		__uint128_t t0, t1;
+
+		asm volatile (
+		".cpu generic+lse\n"
+		"casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"
+		"casp %[t1], %H[t1], %[wdata], %H[wdata], [%[loc]]\n"
+		"fmov d16, %[t0]\n"
+		"fmov v16.D[1], %H[t0]\n"
+		"fmov d17, %[t1]\n"
+		"fmov v17.D[1], %H[t1]\n"
+		"and %[failed].16B, %[failed].16B, v16.16B\n"
+		"and %[failed].16B, %[failed].16B, v17.16B\n"
+		"st1 { v16.2d, v17.2d}, [%[dst]], 32\n"
+		: "+Q" (*addr), [failed] "=&w" (failed),
+		[t0] "=&r" (t0), [t1] "=&r" (t1)
+		: [wdata] "r" (wdata128), [dst] "r" (obj_table),
+		[loc] "r" (addr)
+		: "memory", "v16", "v17"
+		);
+		break;
+	}
+	case 2:
+	{
+		__uint128_t t0;
+
+		asm volatile (
+		".cpu generic+lse\n"
+		"casp %[t0], %H[t0], %[wdata], %H[wdata], [%[loc]]\n"
+		"fmov d16, %[t0]\n"
+		"fmov v16.D[1], %H[t0]\n"
+		"and %[failed].16B, %[failed].16B, v16.16B\n"
+		"st1 { v16.2d}, [%[dst]], 16\n"
+		: "+Q" (*addr), [failed] "=&w" (failed),
+		[t0] "=&r" (t0)
+		: [wdata] "r" (wdata128), [dst] "r" (obj_table),
+		[loc] "r" (addr)
+		: "memory", "v16"
+		);
+		break;
+	}
+	case 1:
+		return npa_lf_aura_op_alloc_one(wdata, addr, obj_table, 0);
+	}
+
+	if (unlikely(!(vgetq_lane_u64(failed, 0) & vgetq_lane_u64(failed, 1))))
+		return npa_lf_aura_op_search_alloc(wdata, addr, (void **)
+			((char *)obj_table - (sizeof(uint64_t) * n)), n);
+
+	return 0;
+}
+
+static __rte_noinline void
+otx2_npa_clear_alloc(struct rte_mempool *mp, void **obj_table, unsigned int n)
+{
+	unsigned int i;
+
+	for (i = 0; i < n; i++) {
+		if (obj_table[i] != NULL) {
+			otx2_npa_enq(mp, &obj_table[i], 1);
+			obj_table[i] = NULL;
+		}
+	}
+}
+
+static inline int __hot
+otx2_npa_deq_arm64(struct rte_mempool *mp, void **obj_table, unsigned int n)
+{
+	const int64_t wdata = npa_lf_aura_handle_to_aura(mp->pool_id);
+	void **obj_table_bak = obj_table;
+	const unsigned int nfree = n;
+	unsigned int parts;
+
+	int64_t * const addr = (int64_t * const)
+			(npa_lf_aura_handle_to_base(mp->pool_id) +
+				NPA_LF_AURA_OP_ALLOCX(0));
+	while (n) {
+		parts = n > 31 ? 32 : rte_align32prevpow2(n);
+		n -= parts;
+		if (unlikely(npa_lf_aura_op_alloc_bulk(wdata, addr,
+				parts, obj_table))) {
+			otx2_npa_clear_alloc(mp, obj_table_bak, nfree - n);
+			return -ENOENT;
+		}
+		obj_table += parts;
+	}
+
+	return 0;
+}
+#endif
+
 static inline int __hot
 otx2_npa_deq(struct rte_mempool *mp, void **obj_table, unsigned int n)
 {
@@ -463,7 +760,11 @@ static struct rte_mempool_ops otx2_npa_ops = {
 	.get_count = otx2_npa_get_count,
 	.calc_mem_size = otx2_npa_calc_mem_size,
 	.populate = otx2_npa_populate,
+#if defined(RTE_ARCH_ARM64)
+	.dequeue = otx2_npa_deq_arm64,
+#else
 	.dequeue = otx2_npa_deq,
+#endif
 };
 
 MEMPOOL_REGISTER_OPS(otx2_npa_ops);
-- 
2.21.0
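
The control flow around the asm blocks can be modelled in portable C, which may help when reviewing the patch without an arm64 target at hand. The sketch below is only an illustration under stated assumptions: `prev_pow2`, `split_request` and `chunk_needs_search` are hypothetical names, `prev_pow2` is a stand-in for `rte_align32prevpow2()`, and nothing here touches the NPA hardware. It shows how a request is split into chunks of 32, 16, 8, 4, 2 or 1 objects, and how AND-ing every returned pointer together detects a failed (NULL) allocation in a chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Largest power of two <= x, for x > 0. Hypothetical portable stand-in
 * for rte_align32prevpow2(). */
static unsigned int prev_pow2(unsigned int x)
{
	unsigned int p = 1;

	assert(x > 0);
	while (p <= x / 2)
		p *= 2;
	return p;
}

/*
 * Split a request for n objects into chunk sizes the bulk routine handles
 * (32, 16, 8, 4, 2 or 1), following the loop in otx2_npa_deq_arm64().
 * Stores up to cap chunk sizes in out[] and returns how many were stored.
 */
static size_t split_request(unsigned int n, unsigned int *out, size_t cap)
{
	size_t count = 0;

	while (n && count < cap) {
		unsigned int parts = n > 31 ? 32 : prev_pow2(n);

		out[count++] = parts;
		n -= parts;
	}
	return count;
}

/*
 * AND all pointers of a chunk together, as the asm does with `failed`.
 * A NULL entry always forces the result to zero, so zero means "go and
 * search for NULL slots". Distinct non-NULL pointers can also AND to zero;
 * in that case the search simply finds nothing to retry.
 */
static int chunk_needs_search(void *const *objs, unsigned int n)
{
	uintptr_t acc = ~(uintptr_t)0;
	unsigned int i;

	for (i = 0; i < n; i++)
		acc &= (uintptr_t)objs[i];
	return acc == 0;
}
```

For example, a request for 77 objects is served as chunks of 32, 32, 8, 4 and 1. The AND-reduction keeps the common path down to a single test per chunk, at the cost of an occasional unnecessary scan through the slow path.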