From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from na01-bl2-obe.outbound.protection.outlook.com (mail-bl2on0087.outbound.protection.outlook.com [65.55.169.87]) by dpdk.org (Postfix) with ESMTP id 1D28B8E56 for ; Wed, 2 Dec 2015 15:34:45 +0100 (CET) Authentication-Results: spf=none (sender IP is ) smtp.mailfrom=Jerin.Jacob@caviumnetworks.com; Received: from localhost.localdomain (122.167.201.210) by BLUPR0701MB1714.namprd07.prod.outlook.com (10.163.85.140) with Microsoft SMTP Server (TLS) id 15.1.331.20; Wed, 2 Dec 2015 14:34:40 +0000 Date: Wed, 2 Dec 2015 20:04:20 +0530 From: Jerin Jacob To: Jianbo Liu Message-ID: <20151202143415.GA11757@localhost.localdomain> References: <1448995276-9599-1-git-send-email-jianbo.liu@linaro.org> <1448995276-9599-4-git-send-email-jianbo.liu@linaro.org> <20151201164139.GA12144@localhost.localdomain> <20151202080259.GA32494@localhost.localdomain> <20151202103903.GA4940@localhost.localdomain> MIME-Version: 1.0 Content-Type: text/plain; charset="us-ascii" Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.23 (2014-03-12) X-Originating-IP: [122.167.201.210] X-ClientProxiedBy: MAXPR01CA0001.INDPRD01.PROD.OUTLOOK.COM (25.164.147.8) To BLUPR0701MB1714.namprd07.prod.outlook.com (25.163.85.140) X-Microsoft-Exchange-Diagnostics: 1; BLUPR0701MB1714; 2:Ddf2TZQ9b9PRy1Zns4NyP58mxFZJ+Hd8fdakvGBCBUjH7eH0rdn/w29b82tIXnJPFUoQ2NFQx7heULQX1U+CDXvKZQxgzON17kDAvCirv1z1EUQgAERQemO0x7UM67s3pUhnwqjM+yBX+w9Fijt+mQ==; 3:IsmZLFmGAvlDorHiqeP9wlf41vyf3sJgtXMb9QcpFpC6qBk1hVuNBzhL2oJnvw+rMsiuXoiNe3WbSFWBbiOSWOIUQ77k9nwqS6qRjHNctEuSNHMpiD9TQj0NVyEJdCAp; 25:4UCwLxSleBJbau3qOzRoT4Qj7LQkZ5Fg6Oj+x7M9OyhvLXl515iuymdneoTHAEKDfpcFq8HySjE9lEIU6Wb6J1IQffc2Rj3TccBaF8L+UJ8hKC4vaxAdYuhSrawc9MLd/3nblk56y25YPLE8+Zq7DmXiYnMrIOHOGUSXoN8gPFJj5Sq9xZJ7y/Xykt5DHX+Azr2TcnaDUoAi5/j7beG2vseupcCjIMLDN8vw+9XjDWVOMNNZkCGCRmUq/BOGG1LsiE0c5gPRaPMZ6H/HF8SiQQ== X-Microsoft-Antispam: UriScan:;BCL:0;PCL:0;RULEID:;SRVR:BLUPR0701MB1714; X-Microsoft-Exchange-Diagnostics: 1; BLUPR0701MB1714; 20:51bud0FcOHpycQUiJC7KVHy+Of+SL0+5DuoYnSFS0HibTiKLFca3vM253jOo0rmn+6UKX4G9RzXUnFghg2AWBaLW8OpdQktsLRhI4i2ZiQUdplCEAgWIm6vbzxY5vkqTWe7RdnkdwyP3tlTSiuHSg7PxgVu45X8+/RNPCNa5QSuyFJ1FUYKoua95WZviKvEwsggaJ9azbEdtUg+aZ0NZNf12zj+b/6OqNb2401MFaErCCSZ43Yvsqe2ivE9VffUEw0/HtDc+b7g6T7AaNGc9DVw/9aZcM2iz1YSycDNWSAfz2EX5b/ls6RI5dSjqi2YpFm3Ls1RYG42toKxuXpMP9s2OhJDN2yZTSxNcvnA/DamNA5iXNF5DV/sPRFxMaq/mUVW/0uoyq+JIuEuUBIj8f0jNIZmPvAMYsdSTv0Hs6V/78s66fj1nYNK9lKj+6G7XvAXv5YpZUmovPoeX8YoOTDKxiLlnnaQz2bFFonL2FVmZUar7DUnmi7ENX9/L7FMLP0qi7a8L36RwCPs+IXLzol/GB2uDp1TD9z187HpXkaphhEHxRdnV3mpe3cksrUYV9AGzFNw182LWC+RYDFkY424lrc/luSjqO9qNCJI8zS0= X-Microsoft-Antispam-PRVS: X-Exchange-Antispam-Report-Test: UriScan:(236414709691187); X-Exchange-Antispam-Report-CFA-Test: BCL:0; PCL:0; RULEID:(601004)(2401047)(520078)(8121501046)(5005006)(3002001)(10201501046); SRVR:BLUPR0701MB1714; BCL:0; PCL:0; RULEID:; SRVR:BLUPR0701MB1714; X-Microsoft-Exchange-Diagnostics: 1; BLUPR0701MB1714; 4:o76RKKzMWGO/n2MQg4UDRBmsfSP54XULBaveBmTkkxGmVOsMmHh09325DloKcQjt6L7mbKPA7UOLCHYeREWRdMvyCSnLbsE/Yj+cYqZHDPgmHgIRsXoex8Ohh7CywYe69uT4VDPZMfai20vtZBHHSx7wLKg8kQrFHf7Y0SO4rJXW9LCk9Kzre/h0qTOBxuFDK9spPeLsROCKrtHlbRoZ40ji06/QzrBjkwnwj2CqOSVWNIna5z61V8xA5g/xOwQP9vr62ggCZqbpF7YcoDn70xHcoTy7dI17ZXSi0kyunPhTgtTFvL3PlRNKwc+ZB+FruUGe9YJ6zpOFrOPqc/7vyt9xCtAZYeToGBse4ZhFcBUYOr+AWzSP2FMl+ut1PCQNzR85J2FGCsDBr8UUGdKFTirFDxfwVQk011rQtA9Wptkk5bDWCWbItKQ6uSPjZaKW X-Forefront-PRVS: 077884B8B5 X-Forefront-Antispam-Report: SFV:NSPM; SFS:(10009020)(6069001)(6009001)(199003)(189002)(24454002)(86362001)(50986999)(77096005)(122386002)(33656002)(61506002)(189998001)(2950100001)(3846002)(6116002)(23726003)(87976001)(83506001)(19580395003)(1076002)(40100003)(1096002)(66066001)(5004730100002)(54356999)(47776003)(76176999)(5008740100001)(4001350100001)(5001960100002)(42186005)(586003)(110136002)(81156007)(92566002)(93886004)(97756001)(101416001)(15975445007)(46406003)(105586002)(19580405001)(50466002)(106356001)(97736004)(7099028)(357404004); DIR:OUT; SFP:1101; SCL:1; SRVR:BLUPR0701MB1714; H:localhost.localdomain; FPR:; SPF:None; PTR:InfoNoRecords; A:1; MX:1; LANG:en; Received-SPF: None (protection.outlook.com: caviumnetworks.com does not designate permitted sender hosts) X-Microsoft-Exchange-Diagnostics: =?us-ascii?Q?1; BLUPR0701MB1714; 23:A0zsa9xApfOTxZRsyv49O4VZsxHYrL3VNPD0yJG?= =?us-ascii?Q?O9ixnBXYXvVmVXqqX6lLegnnVBhdoLgVXeK58OxdQDKl56VabLW/jVPlJTBC?= =?us-ascii?Q?cLQdFS8mgYojivekpqd/vNyeDSd/kadV1Eg3x6QZUGgD3i824H1v3vxmf8Eq?= =?us-ascii?Q?STJXlaaLFzR5fahuYppgGJT+rpvQruk3XimgeDtAu+Gu3PgHe94Dof0kATQ2?= =?us-ascii?Q?S1EHlgEXwoup2tTfEtSiwXHpBLqeQwjUBhKlYFhgWmDd5YSFnwu53YqSs7N1?= =?us-ascii?Q?foPFvsZXZx1JBptcUe/G6kgEuAGHDluma88nCdw3+I/3UY7Ms/OTOLVlPlhI?= =?us-ascii?Q?SBUDNo5Bmt9SQJhRklrhEmtFn0hMOUSCrlSpuMpLN6tHHQhX47fYAS9UflSk?= =?us-ascii?Q?4unHs1aUX3YmldDcwovKwb9BHO10ZB+3dvUAaLQuTkxYzDhcusNmstg0+Ly0?= =?us-ascii?Q?7n3qgNBttf8x1xS/NrZcX7YhLoJSgFFsZWHr98A3cLMKpUlAjb8P1TaYe8E+?= =?us-ascii?Q?EzeSyJjYn7qp2zDkTaFOpggmYI1b24f6KU3Nlicrp8GIIbsE783K54BLH78C?= =?us-ascii?Q?sMDC/MM9aEtPM1piEmqnrChC+ebRAC7vGErmJMkcOy8zgSDEEKcGLtw2RuYA?= =?us-ascii?Q?1giraF9ZzfQ0QvYMXLJiYbpKsGZCEF3CT4vEwiNPP4f1KJaAsKussXV5tpPN?= =?us-ascii?Q?VGoQGpZsDldobZYypGu6jjbLlWk2eA26MaEdkdobTQDjgvUxYwBqRcoj5qxt?= =?us-ascii?Q?XftNpOLWNNRAEbptcMOz/RMLCtXcBEouA5IKJQdQZrdkF0GnPLJ2u7gkU28d?= =?us-ascii?Q?HiKzisI4nAM16XFZg8AxDnJNViyz/Dtpo1K+btiZKwg6eaJeu6EPP8HiiQUl?= =?us-ascii?Q?/3xw82g390FesT6AVYy4d3zMLsc3Z9IzG2dc6EVk5IR/0TaDldlAviUCDJ9Y?= =?us-ascii?Q?J9+Zrj3iwyYei2WOsM8DcPIfZ1SNz862qAnalMm/GRJ9UJQLfhGAIk2KbP9u?= =?us-ascii?Q?07fbbPJvUA301OvBhtyeJaElouOKu9VPZT7NHj/IptklCmIboYJX0C0+cIcN?= =?us-ascii?Q?zxYW0pOyB4gCA4i5DUZyu414VP8U8Q22cVj7fXTHeqG+iZ+2apxSXntcuQ3W?= =?us-ascii?Q?e2w2leO1bhwniJhsxw+GtUMVi0P2JHDJPIibBUl8FmaSxby+lb3OSbkM4ix8?= =?us-ascii?Q?fyREoSCsziJ8wSOHwFt+WnmtWrh5EPf04Apd6?= X-Microsoft-Exchange-Diagnostics: 1; BLUPR0701MB1714; 5:LgR4ejaUQmiRVVEJcULxnsThjxGCZnZLe+YKg1g3DLbos4sDMcrK//82bBiPT03iQBwf7wRqdeT1GYqqkLZfrDMBWKvBFrTCFU96IhWov709TLK6hOVa2BlEVttej3Q4lptaW6MLm73vH5fTEfRebA==; 24:VFY6OQFgLufB+1Pv4tDi2Oh1ANSv/18fJy6/rmKJEelTrTxtxPk6S5UInE/wFIxVxL7/jqaOht/eP7uH9aqDPnb/1furz9tJDbkje8oTIFE= SpamDiagnosticOutput: 1:23 SpamDiagnosticMetadata: NSPM X-OriginatorOrg: caviumnetworks.com X-MS-Exchange-CrossTenant-OriginalArrivalTime: 02 Dec 2015 14:34:40.5279 (UTC) X-MS-Exchange-CrossTenant-FromEntityHeader: Hosted X-MS-Exchange-Transport-CrossTenantHeadersStamped: BLUPR0701MB1714 Cc: dev@dpdk.org Subject: Re: [dpdk-dev] [PATCH 3/4] eal/arm: Enable lpm/table/pipeline libs X-BeenThere: dev@dpdk.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: patches and discussions about DPDK List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-List-Received-Date: Wed, 02 Dec 2015 14:34:45 -0000 On Wed, Dec 02, 2015 at 09:13:51PM +0800, Jianbo Liu wrote: > On 2 December 2015 at 18:39, Jerin Jacob wrote: > > On Wed, Dec 02, 2015 at 05:49:41PM +0800, Jianbo Liu wrote: > >> On 2 December 2015 at 16:03, Jerin Jacob wrote: > >> > On Wed, Dec 02, 2015 at 02:54:52PM +0800, Jianbo Liu wrote: > >> >> On 2 December 2015 at 00:41, Jerin Jacob wrote: > >> >> > On Tue, Dec 01, 2015 at 01:41:15PM -0500, Jianbo Liu wrote: > >> >> >> Adds ARM NEON support for lpm. > >> >> >> And enables table/pipeline libraries which depend on lpm. > >> >> > > >> >> > I already sent the patch on the same yesterday. > >> >> > We can converge the patches after the discussion. > >> >> > Please check "[dpdk-dev] [PATCH 0/3] add lpm support for NEON" on ml > >> >> > > >> >> Yes, I have read your patch. But there are many differences, so I sent > >> >> mine for your reviewing :) > >> >> > >> >> > > >> >> >> > >> >> >> Signed-off-by: Jianbo Liu > >> >> >> --- > >> >> >> config/defconfig_arm-armv7a-linuxapp-gcc | 3 - > >> >> >> config/defconfig_arm64-armv8a-linuxapp-gcc | 3 - > >> >> >> lib/librte_eal/common/include/arch/arm/rte_vect.h | 28 ++++++++++ > >> >> >> lib/librte_lpm/rte_lpm.h | 68 ++++++++++++++++------- > >> >> >> 4 files changed, 77 insertions(+), 25 deletions(-) > >> >> >> > >> >> >> diff --git a/config/defconfig_arm-armv7a-linuxapp-gcc b/config/defconfig_arm-armv7a-linuxapp-gcc > >> >> >> index cbebd64..efffa1f 100644 > >> >> >> --- a/config/defconfig_arm-armv7a-linuxapp-gcc > >> >> >> +++ b/config/defconfig_arm-armv7a-linuxapp-gcc > >> >> >> @@ -53,9 +53,6 @@ CONFIG_RTE_LIBRTE_KNI=n > >> >> >> CONFIG_RTE_EAL_IGB_UIO=n > >> >> >> > >> >> >> # fails to compile on ARM > >> >> >> -CONFIG_RTE_LIBRTE_LPM=n > >> >> >> -CONFIG_RTE_LIBRTE_TABLE=n > >> >> >> -CONFIG_RTE_LIBRTE_PIPELINE=n > >> >> >> CONFIG_RTE_SCHED_VECTOR=n > >> >> >> > >> >> >> # cannot use those on ARM > >> >> >> diff --git a/config/defconfig_arm64-armv8a-linuxapp-gcc b/config/defconfig_arm64-armv8a-linuxapp-gcc > >> >> >> index 504f3ed..57f7941 100644 > >> >> >> --- a/config/defconfig_arm64-armv8a-linuxapp-gcc > >> >> >> +++ b/config/defconfig_arm64-armv8a-linuxapp-gcc > >> >> >> @@ -51,7 +51,4 @@ CONFIG_RTE_LIBRTE_IVSHMEM=n > >> >> >> CONFIG_RTE_LIBRTE_FM10K_PMD=n > >> >> >> CONFIG_RTE_LIBRTE_I40E_PMD=n > >> >> >> > >> >> >> -CONFIG_RTE_LIBRTE_LPM=n > >> >> >> -CONFIG_RTE_LIBRTE_TABLE=n > >> >> >> -CONFIG_RTE_LIBRTE_PIPELINE=n > >> >> >> CONFIG_RTE_SCHED_VECTOR=n > >> >> >> diff --git a/lib/librte_eal/common/include/arch/arm/rte_vect.h b/lib/librte_eal/common/include/arch/arm/rte_vect.h > >> >> >> index a33c054..7437711 100644 > >> >> >> --- a/lib/librte_eal/common/include/arch/arm/rte_vect.h > >> >> >> +++ b/lib/librte_eal/common/include/arch/arm/rte_vect.h > >> >> >> @@ -41,6 +41,8 @@ extern "C" { > >> >> >> > >> >> >> typedef int32x4_t xmm_t; > >> >> >> > >> >> >> +typedef int32x4_t __m128i; > >> >> >> + > >> >> >> #define XMM_SIZE (sizeof(xmm_t)) > >> >> >> #define XMM_MASK (XMM_SIZE - 1) > >> >> >> > >> >> >> @@ -53,6 +55,32 @@ typedef union rte_xmm { > >> >> >> double pd[XMM_SIZE / sizeof(double)]; > >> >> >> } __attribute__((aligned(16))) rte_xmm_t; > >> >> >> > >> >> >> +static __inline __m128i > >> >> >> +_mm_set_epi32(int i3, int i2, int i1, int i0) > >> >> >> +{ > >> >> >> + int32_t r[4] = {i0, i1, i2, i3}; > >> >> >> + > >> >> >> + return vld1q_s32(r); > >> >> >> +} > >> >> >> + > >> >> >> +static __inline __m128i > >> >> >> +_mm_loadu_si128(__m128i *p) > >> >> >> +{ > >> >> >> + return vld1q_s32((int32_t *)p); > >> >> >> +} > >> >> >> + > >> >> >> +static __inline __m128i > >> >> >> +_mm_set1_epi32(int i) > >> >> >> +{ > >> >> >> + return vdupq_n_s32(i); > >> >> >> +} > >> >> >> + > >> >> >> +static __inline __m128i > >> >> >> +_mm_and_si128(__m128i a, __m128i b) > >> >> >> +{ > >> >> >> + return vandq_s32(a, b); > >> >> >> +} > >> >> >> + > >> > > >> > IMO, it's not always good to emulate GCC defined intrinsics of > >> > other architecture. What if a legacy DPDK application has such mappings > >> > then BOOM, multiple definition, which one is correct? which one > >> > to comment it out? Integration pain starts for DPDK library consumer:-( > >> > > >> They can include rte_vect.h in build/include directly, which is linked correctly > >> to the one for that ARCH, so there is no need to worry about. > > > > I think you missed the point,I was trying to say that > > legacy DPDK application and third party stacks uses SSE2NEON kind of > > libraries > > for quick integration, for example, something like this > > https://github.com/jratcliff63367/sse2neon/blob/master/SSE2NEON.h > > > > AND they include "rte_lpm.h"(it internally includes rte_vect.h) > > that lead to multiple definition and its not good. > > > But you will have similar issue since "typedef int32x4_t __m128i" > appears in both your patch and this header file. I just tested it, it won't break, back to back "typedef int32x4_t __m128i" is fine(unlike inline function). my intention to keep __m128i "as is" because changing the __m128i to rte_??? something would break the ABI. > > >> > >> > >> >> > > >> >> > IMO, it makes sense to not emulate the SSE intrinsics with NEON > >> >> > Let's create the rte_vect_* as required. look at the existing patch. > >> >> > > >> >> I thought of creating a layer of SIMD over all the platforms before. > >> >> But can't you see it make things complicated, considering there are > >> >> only few simple intrinsic to implement? > >> > > >> > Not true, There were, a lot of SSE intrinsics needs be to emulated for ACL NEON > >> > implementation if I were to take this approach and emulation comes with > >> > the cost. > >> > > >> No, I will not re-implement all the intrinsic like that . > >> I only do with the simple intrinsic, such as load/store, as you said below. > > > > but you forced to add _mm_and_si128 also to the list and emulated > > _mm_and_si128 intrinsic. Am just saying no emulation. > > > I means simple intrinsic, not load/store only. > Depends on how you define emulation. Actually, these simple intrisinic > could be only one NEON instruction, and will not bring cost. > > > > >> > >> > So my take is, > >> > lets the each architecture implementation for specific SIMD version of DPDK > >> > API in the library should have the freedom to implement the API in > >> > NATIVE. > >> > > >> > And let's create only rte_vect_* abstraction only for using > >> > that API/library. Which boils down to have very minimal rte_vect_* > >> > abstraction to load, store, set not beyond that. > >> > > >> > This makes clear "contract" between DPDK library and the applications. > >> > and make easy for remaning new architecture porting effort in DPDK. > >> > > >> Agree. > >> But I reuse existing intrinsic names, and you recreate new ones. > >> And I try to do as few changes as possible, and try to avoid any > >> mistaken which may cause code un-compiled. > > > > Its trival to verify. Just compile it > > > >> I think it's design level question, we need to hear what others talk about it. > >> > >> > Imagine how your proposed function will look like if new architecture > >> > wants to implement "optimized" version of rte_lpm_lookupx4 > >> > > >> There is no optimization for this (simple) rte_lpm_lookupx4, otherwise > >> you have done that in your patch. > >> If there is for other new platform, defintely they should do like > >> yours, as you did for NEON ACL. > >> > >> > > >> >> If do so, we also need to explain to others how to use these interfaces. > >> >> Besides, this patch did the smallest changes to the original code, and > >> >> more likely to be accepted by others. > >> > > >> > other patch makes no changes to IA version of rte_lpm_lookupx4.I thought > >> > that make reviewer easy to review the changes in architecture > >> > perspective. > >> > > >> As I know, they don't enable LPM for PPC, and ARM is the first one to > >> touch this issue. > >> > >> >> > >> >> > > >> >> >> #ifdef RTE_ARCH_ARM > >> >> >> /* NEON intrinsic vqtbl1q_u8() is not supported in ARMv7-A(AArch32) */ > >> >> >> static __inline uint8x16_t > >> >> >> diff --git a/lib/librte_lpm/rte_lpm.h b/lib/librte_lpm/rte_lpm.h > >> >> >> index c299ce2..c76c07d 100644 > >> >> >> --- a/lib/librte_lpm/rte_lpm.h > >> >> >> +++ b/lib/librte_lpm/rte_lpm.h > >> >> >> @@ -361,6 +361,47 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t * ips, > >> >> >> /* Mask four results. */ > >> >> >> #define RTE_LPM_MASKX4_RES UINT64_C(0x00ff00ff00ff00ff) > >> >> >> > >> >> >> +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64) > >> >> > > >> >> > Separate out arm implementation to the different header file. > >> >> > Too many ifdef looks odd in the header file and difficult to manage. > >> >> > > >> >> But there are many ifdefs already. > >> >> And It seems unreasonable to add a new file only for one small function. > >> >> > >> > > >> > small or big, its matter of each architecture to have > >> > the freedom for the optimized version for the implementation. > >> > > >> > What if other architecture demands to write this function in assembly > >> > or restructure it for performance improvement? > >> > > >> If there is such demands, should do like that. > >> But I don't see any restructure in your patch, and you still follow > >> the logic as x86, is it worth adding a new file? > > > > SIMD Logic on getting 4 indexes for tbl24[] is different. > > > > /* get 4 indexes for tbl24[]. */ > > i24 = _mm_srli_epi32(ip, CHAR_BIT); > > > > /* extract values from tbl24[] */ > > idx = _mm_cvtsi128_si64(i24); > > i24 = _mm_srli_si128(i24, sizeof(uint64_t)); > > > > tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > > tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > > > > idx = _mm_cvtsi128_si64(i24); > > > > tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > > tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > > > > VS > > > > /* extract values from tbl24[] */ > > idx = vgetq_lane_u64((uint64x2_t)i24, 0); > > > > tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > > tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > > > > idx = vgetq_lane_u64((uint64x2_t)i24, 1); > > > > tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > > tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > > > It's only the optimazation of part of code in that function. I did the > similar in my patch. > But, looking from the whole, this function is not restructured, and > the logic is the same as x86. > > >> > >> > > >> >> > > >> >> >> +static inline void > >> >> >> +rte_lpm_tbl24_val4(const struct rte_lpm *lpm, int32x4_t ip, uint16_t tbl[4]) > >> >> >> +{ > >> >> >> + uint32x4_t i24; > >> >> >> + uint32_t idx[4]; > >> >> >> + > >> >> >> + /* get 4 indexes for tbl24[]. */ > >> >> >> + i24 = vshrq_n_u32(vreinterpretq_u32_s32(ip), CHAR_BIT); > >> >> >> + vst1q_u32(idx, i24); > >> >> >> + > >> >> >> + /* extract values from tbl24[] */ > >> >> >> + tbl[0] = *(const uint16_t *)&lpm->tbl24[idx[0]]; > >> >> >> + tbl[1] = *(const uint16_t *)&lpm->tbl24[idx[1]]; > >> >> >> + tbl[2] = *(const uint16_t *)&lpm->tbl24[idx[2]]; > >> >> >> + tbl[3] = *(const uint16_t *)&lpm->tbl24[idx[3]]; > >> >> >> +} > >> >> > > >> >> > Nice. There is an improvement in this portion code wrt my patch. This is > >> >> > a candidate for convergence. > >> >> > > >> >> > > >> >> >> +#else > >> >> >> +static inline void > >> >> >> +rte_lpm_tbl24_val4(const struct rte_lpm *lpm, __m128i ip, uint16_t tbl[4]) > >> >> >> +{ > >> >> >> + __m128i i24; > >> >> >> + uint64_t idx; > >> >> >> + > >> >> >> + /* get 4 indexes for tbl24[]. */ > >> >> >> + i24 = _mm_srli_epi32(ip, CHAR_BIT); > >> >> >> + > >> >> >> + /* extract values from tbl24[] */ > >> >> >> + idx = _mm_cvtsi128_si64(i24); > >> >> >> + i24 = _mm_srli_si128(i24, sizeof(uint64_t)); > >> >> >> + > >> >> >> + tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > >> >> >> + tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > >> >> >> + > >> >> >> + idx = _mm_cvtsi128_si64(i24); > >> >> >> + > >> >> >> + tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > >> >> >> + tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > >> >> >> +} > >> >> >> +#endif > >> >> >> + > >> >> >> /** > >> >> >> * Lookup four IP addresses in an LPM table. > >> >> >> * > >> >> >> @@ -381,17 +422,19 @@ rte_lpm_lookup_bulk_func(const struct rte_lpm *lpm, const uint32_t * ips, > >> >> >> * if lookup would fail. > >> >> >> */ > >> >> >> static inline void > >> >> >> +#if defined(RTE_ARCH_ARM) || defined(RTE_ARCH_ARM64) > >> >> >> +rte_lpm_lookupx4(const struct rte_lpm *lpm, int32x4_t ip, uint16_t hop[4], > >> >> >> + uint16_t defv) > >> >> > > >> >> > This would call for change in the change the ABI, > >> >> > IMO, __m128i can be used to represent 128bit vector to avoid ABI chang > >> >> > > >> >> This redefine rte_lpm_lookupx4 is unncessary, I will remove it, so no > >> >> ABI change. > >> >> And there only one ifdef for ARM platforms left. > >> >> > >> >> > > >> >> >> +#else > >> >> > separate out arm implementation to the different header file. Too many > >> >> > ifdef looks odd in the header file. > >> >> > > >> >> > Could you rebase your patch based on existing patch and send the > >> >> > improvement portion as separate patch or I can send update patch with > >> >> > your improvements and with your signoff. > >> >> > > >> >> > > >> >> >> rte_lpm_lookupx4(const struct rte_lpm *lpm, __m128i ip, uint16_t hop[4], > >> >> >> uint16_t defv) > >> >> >> +#endif > >> >> >> { > >> >> >> - __m128i i24; > >> >> >> rte_xmm_t i8; > >> >> >> uint16_t tbl[4]; > >> >> >> - uint64_t idx, pt; > >> >> >> - > >> >> >> - const __m128i mask8 = > >> >> >> - _mm_set_epi32(UINT8_MAX, UINT8_MAX, UINT8_MAX, UINT8_MAX); > >> >> >> + uint64_t pt; > >> >> >> > >> >> >> + const __m128i mask8 = _mm_set1_epi32(UINT8_MAX); > >> >> >> /* > >> >> >> * RTE_LPM_VALID_EXT_ENTRY_BITMASK for 4 LPM entries > >> >> >> * as one 64-bit value (0x0300030003000300). > >> >> >> @@ -412,20 +455,7 @@ rte_lpm_lookupx4(const struct rte_lpm *lpm, __m128i ip, uint16_t hop[4], > >> >> >> (uint64_t)RTE_LPM_LOOKUP_SUCCESS << 32 | > >> >> >> (uint64_t)RTE_LPM_LOOKUP_SUCCESS << 48); > >> >> >> > >> >> >> - /* get 4 indexes for tbl24[]. */ > >> >> >> - i24 = _mm_srli_epi32(ip, CHAR_BIT); > >> >> >> - > >> >> >> - /* extract values from tbl24[] */ > >> >> >> - idx = _mm_cvtsi128_si64(i24); > >> >> >> - i24 = _mm_srli_si128(i24, sizeof(uint64_t)); > >> >> >> - > >> >> >> - tbl[0] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > >> >> >> - tbl[1] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > >> >> >> - > >> >> >> - idx = _mm_cvtsi128_si64(i24); > >> >> >> - > >> >> >> - tbl[2] = *(const uint16_t *)&lpm->tbl24[(uint32_t)idx]; > >> >> >> - tbl[3] = *(const uint16_t *)&lpm->tbl24[idx >> 32]; > >> >> >> + rte_lpm_tbl24_val4(lpm, ip, tbl); > >> >> >> > >> >> >> /* get 4 indexes for tbl8[]. */ > >> >> >> i8.x = _mm_and_si128(ip, mask8); > >> >> >> -- > >> >> >> 1.8.3.1 > >> >> >>