From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Thu, 11 Apr 2024 14:32:53 +0100
Subject: Re: [EXTERNAL] [PATCH v7 2/4] hash: optimize compare signature for NEON
To: Pavan Nikhilesh Bhagavatula, Yoan Picchi, Yipeng Wang, Sameh Gobriel, Bruce Richardson, Vladimir Medvedkin
Cc: dev@dpdk.org, nd@arm.com, Ruifeng Wang, Nathan Brown, Jerin Jacob
References: <20231020165159.1649282-1-yoan.picchi@arm.com> <20240312154215.802374-1-yoan.picchi@arm.com> <20240312154215.802374-3-yoan.picchi@arm.com>
From: Yoan Picchi
List-Id: DPDK patches and discussions

On 3/20/24 07:37, Pavan Nikhilesh Bhagavatula wrote:
>> Upon a successful comparison, NEON sets all the bits in the lane to 1.
>> We can skip shifting by simply masking with specific masks.
>>
>> Signed-off-by: Yoan Picchi
>> Reviewed-by: Ruifeng Wang
>> Reviewed-by: Nathan Brown
>> ---
>>  lib/hash/arch/arm/compare_signatures.h | 24 +++++++++++-------------
>>  1 file changed, 11 insertions(+), 13 deletions(-)
>>
>> diff --git a/lib/hash/arch/arm/compare_signatures.h b/lib/hash/arch/arm/compare_signatures.h
>> index 1af6ba8190..b5a457f936 100644
>> --- a/lib/hash/arch/arm/compare_signatures.h
>> +++ b/lib/hash/arch/arm/compare_signatures.h
>> @@ -30,23 +30,21 @@ compare_signatures_dense(uint16_t *hitmask_buffer,
>>  	switch (sig_cmp_fn) {
>>  #if RTE_HASH_BUCKET_ENTRIES <= 8
>>  	case RTE_HASH_COMPARE_NEON: {
>> -		uint16x8_t vmat, vsig, x;
>> -		int16x8_t shift = {0, 1, 2, 3, 4, 5, 6, 7};
>> -		uint16_t low, high;
>> +		uint16x8_t vmat, hit1, hit2;
>> +		const uint16x8_t mask = {0x1, 0x2, 0x4, 0x8, 0x10, 0x20, 0x40, 0x80};
>> +		const uint16x8_t vsig = vld1q_dup_u16((uint16_t const *)&sig);
>>
>> -		vsig = vld1q_dup_u16((uint16_t const *)&sig);
>>  		/* Compare all signatures in the primary bucket */
>> -		vmat = vceqq_u16(vsig,
>> -			vld1q_u16((uint16_t const *)prim_bucket_sigs));
>> -		x = vshlq_u16(vandq_u16(vmat, vdupq_n_u16(0x0001)), shift);
>> -		low = (uint16_t)(vaddvq_u16(x));
>> +		vmat = vceqq_u16(vsig, vld1q_u16(prim_bucket_sigs));
>> +		hit1 = vandq_u16(vmat, mask);
>>
>>  		/* Compare all signatures in the secondary bucket */
>> -		vmat = vceqq_u16(vsig,
>> -			vld1q_u16((uint16_t const *)sec_bucket_sigs));
>> -		x = vshlq_u16(vandq_u16(vmat, vdupq_n_u16(0x0001)), shift);
>> -		high = (uint16_t)(vaddvq_u16(x));
>> -		*hitmask_buffer = low | high << RTE_HASH_BUCKET_ENTRIES;
>> +		vmat = vceqq_u16(vsig, vld1q_u16(sec_bucket_sigs));
>> +		hit2 = vandq_u16(vmat, mask);
>>
>> +		hit2 = vshlq_n_u16(hit2, RTE_HASH_BUCKET_ENTRIES);
>> +		hit2 = vorrq_u16(hit1, hit2);
>> +		*hitmask_buffer = vaddvq_u16(hit2);
>
> Since vaddv is expensive could you convert it to vshrn?
>
> https://community.arm.com/arm-community-blogs/b/infrastructure-solutions-blog/posts/porting-x86-vector-bitmask-optimizations-to-arm-neon
>
> https://github.com/DPDK/dpdk/blob/main/examples/l3fwd/l3fwd_neon.h#L226

Thank you for those links, it was a good read. Unfortunately, I don't
think it is a good fit here. A decent part of the speedup I get comes
from using a dense hitmask, i.e. every bit counts, with no padding.
Using vshrn would leave 4 bits of padding per entry, and stripping
them out would be more expensive than a regular reduce.

>
>>  	}
>>  	break;
>>  #endif
>> --
>> 2.25.1
>