From: "De Lara Guarch, Pablo"
To: Thomas Monjalon, "Marohn, Byron"
Cc: dev@dpdk.org, "Richardson, Bruce", "Edupuganti, Saikrishna",
 jianbo.liu@linaro.org, chaozhu@linux.vnet.ibm.com,
 jerin.jacob@caviumnetworks.com
Date: Fri, 2 Sep 2016 17:05:10 +0000
Subject: Re: [dpdk-dev] [PATCH 2/3] hash: add vectorized comparison
In-Reply-To: <5721729.LXq7JRZ983@xps13>

> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas.monjalon@6wind.com]
> Sent: Saturday, August 27, 2016 1:58 AM
> To: De Lara Guarch, Pablo; Marohn, Byron
> Cc: dev@dpdk.org; Richardson, Bruce; Edupuganti, Saikrishna;
> jianbo.liu@linaro.org; chaozhu@linux.vnet.ibm.com;
> jerin.jacob@caviumnetworks.com
> Subject: Re: [dpdk-dev] [PATCH 2/3] hash: add vectorized comparison
>
> 2016-08-26 22:34, Pablo de Lara:
> > From: Byron Marohn
> >
> > In the lookup bulk function, the signatures of all entries
> > are compared against the signature of the key that is being looked up.
> > Now that all the signatures are together, they can be compared
> > with vector instructions (SSE, AVX2), achieving higher lookup performance.
> >
> > Also, entries per bucket are increased to 8 when using processors
> > with AVX2, as 256 bits can be compared at once, which is the size of
> > 8x32-bit signatures.
>
> Please, would it be possible to use the generic SIMD intrinsics?
> We could define generic types compatible with Altivec and NEON:
>     __attribute__ ((vector_size (n)))
> as described in https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html

I tried to convert these into generic code with gcc builtins,
but I couldn't find a way to translate the _mm_movemask intrinsic into a
generic builtin (which is needed for performance reasons).
Therefore, I think it is not possible to do this without penalizing performance.
Sure, we could try to translate the other intrinsics, but it would mean that
we would still need #ifdefs and end up with a mix of x86 intrinsics and gcc
builtins, so it is better to leave it this way.
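For illustration, below is a rough sketch of what the generic-vector version of
the 8-entry compare could look like (a sketch only; bucket_sig_matches, sig_vec
and cmp_vec are made-up names, not code from the patch). The lane-wise compare
maps cleanly onto __attribute__ ((vector_size (n))), but the final movemask
step has no generic builtin and degrades into a scalar loop:

#include <stdint.h>

/* 8 x 32-bit signatures, i.e. one AVX2-sized bucket (sketch, not patch code) */
typedef uint32_t sig_vec __attribute__ ((vector_size (32)));
typedef int32_t  cmp_vec __attribute__ ((vector_size (32)));

static inline uint32_t
bucket_sig_matches(const uint32_t *sig_current, uint32_t hash)
{
	/* assumes sig_current is 32-byte aligned, as in the bucket layout */
	const sig_vec sigs = *(const sig_vec *)sig_current;
	const sig_vec keys = {hash, hash, hash, hash, hash, hash, hash, hash};
	/* lane-wise compare: each lane becomes all-ones or all-zeros */
	const cmp_vec cmp = (sigs == keys);
	uint32_t matches = 0;
	unsigned int i;

	/* no generic equivalent of _mm256_movemask_ps(): collapse lanes one by one */
	for (i = 0; i < 8; i++)
		matches |= (uint32_t)(cmp[i] != 0) << i;

	return matches;
}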
> > +/* 8 entries per bucket */
> > +#if defined(__AVX2__)
>
> Please prefer
>     #ifdef RTE_MACHINE_CPUFLAG_AVX2
> Ideally the vector support could be checked at runtime:
>     if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
> It would allow packaging one binary using the best optimization available.

Good idea. Will submit a v2 with this change. It took me a bit of time to
figure out a way to do this without paying a big performance penalty.

> > +	*prim_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> > +			_mm256_load_si256((__m256i const *)prim_bkt->sig_current),
> > +			_mm256_set1_epi32(prim_hash)));
> > +	*sec_hash_matches |= _mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
> > +			_mm256_load_si256((__m256i const *)sec_bkt->sig_current),
> > +			_mm256_set1_epi32(sec_hash)));
> > +/* 4 entries per bucket */
> > +#elif defined(__SSE2__)
> > +	*prim_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> > +			_mm_load_si128((__m128i const *)prim_bkt->sig_current),
> > +			_mm_set1_epi32(prim_hash)));
> > +	*sec_hash_matches |= _mm_movemask_ps((__m128)_mm_cmpeq_epi16(
> > +			_mm_load_si128((__m128i const *)sec_bkt->sig_current),
> > +			_mm_set1_epi32(sec_hash)));
>
> In order to allow such a switch based on register size, we could have an
> abstraction in EAL supporting 128/256/512-bit widths for x86/ARM/POWER.
> I think aliasing RTE_MACHINE_CPUFLAG_ and RTE_CPUFLAG_ may be enough.
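For reference, one possible shape for a runtime check that avoids a per-lookup
penalty (a sketch with placeholder names such as select_sig_cmp_fn and
ENTRIES_PER_BUCKET, not the actual v2 code): probe the CPU flag once when the
table is created and store a function pointer, so each lookup pays only a
single indirect call.

#include <stdint.h>
#include <rte_cpuflags.h>	/* rte_cpu_get_flag_enabled(), RTE_CPUFLAG_AVX2 */

#define ENTRIES_PER_BUCKET 8	/* placeholder: 8 with AVX2, 4 otherwise */

typedef uint32_t (*sig_cmp_fn_t)(const uint32_t *sigs, uint32_t hash);

/* scalar fallback, always available */
static uint32_t
sig_cmp_scalar(const uint32_t *sigs, uint32_t hash)
{
	uint32_t matches = 0;
	unsigned int i;

	for (i = 0; i < ENTRIES_PER_BUCKET; i++)
		matches |= (uint32_t)(sigs[i] == hash) << i;
	return matches;
}

#ifdef RTE_MACHINE_CPUFLAG_AVX2
#include <immintrin.h>

/* AVX2 variant, only compiled when the build target supports it */
static uint32_t
sig_cmp_avx2(const uint32_t *sigs, uint32_t hash)
{
	/* assumes sigs is 32-byte aligned, as in the bucket layout */
	return (uint32_t)_mm256_movemask_ps((__m256)_mm256_cmpeq_epi32(
			_mm256_load_si256((const __m256i *)sigs),
			_mm256_set1_epi32(hash)));
}
#endif

/* called once at hash table creation to pick the best compare function */
static sig_cmp_fn_t
select_sig_cmp_fn(void)
{
#ifdef RTE_MACHINE_CPUFLAG_AVX2
	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_AVX2))
		return sig_cmp_avx2;
#endif
	return sig_cmp_scalar;
}

To really ship one binary with the best optimization, the AVX2 variant would
have to be compiled even when the default target lacks AVX2 (e.g. in a
separate file built with -mavx2), with only the runtime flag deciding whether
it is used.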