From: "Gobriel, Sameh"
To: "De Lara Guarch, Pablo", "dev@dpdk.org"
CC: "Richardson, Bruce"
Date: Fri, 30 Sep 2016 19:53:14 +0000
Subject: Re: [dpdk-dev] [PATCH v4 0/4] Cuckoo hash enhancements
References: <1473190397-120741-1-git-send-email-pablo.de.lara.guarch@intel.com> <1475221136-213246-1-git-send-email-pablo.de.lara.guarch@intel.com>
In-Reply-To: <1475221136-213246-1-git-send-email-pablo.de.lara.guarch@intel.com>

> -----Original Message-----
> From: dev [mailto:dev-bounces@dpdk.org] On Behalf Of De Lara Guarch, Pablo
> Sent: Friday, September 30, 2016 12:39 AM
> To: dev@dpdk.org
> Cc: Richardson, Bruce; De Lara Guarch, Pablo
> Subject: [dpdk-dev] [PATCH v4 0/4] Cuckoo hash enhancements
>
> This patchset improves lookup performance on the current hash library by
> replacing the existing lookup bulk pipeline with an improved pipeline based
> on a loop-and-jump model, instead of the current 4-stage 2-entry pipeline.
> Also, x86 vectorized intrinsics are used to improve performance when
> comparing signatures.
>
> First patch reorganizes the order of the hash structure.
> The structure takes more than one 64-byte cache line, but not all the fields
> are used in the lookup operation (the most common operation).
> Therefore, all the fields used in lookup have been moved to the first part
> of the structure, so they all fit in one cache line, slightly improving
> performance in some scenarios.
>
> Second patch modifies the order of the bucket structure.
> Currently, the buckets store all the signatures together (current and
> alternative).
> In order to be able to perform a vectorized signature comparison, all current
> signatures have to be together, so the order of the bucket has been changed,
> separating all the current signatures from the alternative signatures.
>
> Third patch introduces x86 vectorized intrinsics.
> When performing a lookup bulk operation, all current signatures in a bucket
> are compared against the signature of the key being looked up.
> Now that they are all together, a vectorized comparison can be performed,
> which takes fewer instructions.
> On a machine with AVX2, the number of entries per bucket is increased from
> 4 to 8, as AVX2 allows comparing two 256-bit values, with 8x32-bit integers,
> which are the 8 signatures in the bucket.
>
> Fourth (and last) patch modifies the current pipeline of the lookup bulk
> function.
> The new pipeline is based on a loop-and-jump model.
> The two key improvements are:
>
> - Better prefetching: the first 4 keys to be looked up are prefetched,
>   and after that, the rest of the keys are prefetched while the signatures
>   are being calculated. This gives the CPU more time to prefetch the
>   requested data before it is actually needed, which results in fewer
>   cache misses and therefore higher throughput.
>
> - Lower performance penalty when using fallback: the lookup bulk algorithm
>   assumes that most times there will not be a collision in a bucket, but it
>   might happen that two or more signatures are equal, which means that more
>   than one key comparison might be necessary. In that case, only the key of
>   the first hit is prefetched, like in the current implementation. The
>   difference now is that if this comparison results in a miss, the
>   information of the other keys to be compared has been stored, unlike the
>   current implementation, which needs to perform an entire simple lookup
>   again.
>
> Changes in v4:
> - Reordered hash structure, so alt signature is at the start
>   of the next cache line, and explained in the commit message
>   why it has been moved
> - Reordered hash structure, so name field is on top of the structure,
>   leaving all the fields used in lookup in the next cache line
>   (instead of the first cache line)
>
> Changes in v3:
> - Corrected the cover letter (wrong number of patches)
>
> Changes in v2:
> - Increased entries per bucket from 4 to 8 for all cases,
>   so it is not architecture dependent any longer.
> - Replaced compile-time signature comparison function selection
>   with run-time selection, so best optimization available
>   will be used from a single binary.
> - Reordered the hash structure, so all the fields used by lookup
>   are in the same cache line (first).
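
For readers following the thread, here is a minimal sketch of the vectorized signature comparison described in the third patch. It is an illustration only, not the code from rte_cuckoo_hash_x86.h: the names sig_bucket, SIG_BUCKET_ENTRIES and sig_compare_mask are hypothetical, and the layout is only assumed to keep the 8 current signatures contiguous as the cover letter describes. The AVX2 path is compiled only when the compiler defines __AVX2__; otherwise a scalar loop produces the same bitmask.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __AVX2__
#include <immintrin.h>
#endif

/* Hypothetical constant: 8 entries per bucket, as stated in the cover letter. */
#define SIG_BUCKET_ENTRIES 8

/* Hypothetical bucket layout: current signatures kept together,
 * separate from the alternative signatures. */
struct sig_bucket {
	uint32_t sig_current[SIG_BUCKET_ENTRIES];
	uint32_t sig_alt[SIG_BUCKET_ENTRIES];
};

/* Return a bitmask where bit i is set if sig_current[i] == sig. */
static inline uint32_t
sig_compare_mask(const struct sig_bucket *bkt, uint32_t sig)
{
	assert(bkt != NULL);
#ifdef __AVX2__
	/* One 256-bit compare covers all 8 32-bit signatures. */
	__m256i stored = _mm256_loadu_si256((const __m256i *)bkt->sig_current);
	__m256i wanted = _mm256_set1_epi32((int)sig);
	__m256i equal = _mm256_cmpeq_epi32(stored, wanted);
	/* Collect the top bit of each 32-bit lane into an 8-bit mask. */
	return (uint32_t)_mm256_movemask_ps(_mm256_castsi256_ps(equal));
#else
	/* Scalar fallback: same result, one comparison per entry. */
	uint32_t mask = 0;
	unsigned int i;

	for (i = 0; i < SIG_BUCKET_ENTRIES; i++)
		mask |= (uint32_t)(bkt->sig_current[i] == sig) << i;
	return mask;
#endif
}
```

A caller would test the returned mask for zero (no candidate in this bucket) and otherwise walk its set bits to find the entries whose keys still need a full comparison.
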
>
> Byron Marohn (3):
>   hash: reorganize bucket structure
>   hash: add vectorized comparison
>   hash: modify lookup bulk pipeline
>
> Pablo de Lara (1):
>   hash: reorder hash structure
>
>  lib/librte_hash/rte_cuckoo_hash.c     | 455 ++++++++++++++--------------------
>  lib/librte_hash/rte_cuckoo_hash.h     |  56 +++--
>  lib/librte_hash/rte_cuckoo_hash_x86.h |  20 +-
>  3 files changed, 228 insertions(+), 303 deletions(-)
>
> --
> 2.7.4

Series-acked-by: Sameh Gobriel
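
As an aside on the fallback improvement in the fourth patch, the idea of keeping the information about the remaining signature hits can be sketched as below. This is a simplified, single-key illustration under stated assumptions, not the bulk pipeline from the patch: demo_bucket, DEMO_KEY_LEN, the key_idx field and lookup_with_hitmask are all hypothetical names, keys are assumed to be fixed-length byte arrays, and prefetching is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sizes chosen for the illustration. */
#define DEMO_BUCKET_ENTRIES 8
#define DEMO_KEY_LEN 16

/* Hypothetical bucket: one signature and one key-store index per entry. */
struct demo_bucket {
	uint32_t sig_current[DEMO_BUCKET_ENTRIES];
	uint32_t key_idx[DEMO_BUCKET_ENTRIES];
};

/*
 * Look up `key` with signature `sig` in one bucket.
 * Returns the key-store index on a hit, or -1 on a miss.
 */
static int
lookup_with_hitmask(const struct demo_bucket *bkt,
		    const uint8_t (*keys)[DEMO_KEY_LEN],
		    uint32_t sig, const uint8_t *key)
{
	uint32_t hitmask = 0;
	unsigned int slot;

	assert(bkt != NULL && keys != NULL && key != NULL);

	/* Record every entry whose signature matches, not just the first. */
	for (slot = 0; slot < DEMO_BUCKET_ENTRIES; slot++)
		hitmask |= (uint32_t)(bkt->sig_current[slot] == sig) << slot;

	/*
	 * Compare keys for each recorded hit in turn. If the first key
	 * comparison misses, the remaining candidates are still in the
	 * mask, so no second full lookup is needed.
	 */
	for (slot = 0; hitmask != 0; slot++, hitmask >>= 1) {
		if ((hitmask & 1) == 0)
			continue;
		if (memcmp(keys[bkt->key_idx[slot]], key, DEMO_KEY_LEN) == 0)
			return (int)bkt->key_idx[slot];
	}
	return -1;
}
```

With three entries sharing one signature, a lookup for the third key misses on the first two key comparisons and still resolves from the stored mask in the same pass.
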