From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from mga03.intel.com (mga03.intel.com [134.134.136.65])
	by dpdk.org (Postfix) with ESMTP id 11C652A07
	for ; Mon, 3 Oct 2016 11:59:10 +0200 (CEST)
Received: from orsmga002.jf.intel.com ([10.7.209.21])
	by orsmga103.jf.intel.com with ESMTP; 03 Oct 2016 02:59:09 -0700
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.31,289,1473145200"; d="scan'208";a="1059720658"
Received: from bricha3-mobl3.ger.corp.intel.com ([10.237.221.44])
	by orsmga002.jf.intel.com with SMTP; 03 Oct 2016 02:59:10 -0700
Received: by (sSMTP sendmail emulation); Mon, 03 Oct 2016 10:59:07 +0025
Date: Mon, 3 Oct 2016 10:59:07 +0100
From: Bruce Richardson
To: Pablo de Lara
Cc: dev@dpdk.org
Message-ID: <20161003095906.GA83136@bricha3-MOBL3>
References: <1473190397-120741-1-git-send-email-pablo.de.lara.guarch@intel.com>
	<1475221136-213246-1-git-send-email-pablo.de.lara.guarch@intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <1475221136-213246-1-git-send-email-pablo.de.lara.guarch@intel.com>
Organization: Intel Research and Development Ireland Ltd.
User-Agent: Mutt/1.5.23 (2014-03-12)
Subject: Re: [dpdk-dev] [PATCH v4 0/4] Cuckoo hash enhancements
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.15
Precedence: list
List-Id: patches and discussions about DPDK
X-List-Received-Date: Mon, 03 Oct 2016 09:59:11 -0000

On Fri, Sep 30, 2016 at 08:38:52AM +0100, Pablo de Lara wrote:
> This patchset improves lookup performance on the current hash library
> by changing the existing lookup bulk pipeline, with an improved pipeline,
> based on a loop-and-jump model, instead of the current 4-stage 2-entry pipeline.
> Also, x86 vectorized intrinsics are used to improve performance when comparing signatures.
>
> First patch reorganizes the order of the hash structure.
> The structure takes more than one 64-byte cache line, but not all
> the fields are used in the lookup operation (the most common operation).
> Therefore, the fields used in lookup have been moved to the first part of the structure,
> so they all fit in one cache line, slightly improving performance in some
> scenarios.
>
> Second patch modifies the order of the bucket structure.
> Currently, the buckets store all the signatures together (current and alternative).
> In order to be able to perform a vectorized signature comparison,
> all current signatures have to be together, so the order of the bucket has been changed,
> separating all the current signatures from the alternative signatures.
>
> Third patch introduces x86 vectorized intrinsics.
> When performing a lookup bulk operation, all current signatures in a bucket
> are compared against the signature of the key being looked up.
> Now that they are all together, a vectorized comparison can be performed,
> which takes fewer instructions to carry out.
> On a machine with AVX2, the number of entries per bucket is
> increased from 4 to 8, as AVX2 allows comparing two 256-bit values, with 8x32-bit integers,
> which are the 8 signatures in the bucket.
>
> Fourth (and last) patch modifies the current pipeline of the lookup bulk function.
> The new pipeline is based on a loop-and-jump model. The two key improvements are:
>
> - Better prefetching: in this case, the first 4 keys to be looked up are prefetched,
>   and after that, the rest of the keys are prefetched while the calculation
>   of the signatures is being performed. This gives the CPU more time to
>   prefetch the requested data before it is actually needed, which results in fewer
>   cache misses and, therefore, higher throughput.
>
> - Lower performance penalty when using fallback: the lookup bulk algorithm
>   assumes that most of the time there will not be a collision in a bucket, but it might
>   happen that two or more signatures are equal, which means that more than one
>   key comparison might be necessary. In that case, only the key of the first hit is prefetched,
>   as in the current implementation. The difference now is that if this comparison
>   results in a miss, the information of the other keys to be compared has been stored,
>   unlike the current implementation, which needs to perform an entire simple lookup again.
>
> Changes in v4:
> - Reordered hash structure, so alt signature is at the start
>   of the next cache line, and explained in the commit message
>   why it has been moved
> - Reordered hash structure, so name field is on top of the structure,
>   leaving all the fields used in lookup in the next cache line
>   (instead of the first cache line)
>
> Changes in v3:
> - Corrected the cover letter (wrong number of patches)
>
> Changes in v2:
> - Increased entries per bucket from 4 to 8 for all cases,
>   so it is not architecture dependent any longer.
> - Replaced compile-time signature comparison function selection
>   with run-time selection, so the best optimization available
>   will be used from a single binary.
> - Reordered the hash structure, so all the fields used by lookup
>   are in the same cache line (first).
>
> Byron Marohn (3):
>   hash: reorganize bucket structure
>   hash: add vectorized comparison
>   hash: modify lookup bulk pipeline
>

Hi,

Firstly, checkpatches is reporting some style errors in these patches.

Secondly, when I run the "hash_multiwriter_autotest" I get what I assume to be
an error after applying this patchset. Before this set is applied, running that
test shows the cycles per insert with/without lock elision. Now, though, I'm
getting an error about a key being dropped or failing to insert in the lock
elision case, e.g.

  Core #2 inserting 1572864: 0 - 1,572,864
  key 1497087 is lost
  1 key lost

I've run the test a number of times, and there is a single key lost each time.
Please check on this: is it expected, or is it a problem?

Thanks,
/Bruce
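
For reference while reading the quoted cover letter: the bucket reordering
(patch 2), the vectorized signature comparison (patch 3) and the "keep the
other hits" fallback (patch 4) can be sketched roughly as below. This is only
an illustration under assumptions, not the code from the patches: every
sketch_* name, the struct layout and the flat keys array are hypothetical, and
it uses two 128-bit SSE2 compares with a scalar fallback rather than the AVX2
path the cover letter mentions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#ifdef __SSE2__
#include <emmintrin.h>
#endif

#define SKETCH_BUCKET_ENTRIES 8 /* 8 entries per bucket, as in v2 of the series */

/* Hypothetical layout, NOT the real rte_hash bucket definition. */
struct sketch_bucket {
	uint32_t sig_current[SKETCH_BUCKET_ENTRIES]; /* contiguous for vector loads */
	uint32_t key_idx[SKETCH_BUCKET_ENTRIES];     /* index into the key store */
	uint32_t sig_alt[SKETCH_BUCKET_ENTRIES];     /* kept apart from current sigs */
};

/* Scalar reference: bit i is set when sig_current[i] == sig. */
static inline uint32_t
sketch_compare_scalar(const struct sketch_bucket *bkt, uint32_t sig)
{
	uint32_t mask = 0;
	unsigned int i;

	for (i = 0; i < SKETCH_BUCKET_ENTRIES; i++)
		mask |= (uint32_t)(bkt->sig_current[i] == sig) << i;
	return mask;
}

/* Same result as the scalar version, using SSE2 when it is available. */
static inline uint32_t
sketch_compare(const struct sketch_bucket *bkt, uint32_t sig)
{
#ifdef __SSE2__
	const __m128i needle = _mm_set1_epi32((int)sig);
	const __m128i lo = _mm_loadu_si128((const __m128i *)&bkt->sig_current[0]);
	const __m128i hi = _mm_loadu_si128((const __m128i *)&bkt->sig_current[4]);
	/* movemask_ps collects the top bit of each 32-bit lane: one bit per entry. */
	uint32_t m_lo = (uint32_t)_mm_movemask_ps(
			_mm_castsi128_ps(_mm_cmpeq_epi32(lo, needle)));
	uint32_t m_hi = (uint32_t)_mm_movemask_ps(
			_mm_castsi128_ps(_mm_cmpeq_epi32(hi, needle)));

	return m_lo | (m_hi << 4);
#else
	return sketch_compare_scalar(bkt, sig);
#endif
}

/*
 * Walk every signature hit in the bucket. Returns the matching key index,
 * or -1 if no stored key matches. Keys are plain uint64_t values here
 * purely to keep the sketch short.
 */
static inline int32_t
sketch_lookup(const struct sketch_bucket *bkt, uint32_t sig,
	      const uint64_t *keys, uint64_t key)
{
	uint32_t hits;

	assert(bkt != NULL && keys != NULL);
	hits = sketch_compare(bkt, sig);
	while (hits != 0) {
		unsigned int i = 0;

		while (((hits >> i) & 1u) == 0) /* find lowest set bit */
			i++;
		if (keys[bkt->key_idx[i]] == key)
			return (int32_t)bkt->key_idx[i];
		hits &= hits - 1; /* clear that bit, try the next candidate */
	}
	return -1;
}
```

Keeping the whole hit mask around is the point of the last function: when the
first key comparison misses, the remaining candidates are still known and can
be checked without redoing the lookup from scratch.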