From mboxrd@z Thu Jan 1 00:00:00 1970
From: "De Lara Guarch, Pablo"
To: "Richardson, Bruce"
CC: "dev@dpdk.org"
Thread-Topic: [PATCH v4 0/4] Cuckoo hash enhancements
Date: Tue, 4 Oct 2016 06:50:36 +0000
References: <1473190397-120741-1-git-send-email-pablo.de.lara.guarch@intel.com>
 <1475221136-213246-1-git-send-email-pablo.de.lara.guarch@intel.com>
 <20161003095906.GA83136@bricha3-MOBL3>
In-Reply-To: <20161003095906.GA83136@bricha3-MOBL3>
Content-Type: text/plain; charset="us-ascii"
MIME-Version: 1.0
Subject: Re: [dpdk-dev] [PATCH v4 0/4] Cuckoo hash enhancements
List-Id: patches and discussions about DPDK

Hi Bruce,

> -----Original Message-----
> From: Richardson, Bruce
> Sent: Monday, October 03, 2016 2:59 AM
> To: De Lara Guarch, Pablo
> Cc: dev@dpdk.org
> Subject: Re: [PATCH v4 0/4] Cuckoo hash enhancements
>
> On Fri, Sep 30, 2016 at 08:38:52AM +0100, Pablo de Lara wrote:
> > This patchset improves lookup performance on the current hash library
> > by replacing the existing lookup bulk pipeline with an improved
> > pipeline based on a loop-and-jump model, instead of the current
> > 4-stage 2-entry pipeline.
> > Also, x86 vectorized intrinsics are used to improve performance when
> > comparing signatures.
> >
> > First patch reorganizes the order of the hash structure.
> > The structure takes more than one 64-byte cache line, but not all
> > the fields are used in the lookup operation (the most common
> > operation). Therefore, all the fields used in lookup have been moved
> > to the first part of the structure, so they all fit in one cache
> > line, slightly improving performance in some scenarios.
> >
> > Second patch modifies the order of the bucket structure.
> > Currently, the buckets store all the signatures together (current and
> > alternative).
> > In order to be able to perform a vectorized signature comparison,
> > all current signatures have to be together, so the order of the
> > bucket has been changed, separating all the current signatures from
> > the alternative signatures.
> >
> > Third patch introduces x86 vectorized intrinsics.
> > When performing a lookup bulk operation, all current signatures in a
> > bucket are compared against the signature of the key being looked up.
> > Now that they are all together, a vectorized comparison can be
> > performed, which takes fewer instructions to carry out.
> > On a machine with AVX2, the number of entries per bucket is
> > increased from 4 to 8, as AVX2 allows comparing two 256-bit values,
> > each with 8x32-bit integers, which are the 8 signatures in the
> > bucket.
> >
> > Fourth (and last) patch modifies the current pipeline of the lookup
> > bulk function.
> > The new pipeline is based on a loop-and-jump model. The two key
> > improvements are:
> >
> > - Better prefetching: in this case, the first 4 keys to be looked up
> >   are prefetched, and after that, the rest of the keys are prefetched
> >   while the signatures are being calculated. This gives the CPU more
> >   time to prefetch the requested data before it is actually needed,
> >   which results in fewer cache misses and therefore higher
> >   throughput.
> >
> > - Lower performance penalty when using fallback: the lookup bulk
> >   algorithm assumes that most times there will not be a collision in
> >   a bucket, but it might happen that two or more signatures are
> >   equal, which means that more than one key comparison might be
> >   necessary. In that case, only the key of the first hit is
> >   prefetched, like in the current implementation. The difference now
> >   is that if this comparison results in a miss, the information of
> >   the other keys to be compared has been stored, unlike the current
> >   implementation, which needs to perform an entire simple lookup
> >   again.
> >
> > Changes in v4:
> > - Reordered hash structure, so alt signature is at the start
> >   of the next cache line, and explain in the commit message
> >   why it has been moved
> > - Reordered hash structure, so name field is on top of the structure,
> >   leaving all the fields used in lookup in the next cache line
> >   (instead of the first cache line)
> >
> > Changes in v3:
> > - Corrected the cover letter (wrong number of patches)
> >
> > Changes in v2:
> > - Increased entries per bucket from 4 to 8 for all cases,
> >   so it is not architecture dependent any longer.
> > - Replaced compile-time signature comparison function selection
> >   with run-time selection, so best optimization available
> >   will be used from a single binary.
> > - Reordered the hash structure, so all the fields used by lookup
> >   are in the same cache line (first).
> >
> > Byron Marohn (3):
> >   hash: reorganize bucket structure
> >   hash: add vectorized comparison
> >   hash: modify lookup bulk pipeline
> >
>
> Hi,
>
> Firstly, checkpatches is reporting some style errors in these patches.
>
> Secondly, when I run the "hash_multiwriter_autotest" I get what I
> assume to be an error after applying this patchset. Before this set is
> applied, running that test shows the cycles per insert with/without
> lock elision. Now, though, I'm getting an error about a key being
> dropped or failing to insert in the lock elision case, e.g.
>
> Core #2 inserting 1572864: 0 - 1,572,864
> key 1497087 is lost
> 1 key lost
>
> I've run the test a number of times, and there is a single key lost
> each time.
> Please check on this, is it expected or is it a problem?

I am seeing that error even without the patchset. I am still
investigating it, but using "git bisect", it looks like the problem is
in commit 5fc74c2e146d ("hash: check if slot is empty with key index").

Thanks,
Pablo

>
> Thanks,
> /Bruce
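
For readers following the thread, the vectorized signature comparison
described in the quoted cover letter can be sketched roughly as below.
This is only an illustration under stated assumptions, not the code from
the patchset: the names bucket_sketch, BUCKET_ENTRIES and
match_signatures are hypothetical, and the sketch uses 4 entries with
SSE2 for brevity, whereas the cover letter describes 8 entries per
bucket. It shows why keeping the current signatures contiguous matters:
a single load and a single compare cover the whole bucket.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical bucket size for this sketch (the patchset uses 8). */
#define BUCKET_ENTRIES 4

/* Hypothetical layout: current signatures are stored contiguously,
 * separate from the alternative ones, so one vector load fetches them. */
struct bucket_sketch {
	uint32_t sig_current[BUCKET_ENTRIES];
	uint32_t sig_alt[BUCKET_ENTRIES];
};

/* Return a bitmask with bit i set when sig_current[i] equals sig. */
static inline unsigned int
match_signatures(const struct bucket_sketch *bkt, uint32_t sig)
{
#if defined(__SSE2__)
	/* Load the four current signatures in one go. */
	__m128i sigs = _mm_loadu_si128((const __m128i *)bkt->sig_current);
	/* Compare all four 32-bit lanes against the looked-up signature. */
	__m128i cmp = _mm_cmpeq_epi32(sigs, _mm_set1_epi32((int)sig));
	/* Collapse the per-lane results into a 4-bit hit mask. */
	return (unsigned int)_mm_movemask_ps(_mm_castsi128_ps(cmp));
#else
	/* Scalar fallback for builds without SSE2. */
	unsigned int mask = 0;
	unsigned int i;

	for (i = 0; i < BUCKET_ENTRIES; i++)
		mask |= (unsigned int)(bkt->sig_current[i] == sig) << i;
	return mask;
#endif
}
```

A mask with more than one bit set corresponds to the case mentioned in
the cover letter where two or more signatures in a bucket are equal, so
more than one key comparison may be needed.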