From: "Wang, Yipeng1"
To: Bing Zhao, "Gobriel, Sameh", "Richardson, Bruce", "De Lara Guarch, Pablo"
CC: "dev@dpdk.org"
Date: Thu, 5 Sep 2019 19:25:52 +0000
Subject: Re: [dpdk-dev] [RFC] hash: introduce resizable hash list

>-----Original Message-----
>From: Bing Zhao [mailto:bingz@mellanox.com]
>Sent: Tuesday, August 27, 2019 11:52 PM
>To: Wang, Yipeng1; Gobriel, Sameh; Richardson, Bruce; De Lara Guarch, Pablo
>Cc: dev@dpdk.org; bingz@mellanox.com
>Subject: [RFC] hash: introduce resizable hash list
>
>The current hash library already provides two hash tables and several
>hash calculation functions.
>
>The FBK hash table is very lightweight, but the key length is limited
>to 4 bytes and the size of the table is fixed at startup.
>
>The cuckoo hash table is a great idea and a nice implementation in the
>library: it is efficient and flexible.
After studying the code and related material, I find that
>table extension is a pain point (correct me if anything is wrong): the
>tables must be allocated in advance with a maximum size defined, so
>there is a chance that unused memory is wasted. Right now there is an
>extendable-buckets table, but its size also seems to be fixed and the
>same as that of the primary table.
>
>Take flow information as an example: we may need to support as many as
>several million flows. In one case only a few hundred flows are
>created, but in another, millions are needed. This means the hash
>table must be created at startup with capacity for millions of
>elements. If one flow key consumes 64B and there are 1M flows, the
>table will occupy more than a hundred MB of memory (table fields
>included). Worse, if there are only 2M plus a few elements, a table of
>4M (or more, depending on the hash collision rate) must be created,
>which wastes memory.
>
>To handle this, a resizable hash list is introduced. The table can
>start with a small number of elements and grow dynamically at runtime.
>An on-demand list-splitting mechanism keeps the cost of growing low,
>so there is no need to re-hash and relocate all the existing elements
>when the table is extended.
>
>The brief design is as below (a sketch of this scheme follows further
>down):
>1. The table consists of a LIST header array and the LIST entries.
>   Each entry stores a pre-calculated hash signature, which decides
>   the header it is linked to: the signature is ANDed with a mask to
>   select its LSBs.
>2. The header array can start very small, and a configured maximum
>   length per list is used to decide when a table extension is
>   needed.
>3. When any list reaches the maximum length, the header array size is
>   doubled, and the entries linked to that list are split into two
>   lists under the new mask (one more 1-bit in the mask, e.g.
>   0xfff -> 0x1fff). A global shift number, plus a local shift number
>   per list, is kept for later checks.
>4. When another list is accessed, its local shift number is compared
>   with the global one to check whether it still needs to be split.
>   If so, there are two cases:
>   a. If the local shift number is non-zero and only 1 less than the
>      global one, this list itself is the one to split.
>   b. If the gap is more than 1, the table has been extended more
>      than once; and if the local shift is zero, a mechanism is used
>      to find the last unsplit list.
>   The list is then split into 2/4/8... lists, depending on the gap,
>   and all its entries are re-linked to the proper headers.
>So each call to the hlist APIs traverses only one list, not all of
>them, and since the maximum number of entries per list is a
>parameter, the traversal time is predictable and does not cause a
>significant performance degradation.
>
>BR.
>Bing
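[A minimal, self-contained sketch of the scheme quoted above (steps 1-4).
This is not the RFC's rte_hlist code; all names, and details such as the
exact shift bookkeeping, are one possible reading of the description.
Error handling is omitted for brevity.]

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define INIT_SIZE 4u /* initial number of list heads; must be a power of 2 */

struct entry {
	struct entry *next;
	uint32_t sig;  /* pre-calculated hash signature */
	uint32_t key;  /* toy key; a real entry would hold key + user data */
};

struct head {
	struct entry *first;
	uint32_t shift; /* local shift: doublings this list has caught up with */
	uint32_t len;
};

struct table {
	struct head *heads;
	uint32_t gshift;  /* global shift: how many times the array doubled */
	uint32_t max_len; /* per-list length that triggers a doubling */
};

static uint32_t mask_at(uint32_t shift)
{
	return (INIT_SIZE << shift) - 1;
}

/* Walk from the newest mask back to older ones until the stored local
 * shift matches: the "find the last unsplit list" step (4b). */
static uint32_t owner_of(const struct table *t, uint32_t sig)
{
	uint32_t d = t->gshift;

	while (t->heads[sig & mask_at(d)].shift != d)
		d--;
	return sig & mask_at(d);
}

/* Split one stale list: re-link its entries under the current mask and
 * mark it and the 2/4/8... lists derived from it as up to date (4a/4b). */
static void split(struct table *t, uint32_t idx)
{
	uint32_t step = INIT_SIZE << t->heads[idx].shift;
	struct entry *e = t->heads[idx].first;
	struct entry *next;
	uint32_t j;

	t->heads[idx].first = NULL;
	t->heads[idx].len = 0;
	for (j = idx; j <= mask_at(t->gshift); j += step)
		t->heads[j].shift = t->gshift;
	for (; e != NULL; e = next) {
		uint32_t h = e->sig & mask_at(t->gshift);

		next = e->next;
		e->next = t->heads[h].first;
		t->heads[h].first = e;
		t->heads[h].len++;
	}
}

/* Insert touches only the owning list; the head array is doubled, and
 * just that one list re-linked, when the list exceeds max_len (2/3). */
static void insert(struct table *t, uint32_t sig, uint32_t key)
{
	uint32_t idx = owner_of(t, sig);
	struct entry *e;

	if (t->heads[idx].shift != t->gshift)
		split(t, idx); /* on-demand split on access (4) */
	idx = sig & mask_at(t->gshift);
	if (t->heads[idx].len >= t->max_len) {
		uint32_t old_n = INIT_SIZE << t->gshift;

		t->heads = realloc(t->heads, 2 * old_n * sizeof(*t->heads));
		memset(&t->heads[old_n], 0, old_n * sizeof(*t->heads));
		t->gshift++;   /* fresh lists keep local shift 0 (4b) */
		split(t, idx); /* only the overflowing list is re-hashed */
		idx = sig & mask_at(t->gshift);
	}
	e = malloc(sizeof(*e));
	e->sig = sig;
	e->key = key;
	e->next = t->heads[idx].first;
	t->heads[idx].first = e;
	t->heads[idx].len++;
}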
>
>
>Bing Zhao (1):
>  rte_hash: introduce hash list into hash lib
>
> lib/librte_hash/Makefile             |   2 +
> lib/librte_hash/rte_hash_version.map |  15 +
> lib/librte_hash/rte_hlist.c          | 687 ++++++++++++++++++++++++++++
> lib/librte_hash/rte_hlist.h          | 563 +++++++++++++++++++++++++
> 4 files changed, 1267 insertions(+)
> create mode 100644 lib/librte_hash/rte_hlist.c
> create mode 100644 lib/librte_hash/rte_hlist.h
>
>--
>1.8.3.1

[Wang, Yipeng]
Hi, Bing,

Table resizing is very important, and I think it is a critical feature to
have in DPDK's hash library. Thanks a lot for putting effort into this; I
hope we can work out a solution based on your patch.

My major concern with the current implementation is the following:

Although we still have the FBK hash table implementation, the goal here is
to have users move to the main hash table implementation (the cuckoo
algorithm), since it is the better one in most use cases. Many
optimizations have been made to the code base since the cuckoo hash was
introduced, and we do not want to reinvent the wheel: multi-threading
support, non-TSO machine support, SIMD optimizations, etc. Also, compared
to a linked list, the current bucketized table should provide better
lookup performance. Last but not least, we probably do not want fragmented
hash table APIs.

I would suggest first seeing whether we can add a resizing feature to the
current hash table implementation.
For example, we could use the current bucketized design to do the resizing
rather than the new linked-list-based data structure. This needs some
modifications, such as keeping a local bucket_shift variable in each
bucket, but I think it is feasible (see the sketch at the end of this
message).
We could also turn off the cuckoo path and the extendable linked list of
the current implementation while a resize is in progress, which would keep
the implementation simpler.
That way, we keep the current API and reuse much of the existing code.

Also, I wonder if you have a specific use case that benefits from the
gradual break-down approach? We could always stop the world and resize the
whole table at once.
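[For concreteness, a rough sketch of the direction suggested above: reuse
the bucketized table and give each bucket a local shift that is reconciled
lazily on access. The struct layout and names here are hypothetical,
invented for illustration; this is not rte_cuckoo_hash code.]

#include <stdint.h>

#define ENTRIES_PER_BUCKET 8

struct bucket {
	uint16_t sig[ENTRIES_PER_BUCKET];     /* short signatures */
	uint32_t key_idx[ENTRIES_PER_BUCKET]; /* indexes into a key store */
	uint8_t  shift;                       /* local bucket shift */
};

struct resizable_hash {
	struct bucket *buckets;
	uint32_t bucket_mask;  /* current number of buckets - 1 */
	uint8_t  global_shift; /* doublings of the bucket array */
};

/* Every lookup/insert first brings its bucket up to date: the same
 * lazy-split idea as in the RFC, applied per bucket rather than per
 * linked list, so the existing bucketized layout and API are kept. */
static struct bucket *get_bucket(struct resizable_hash *h, uint32_t hash)
{
	struct bucket *b = &h->buckets[hash & h->bucket_mask];

	if (b->shift != h->global_shift) {
		/* Re-home this bucket's entries under the current
		 * bucket_mask and set b->shift = h->global_shift.
		 * While a resize is pending, the cuckoo displacement
		 * path would be disabled, as suggested above.
		 * (Split logic not shown.) */
	}
	return b;
}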