DPDK patches and discussions
* [dpdk-dev] [RFC] hash: introduce resizable hash list
@ 2019-08-28  6:51 Bing Zhao
  2019-08-28  6:51 ` [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib Bing Zhao
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Bing Zhao @ 2019-08-28  6:51 UTC (permalink / raw)
  To: yipeng1.wang, sameh.gobriel, bruce.richardson, pablo.de.lara.guarch
  Cc: dev, bingz

The current hash library already provides two hash tables and
several hash calculation functions.

The FBK hash table is very lightweight, but the key length is
limited to 4 bytes and the table size is fixed at startup.

The cuckoo hash table is a great idea with a nice implementation in
the library; it is efficient and flexible. After studying the code
and some material from the internet, I find that table extension is
a pain point (correct me if anything is wrong): the tables must be
allocated in advance with a maximum size, so there is a chance that
unused memory is wasted. Right now there is an extendable buckets
table, but its size also seems to be fixed and the same as that of
the primary table.
Take flow information as an example: we may need to support as many
as several million flows. In some cases only a few hundred flows
will be created, but in others millions are needed, so the hash
table must be created with room for millions of elements at startup.
If one flow key consumes 64B, 1M flows already occupy more than
64MB for the keys alone, and with the table fields included it
exceeds one hundred MB of memory. Worse, if there are only 2M + a
few elements, a table of 4M (or more, depending on the hash
collision rate) must be created, which wastes memory.

To handle this, a resizable hash list is introduced.
The table can start with a small number of elements and grow
dynamically at runtime. Meanwhile, an on-demand list splitting
mechanism is used to avoid significant performance degradation:
there is no need to re-hash and relocate all the existing elements
in the table when it is extended.

The brief design is as below; a short sketch in C follows the list.
1. The table consists of a LIST header array and the LIST entries.
   Each entry stores a pre-calculated hash signature, which decides
   the header it is linked to: the signature is ANDed with a mask to
   select the LSBs.
2. The header array can start very small, and a configured maximum
   length per list is used to decide whether a table extension is
   needed.
3. When the maximum is reached on any list, the header array size is
   doubled. The entries linked to that list are then split into two
   lists with the new mask (one more set bit, e.g. 0xfff -> 0x1fff).
   A global shift number and a per-list local shift number are kept
   for further checking.
4. When another list is accessed, its shift number is compared with
   the global one to check whether this list still needs splitting.
   If so, there are two cases:
   a. If the local shift number is non-zero, or only 1 less than the
      global one, this list itself is the one to split.
   b. If the gap is more than 1, the table has been extended more
      than once. If the local shift is zero, a mechanism is used to
      find the last unsplit list.
   The list is then split into 2/4/8... lists, depending on the gap,
   and all the entries are re-linked to the proper headers.
So each time an hlist API is called, only one list is traversed, not
all of them. And since a parameter caps the number of entries per
list, the traversal time is predictable and will not cause a
significant performance degradation.
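
As a rough illustration, below is a minimal standalone model of the
mask-based list selection and the on-demand split of a single list
(plain C; this is not the patch code, all names are made up, and the
array doubling and bookkeeping are omitted):

#include <stdint.h>
#include <stddef.h>

struct node {
	struct node *next;
	uint32_t sig;            /* pre-calculated hash signature */
};

struct head {
	struct node *first;
	uint16_t shift;          /* local shift of this list */
};

struct table {
	struct head *t;          /* header array of (mask + 1) lists */
	uint32_t mask;           /* e.g. 0xfff; doubling -> 0x1fff */
	uint16_t shift;          /* global shift, +1 on each doubling */
};

/* Selecting a list: AND the signature with the mask (its LSBs). */
static inline uint32_t
list_index(const struct table *h, uint32_t sig)
{
	return sig & h->mask;
}

/* Splitting one stale list after the array has been doubled: each
 * node is re-linked to the index computed with the new, wider mask.
 * Only this list is touched; the others are split when accessed.
 */
static void
split_list(struct table *h, uint32_t idx)
{
	struct node **pp = &h->t[idx].first;

	while (*pp != NULL) {
		struct node *n = *pp;
		uint32_t nidx = n->sig & h->mask;

		if (nidx != idx) {
			/* The node now belongs to a newly added list. */
			*pp = n->next;
			n->next = h->t[nidx].first;
			h->t[nidx].first = n;
		} else {
			pp = &n->next;
		}
	}
	h->t[idx].shift = h->shift;	/* mark this list up to date */
}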

BR. Bing


Bing Zhao (1):
  rte_hash: introduce hash list into hash lib

 lib/librte_hash/Makefile             |   2 +
 lib/librte_hash/rte_hash_version.map |  15 +
 lib/librte_hash/rte_hlist.c          | 687 +++++++++++++++++++++++++++++++++++
 lib/librte_hash/rte_hlist.h          | 563 ++++++++++++++++++++++++++++
 4 files changed, 1267 insertions(+)
 create mode 100644 lib/librte_hash/rte_hlist.c
 create mode 100644 lib/librte_hash/rte_hlist.h

-- 
1.8.3.1



* [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib
  2019-08-28  6:51 [dpdk-dev] [RFC] hash: introduce resizable hash list Bing Zhao
@ 2019-08-28  6:51 ` Bing Zhao
  2019-08-28 11:50   ` Stephen Hemminger
                     ` (2 more replies)
  2019-09-05 19:25 ` [dpdk-dev] [RFC] hash: introduce resizable hash list Wang, Yipeng1
  2023-06-12 16:44 ` Stephen Hemminger
  2 siblings, 3 replies; 10+ messages in thread
From: Bing Zhao @ 2019-08-28  6:51 UTC (permalink / raw)
  To: yipeng1.wang, sameh.gobriel, bruce.richardson, pablo.de.lara.guarch
  Cc: dev, bingz

An extendible hash list library supporting add, delete and lookup
actions. The table is extended when the number of elements in any
list reaches the pre-set maximum, and the splitting of a list is done
passively when that list is accessed (add, del, lookup). Right now
the implementation is not multi-thread safe, and no table size
shrinking is supported.
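
A minimal usage sketch (error handling trimmed; 'struct flow_key'
and 'flow_ctx' below are placeholders, not part of this patch):

#include <stdio.h>
#include <stdint.h>
#include <rte_hlist.h>

struct flow_key { uint32_t src; uint32_t dst; };	/* placeholder */

static int
example(const struct flow_key *key, void *flow_ctx)
{
	struct rte_hlist_params params = {
		.name = "flow_table",
		.entries = 64,			/* small start, grows on demand */
		.entries_per_bucket = 4,	/* split threshold per list */
		.key_len = sizeof(*key),
		.socket_id = 0,
		.hash_func = NULL,		/* fall back to the default */
		.free_func = NULL,		/* caller owns custom data */
		.init_val = 0,
	};
	struct rte_hlist_table *h = rte_hlist_create(&params);
	struct rte_hlist_data_element *de;
	void *ctx;

	if (h == NULL)
		return -1;
	/* Add the key with a custom data pointer, or get the existing
	 * entry back (the data argument is ignored in that case).
	 */
	de = rte_hlist_add_key_data(h, key, flow_ctx);
	if (de != NULL && !rte_hlist_entry_is_new(de))
		printf("key already present\n");

	de = rte_hlist_lookup(h, key);			/* find it again */
	rte_hlist_del_key_return_data(h, key, &ctx);	/* ctx == flow_ctx */
	rte_hlist_free(h);
	return 0;
}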

Signed-off-by: Bing Zhao <bingz@mellanox.com>
---
 lib/librte_hash/Makefile             |   2 +
 lib/librte_hash/rte_hash_version.map |  15 +
 lib/librte_hash/rte_hlist.c          | 687 +++++++++++++++++++++++++++++++++++
 lib/librte_hash/rte_hlist.h          | 563 ++++++++++++++++++++++++++++
 4 files changed, 1267 insertions(+)
 create mode 100644 lib/librte_hash/rte_hlist.c
 create mode 100644 lib/librte_hash/rte_hlist.h

diff --git a/lib/librte_hash/Makefile b/lib/librte_hash/Makefile
index c8c435d..6de49dd 100644
--- a/lib/librte_hash/Makefile
+++ b/lib/librte_hash/Makefile
@@ -17,6 +17,7 @@ LIBABIVER := 2
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) := rte_cuckoo_hash.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += rte_fbk_hash.c
+SRCS-$(CONFIG_RTE_LIBRTE_HASH) += rte_hlist.c
 
 # install this header file
 SYMLINK-$(CONFIG_RTE_LIBRTE_HASH)-include := rte_hash.h
@@ -29,5 +30,6 @@ endif
 SYMLINK-$(CONFIG_RTE_LIBRTE_HASH)-include += rte_jhash.h
 SYMLINK-$(CONFIG_RTE_LIBRTE_HASH)-include += rte_thash.h
 SYMLINK-$(CONFIG_RTE_LIBRTE_HASH)-include += rte_fbk_hash.h
+SYMLINK-$(CONFIG_RTE_LIBRTE_HASH)-include += rte_hlist.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_hash/rte_hash_version.map b/lib/librte_hash/rte_hash_version.map
index 734ae28..9325f4f 100644
--- a/lib/librte_hash/rte_hash_version.map
+++ b/lib/librte_hash/rte_hash_version.map
@@ -59,4 +59,19 @@ EXPERIMENTAL {
 
 	rte_hash_free_key_with_position;
 
+	rte_hlist_add_key;
+	rte_hlist_add_key_data;
+	rte_hlist_add_key_data_len;
+	rte_hlist_add_key_with_hash_data;
+	rte_hlist_add_key_with_hash_data_len;
+	rte_hlist_clear_all_entries_with_cb;
+	rte_hlist_create;
+	rte_hlist_del_entry_fast_return_data;
+	rte_hlist_del_key;
+	rte_hlist_del_key_return_data;
+	rte_hlist_free;
+	rte_hlist_hash;
+	rte_hlist_lookup;
+	rte_hlist_lookup_with_hash;
+
 };
diff --git a/lib/librte_hash/rte_hlist.c b/lib/librte_hash/rte_hlist.c
new file mode 100644
index 0000000..d2e96fe
--- /dev/null
+++ b/lib/librte_hash/rte_hlist.c
@@ -0,0 +1,687 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2019 Mellanox Technologies, Ltd
+ */
+
+#include <string.h>
+
+#include <rte_common.h>
+#include <rte_eal_memconfig.h>
+#include <rte_malloc.h>
+#include <rte_memcpy.h>
+#include <rte_errno.h>
+#include <rte_log.h>
+
+#include "rte_hlist.h"
+
+TAILQ_HEAD(rte_hlist_list, rte_tailq_entry);
+
+static struct rte_tailq_elem rte_hlist_tailq = {
+	.name = "RTE_HLIST",
+};
+EAL_REGISTER_TAILQ(rte_hlist_tailq)
+
+hash_sig_t
+rte_hlist_hash(const struct rte_hlist_table *h, const void *key)
+{
+	/* calc hash result by key */
+	return h->hash_func(key, h->key_len, h->init_val);
+}
+
+static inline int
+__rte_hlist_extend_list_table(struct rte_hlist_table *h)
+{
+	struct rte_hlist_head_entry *new;
+	uint32_t dsize = (h->bucket_mask + 1) << 1;
+
+	new = (struct rte_hlist_head_entry *)rte_realloc(h->t,
+				sizeof(struct rte_hlist_head_entry) * dsize, 0);
+	if (new == NULL) {
+		rte_errno = ENOMEM;
+		return -rte_errno;
+	}
+	memset(((char *)new)+sizeof(struct rte_hlist_head_entry)*(dsize>>1),
+		0, sizeof(struct rte_hlist_head_entry)*(dsize>>1));
+
+	/* The first element's prev pointer must be adjusted to the new head */
+	h->t = new;
+	h->entries *= 2;
+	h->bucket_shift += 1;
+	h->bucket_mask = dsize - 1;
+
+	return 0;
+}
+
+static inline uint32_t
+__rte_hlist_find_pre_unsplit_list(
+	const struct rte_hlist_table *h, uint32_t idx)
+{
+	struct rte_hlist_head_entry *n_he;
+	uint32_t gap = (h->bucket_shift > h->t[idx].bucket_shift) ?
+		(h->bucket_shift - h->t[idx].bucket_shift) : 0;
+	uint32_t i;
+	uint32_t p_idx = idx;
+
+	/* If the shift number is not zero, just return the one from input. */
+	for (i = 0; i <= gap; i++) {
+		p_idx = idx &
+			HLIST_CALC_PREVIOUS_BUCKET_MASK(h->bucket_mask, i);
+		n_he = &h->t[p_idx];
+		if (n_he->bucket_shift != 0)
+			break;
+	}
+
+	return p_idx;
+}
+
+static inline int
+__rte_hlist_split_one_list(const struct rte_hlist_table *h,
+	uint32_t idx)
+{
+	struct rte_hlist_node_entry *pos;
+	struct rte_hlist_node_entry *tp;
+	struct rte_hlist_head_entry *he = &h->t[idx];
+	struct rte_hlist_head_entry *n_he;
+	uint32_t new_idx;
+	uint32_t sh;
+	uint32_t sh_gap_mul;
+	uint32_t i = 0;
+
+	if (h->bucket_shift <= he->bucket_shift) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	/* adjust the '(elm)->field.le_prev' of first element to the new head */
+	LIST_MOVE_TO_NEW_HEAD(&he->head, &he->head, next);
+
+	/* TAILQ would make it easy to combine lists when shrinking the
+	 * table, but it costs a little more memory for each head. No need
+	 * to re-calculate the hash value.
+	 */
+	LIST_FOREACH_SAFE(pos, &he->head, next, tp) {
+		new_idx = pos->d.sig & h->bucket_mask;
+		if (new_idx != idx) {
+			LIST_REMOVE(pos, next);
+			he->entries_in_bucket--;
+			n_he = &h->t[new_idx];
+			LIST_INSERT_HEAD(&n_he->head, pos, next);
+			n_he->entries_in_bucket++;
+		}
+	}
+
+	/* old_mask to speed up:
+	 * [0, old_mask+1, 2(old_mask+1) ... <mask] | idx
+	 */
+	do { i++; } while (h->bucket_mask >> i);
+	sh = i - (h->bucket_shift - he->bucket_shift);
+	sh_gap_mul = 1 << (h->bucket_shift - he->bucket_shift);
+	for (i = 0; i < sh_gap_mul; i++) {
+		new_idx = (i << sh) | idx;
+		h->t[new_idx].bucket_shift = h->bucket_shift;
+	}
+
+	return 0;
+}
+
+static inline int
+__rte_hlist_find_node_entry(struct rte_hlist_table *h,
+	const void *key, hash_sig_t sig, struct rte_hlist_node_entry **p)
+{
+	uint32_t idx;
+	uint32_t n_idx;
+	struct rte_hlist_head_entry *he;
+	struct rte_hlist_head_entry *n_he;
+	struct rte_hlist_node_entry *pos;
+	uint32_t matched = FALSE;
+	int ret;
+
+	if ((h == NULL) || (key == NULL) || (p == NULL)) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	idx = sig & h->bucket_mask;
+	he = &h->t[idx];
+
+	/* When the entries linked to this list reach the upper limit, try to
+	 * extend the length of the head array, then split only this list.
+	 * The others will be checked and split when they are accessed.
+	 * The shift number also needs checking in case of extra extending.
+	 */
+	if ((he->entries_in_bucket > h->entries_per_bucket) &&
+		(he->bucket_shift == h->bucket_shift)) {
+		/* If the list operation failed, it returns nothing even if
+		 * the key will match one of the existing entries.
+		 */
+		ret = __rte_hlist_extend_list_table(h);
+		if (ret < 0) {
+			RTE_LOG(ERR, HASH,
+				"Failed to extend the list table %d\n", ret);
+			goto exit;
+		}
+	}
+
+	if (he->bucket_shift < h->bucket_shift) {
+		uint32_t p_idx = __rte_hlist_find_pre_unsplit_list(h, idx);
+
+		/* No matter how many times the table was extended, one
+		 * split is enough. If the shift is 0, the 'oldest' unsplit
+		 * list should be split (whether it holds any entries or
+		 * not). If not zero, just split this list and move the
+		 * entries into the new one.
+		 */
+		ret = __rte_hlist_split_one_list(h, p_idx);
+		if (ret < 0) {
+			RTE_LOG(ERR, HASH,
+				"Failed to split the bucket %d\n", ret);
+			goto exit;
+		}
+	}
+
+	/* Need to re-do since the mask & pointer may change if extended */
+	n_idx = sig & h->bucket_mask;
+	n_he = &h->t[n_idx];
+
+	/* Passive way: just split the list that being accessed.
+	 * After splitting, only one list needs to be traversed.
+	 */
+	if (n_he->entries_in_bucket > 0) {
+		LIST_FOREACH(pos, &n_he->head, next) {
+			if (pos->d.sig == sig)
+				/* Key comparison needs to be optimized later */
+				if (!memcmp(pos->key, key, h->key_len)) {
+					matched = TRUE;
+					break;
+				}
+		}
+	}
+
+	if (matched == TRUE) {
+		*p = pos;
+		/* The head index will always be the calculated one
+		 * because of the splitting of the list.
+		 */
+		ret = n_idx;
+	} else {
+		*p = NULL;
+		ret = -ENOENT;
+	}
+
+exit:
+	return ret;
+}
+
+static inline struct rte_hlist_data_element *
+__rte_hlist_add_key_with_hash_data(struct rte_hlist_table *h,
+	const void *key, hash_sig_t sig, void *data, uint16_t len, uint8_t cus)
+{
+	int idx;
+	struct rte_hlist_head_entry *he;
+	struct rte_hlist_node_entry *pos;
+
+	idx = __rte_hlist_find_node_entry(h, key, sig, &pos);
+	if (idx >= 0) {
+		pos->d.flags &= (~HLIST_DATA_NEW_ENTRY);
+		return &pos->d;
+	} else if ((idx < 0) && (idx != -ENOENT))
+		return NULL;
+
+	/* All the fields will be written */
+	pos = rte_malloc(NULL,
+		sizeof(struct rte_hlist_node_entry) + h->key_len, 0);
+	if (pos == NULL) {
+		RTE_LOG(ERR, HASH, "Failed to allocate new list node\n");
+		return NULL;
+	}
+
+	pos->d.sig = sig;
+	/* should be optimized if the key length is small */
+	rte_memcpy(pos->key, key, h->key_len);
+	/* User should be responsible for the data no matter how many bytes
+	 * the length is.
+	 */
+	if (cus == TRUE) {
+		pos->d.extra_data = data;
+		pos->d.flags = HLIST_DATA_CUSTOM_EXTRA_DATA;
+		h->custom_entries++;
+	} else {
+		if ((data != NULL) && (len != 0)) {
+			pos->d.flags = HLIST_DATA_INLINE;
+			switch (len) {
+			case 1:
+				pos->d.v8 = *(uint8_t *)data;
+				break;
+			case 2:
+				pos->d.v16 = *(uint16_t *)data;
+				break;
+			case 4:
+				pos->d.v32 = *(uint32_t *)data;
+				break;
+			case 8:
+				pos->d.v64 = *(uint64_t *)data;
+				break;
+			default:
+				pos->d.extra_data = rte_malloc(NULL, len, 0);
+				if (pos->d.extra_data == NULL) {
+					rte_free(pos);
+					rte_errno = ENOMEM;
+					return NULL;
+				}
+				rte_memcpy(pos->d.extra_data, data, len);
+				pos->d.flags = HLIST_DATA_ALLOC_WITH_SIZE(len);
+				break;
+			}
+		} else {
+			pos->d.extra_data = data;
+			pos->d.flags = HLIST_DATA_NOT_EXIST;
+		}
+	}
+	pos->d.flags |= HLIST_DATA_NEW_ENTRY;
+	idx = sig & h->bucket_mask;
+	he = &h->t[idx];
+	LIST_INSERT_HEAD(&he->head, pos, next);
+	he->entries_in_bucket++;
+
+	return &pos->d;
+}
+
+struct rte_hlist_data_element *
+rte_hlist_add_key(struct rte_hlist_table *h, const void *key)
+{
+	RETURN_IF_TRUE_ERRNO(((h == NULL) || (key == NULL)), NULL, -EINVAL);
+
+	return __rte_hlist_add_key_with_hash_data(h, key,
+			rte_hlist_hash(h, key), NULL, 0, FALSE);
+}
+
+struct rte_hlist_data_element *
+rte_hlist_add_key_data(struct rte_hlist_table *h,
+	const void *key, void *data)
+{
+	RETURN_IF_TRUE_ERRNO(((h == NULL) || (key == NULL)), NULL, -EINVAL);
+
+	return __rte_hlist_add_key_with_hash_data(h, key,
+			rte_hlist_hash(h, key), data, 0, TRUE);
+}
+
+struct rte_hlist_data_element *
+rte_hlist_add_key_data_len(struct rte_hlist_table *h,
+	const void *key, void *data, uint16_t len, uint8_t cus)
+{
+	RETURN_IF_TRUE_ERRNO(((h == NULL) || (key == NULL)), NULL, -EINVAL);
+
+	return __rte_hlist_add_key_with_hash_data(h, key,
+			rte_hlist_hash(h, key), data, len, cus);
+}
+
+struct rte_hlist_data_element *
+rte_hlist_add_key_with_hash_data(struct rte_hlist_table *h,
+	const void *key, hash_sig_t sig, void *data)
+{
+	RETURN_IF_TRUE_ERRNO(((h == NULL) || (key == NULL)), NULL, -EINVAL);
+
+	return __rte_hlist_add_key_with_hash_data(h, key, sig, data, 0, TRUE);
+}
+
+struct rte_hlist_data_element *
+rte_hlist_add_key_with_hash_data_len(
+	struct rte_hlist_table *h, const void *key, hash_sig_t sig,
+	void *data, uint16_t len, uint8_t cus)
+{
+	RETURN_IF_TRUE_ERRNO(((h == NULL) || (key == NULL)), NULL, -EINVAL);
+
+	return __rte_hlist_add_key_with_hash_data(h, key, sig, data, len, cus);
+}
+
+static inline int
+__rte_hlist_del_key_with_hash_return_data(
+	struct rte_hlist_table *h, const void *key,
+	hash_sig_t sig, void **data)
+{
+	struct rte_hlist_node_entry *pos;
+	int idx = __rte_hlist_find_node_entry(h, key, sig, &pos);
+	if (idx < 0)
+		return idx;
+
+	*data = NULL;
+	if (HLIST_DATA_CUSTOM_EXTRA_DATA & pos->d.flags) {
+		*data = pos->d.extra_data;
+		h->custom_entries--;
+	} else if (HLIST_DATA_IS_ALLOCATED(&pos->d)) {
+		rte_free(pos->d.extra_data);
+	}
+
+	LIST_REMOVE(pos, next);
+	rte_free(pos);
+	h->t[idx].entries_in_bucket--;
+
+	return 0;
+}
+
+static inline int
+__rte_hlist_del_key_with_hash(struct rte_hlist_table *h,
+	const void *key, hash_sig_t sig)
+{
+	struct rte_hlist_node_entry *pos;
+	int idx;
+
+	/* Currently there are some 'false positives' that extend the heads
+	 * array and split this list. E.g. only one entry of the list might
+	 * be moved to another list after extending, and if it is the one
+	 * being removed, the 'brother' list will be left empty after the
+	 * deletion. The heads array is nevertheless extended before the
+	 * search. This is not a bug, just not graceful enough right now.
+	 * (Other hash tables may also face issues with this scenario.)
+	 */
+	idx = __rte_hlist_find_node_entry(h, key, sig, &pos);
+	if (idx < 0)
+		return idx;
+
+	if (HLIST_DATA_CUSTOM_EXTRA_DATA & pos->d.flags) {
+		if (h->free_func == NULL) {
+			rte_errno = EBUSY;
+			return -rte_errno;
+		}
+		h->free_func(pos->d.extra_data);
+		h->custom_entries--;
+	} else if (HLIST_DATA_IS_ALLOCATED(&pos->d)) {
+		rte_free(pos->d.extra_data);
+	}
+
+	LIST_REMOVE(pos, next);
+	rte_free(pos);
+	h->t[idx].entries_in_bucket--;
+
+	return 0;
+}
+
+static inline int
+__rte_hlist_del_entry_fast_return_data(struct rte_hlist_table *h,
+	struct rte_hlist_data_element *de, void **data)
+{
+	struct rte_hlist_node_entry *pos;
+	struct rte_hlist_head_entry *he;
+	uint32_t idx = de->sig & h->bucket_mask;
+	int ret;
+
+	pos = container_of(de, struct rte_hlist_node_entry, d);
+	he = &h->t[idx];
+
+	/* Splitting will ensure that the head and the first element pointers
+	 * are consistent, or else the deletion will cause memory corruption.
+	 */
+	if (he->bucket_shift < h->bucket_shift) {
+		uint32_t p_idx = __rte_hlist_find_pre_unsplit_list(h, idx);
+
+		ret = __rte_hlist_split_one_list(h, p_idx);
+		if (ret < 0) {
+			RTE_LOG(ERR, HASH,
+				"Failed to split the bucket %d\n", ret);
+			return ret;
+		}
+	}
+
+	if (HLIST_DATA_CUSTOM_EXTRA_DATA & pos->d.flags) {
+		*data = pos->d.extra_data;
+		h->custom_entries--;
+	} else if (HLIST_DATA_IS_ALLOCATED(&pos->d)) {
+		rte_free(pos->d.extra_data);
+	}
+	LIST_REMOVE(pos, next);
+	rte_free(pos);
+	he->entries_in_bucket--;
+
+	return 0;
+}
+
+int
+rte_hlist_del_key(struct rte_hlist_table *h, const void *key)
+{
+	RETURN_IF_TRUE(((h == NULL) || (key == NULL)), -EINVAL);
+
+	return __rte_hlist_del_key_with_hash(h, key, rte_hlist_hash(h, key));
+}
+
+int
+rte_hlist_del_key_return_data(struct rte_hlist_table *h,
+	const void *key, void **data)
+{
+	RETURN_IF_TRUE(((h == NULL) || (key == NULL) || (data == NULL)),
+			-EINVAL);
+
+	return __rte_hlist_del_key_with_hash_return_data(h, key,
+			rte_hlist_hash(h, key), data);
+}
+
+int
+rte_hlist_del_entry_fast_return_data(struct rte_hlist_table *h,
+	struct rte_hlist_data_element *de, void **data)
+{
+	RETURN_IF_TRUE(((h == NULL) || (de == NULL) || (data == NULL)),
+			-EINVAL);
+
+	return __rte_hlist_del_entry_fast_return_data(h, de, data);
+}
+
+struct rte_hlist_data_element *
+rte_hlist_lookup_with_hash(struct rte_hlist_table *h,
+	const void *key, hash_sig_t sig)
+{
+	struct rte_hlist_node_entry *p;
+	int idx;
+
+	idx = __rte_hlist_find_node_entry(h, key, sig, &p);
+	if (idx < 0)
+		return NULL;
+
+	return &p->d;
+}
+
+struct rte_hlist_data_element *
+rte_hlist_lookup(struct rte_hlist_table *h, const void *key)
+{
+	RETURN_IF_TRUE_ERRNO(((h == NULL) || (key == NULL)), NULL, -EINVAL);
+
+	return rte_hlist_lookup_with_hash(h, key, rte_hlist_hash(h, key));
+}
+
+static inline int
+__rte_hlist_clear_all_entries(struct rte_hlist_table *h, uint8_t force)
+{
+	uint32_t size;
+	uint32_t i;
+	struct rte_hlist_head_entry *he;
+	struct rte_hlist_node_entry *pos;
+	struct rte_hlist_node_entry *tp;
+
+	if (h == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	if ((force == FALSE) && (h->custom_entries) && (h->free_func == NULL)) {
+		rte_errno = EBUSY;
+		return -rte_errno;
+	}
+
+	size = h->bucket_mask + 1;
+	for (i = 0; i < size; i++) {
+		he = &h->t[i];
+		if (he->entries_in_bucket > 0) {
+			LIST_MOVE_TO_NEW_HEAD(&he->head, &he->head, next);
+			LIST_FOREACH_SAFE(pos, &he->head, next, tp) {
+				if (HLIST_DATA_IS_ALLOCATED(&pos->d)) {
+					rte_free(pos->d.extra_data);
+				} else if (HLIST_DATA_CUSTOM_EXTRA_DATA &
+							pos->d.flags) {
+					h->custom_entries--;
+					if (h->free_func != NULL)
+						h->free_func(
+							pos->d.extra_data);
+				}
+				LIST_REMOVE(pos, next);
+				rte_free(pos);
+				he->entries_in_bucket--;
+			}
+		}
+	}
+
+	return 0;
+}
+
+int
+rte_hlist_clear_all_entries_with_cb(struct rte_hlist_table *h,
+	rte_hlist_free_fn fn)
+{
+	rte_hlist_free_fn saved_fn;
+	int ret;
+
+	RETURN_IF_TRUE((h == NULL), -EINVAL);
+	saved_fn = h->free_func;
+	h->free_func = fn;
+	ret = __rte_hlist_clear_all_entries(h, TRUE);
+	h->free_func = saved_fn;
+
+	return ret;
+}
+
+struct rte_hlist_table *
+rte_hlist_create(const struct rte_hlist_params *params)
+{
+	struct rte_hlist_table *ht = NULL;
+	struct rte_tailq_entry *te;
+	char hash_name[RTE_HLIST_NAMESIZE];
+	struct rte_hlist_list *hlist_list;
+	uint32_t table_size;
+
+	hlist_list = RTE_TAILQ_CAST(rte_hlist_tailq.head,
+				       rte_hlist_list);
+
+	/* Error checking of parameters. */
+	if ((!rte_is_power_of_2(params->entries)) ||
+			(!rte_is_power_of_2(params->entries_per_bucket)) ||
+			(params->entries == 0) ||
+			(params->entries_per_bucket == 0) ||
+			(params->entries_per_bucket > params->entries) ||
+			(params->entries > RTE_HLIST_ENTRIES_MAX) ||
+			(params->entries_per_bucket >
+				RTE_HLIST_ENTRIES_PER_BUCKET_MAX)) {
+		rte_errno = EINVAL;
+		return NULL;
+	}
+
+	snprintf(hash_name, sizeof(hash_name), "HLIST_%s", params->name);
+
+	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+	/* guarantee there is no existing table with the same name */
+	TAILQ_FOREACH(te, hlist_list, next) {
+		ht = (struct rte_hlist_table *)te->data;
+		if (strncmp(params->name, ht->name, RTE_HLIST_NAMESIZE) == 0)
+			break;
+	}
+	ht = NULL;
+	if (te != NULL) {
+		rte_errno = EEXIST;
+		goto exit;
+	}
+
+	te = rte_zmalloc("HLIST_TAILQ_ENTRY",
+			sizeof(struct rte_tailq_entry), 0);
+	if (te == NULL) {
+		RTE_LOG(ERR, HASH,
+			"Failed to allocate tailq entry for hlist\n");
+		goto exit;
+	}
+
+	/* Allocate memory for table. */
+	ht = (struct rte_hlist_table *)rte_zmalloc_socket(hash_name,
+			sizeof(struct rte_hlist_table), 0, params->socket_id);
+	if (ht == NULL) {
+		RTE_LOG(ERR, HASH, "Failed to allocate hlist table\n");
+		rte_free(te);
+		goto exit;
+	}
+
+	table_size = params->entries / params->entries_per_bucket;
+	ht->t = rte_zmalloc_socket(NULL,
+			sizeof(struct rte_hlist_head_entry) * table_size,
+			0, params->socket_id);
+	if (ht->t == NULL) {
+		RTE_LOG(ERR, HASH, "Failed to allocate hlist head table\n");
+		rte_free(te);
+		rte_free(ht);
+		goto exit;
+	}
+
+	/* Because HLIST_HEAD_INIT is to initialize the value to zero, skip it
+	 * here to accelerate the initialization stage.
+	 */
+
+	/* Set up hash table context. */
+	snprintf(ht->name, sizeof(ht->name), "%s", params->name);
+	ht->entries = params->entries;
+	ht->entries_per_bucket = params->entries_per_bucket;
+	/* since table size is a power of 2 */
+	ht->bucket_mask = table_size - 1;
+	ht->key_len = params->key_len;
+
+	if (params->hash_func != NULL) {
+		ht->hash_func = params->hash_func;
+		ht->init_val = params->init_val;
+	} else {
+		ht->hash_func = RTE_HLIST_FUNC_DEFAULT;
+		ht->init_val = RTE_HLIST_INIT_VAL_DEFAULT;
+	}
+
+	/* not mandatory for the free function */
+	ht->free_func = params->free_func;
+
+	te->data = (void *)ht;
+
+	TAILQ_INSERT_TAIL(hlist_list, te, next);
+
+exit:
+	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+	return ht;
+}
+
+int rte_hlist_free(struct rte_hlist_table *h)
+{
+	struct rte_tailq_entry *te;
+	struct rte_hlist_list *hlist_list;
+
+	if (h == NULL) {
+		rte_errno = EINVAL;
+		return -rte_errno;
+	}
+
+	hlist_list = RTE_TAILQ_CAST(rte_hlist_tailq.head, rte_hlist_list);
+
+	rte_rwlock_write_lock(RTE_EAL_TAILQ_RWLOCK);
+
+	/* find out tailq entry */
+	TAILQ_FOREACH(te, hlist_list, next) {
+		if (te->data == (void *)h)
+			break;
+	}
+
+	if (te == NULL) {
+		rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+		rte_errno = EEXIST;
+		return -rte_errno;
+	}
+
+	TAILQ_REMOVE(hlist_list, te, next);
+
+	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
+
+	(void)__rte_hlist_clear_all_entries(h, TRUE);
+	rte_free(h);
+	rte_free(te);
+
+	return 0;
+}
+
diff --git a/lib/librte_hash/rte_hlist.h b/lib/librte_hash/rte_hlist.h
new file mode 100644
index 0000000..8fd1b6a
--- /dev/null
+++ b/lib/librte_hash/rte_hlist.h
@@ -0,0 +1,563 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright 2019 Mellanox Technologies, Ltd
+ */
+
+#ifndef _RTE_HLIST_H_
+#define _RTE_HLIST_H_
+
+/**
+ * @file
+ *
+ * RTE Hash List
+ */
+
+#include <stdint.h>
+#include <errno.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <rte_config.h>
+#ifndef RTE_HLIST_FUNC_DEFAULT
+#if defined(RTE_ARCH_X86) || defined(RTE_MACHINE_CPUFLAG_CRC32)
+#include <rte_hash_crc.h>
+/** Default hlist hash function if none is specified. */
+#define RTE_HLIST_FUNC_DEFAULT		rte_hash_crc
+#else
+#include <rte_jhash.h>
+#define RTE_HLIST_FUNC_DEFAULT		rte_jhash
+#endif
+#endif
+
+/* To enable the deletion when iterating the list */
+#ifndef LIST_FOREACH_SAFE
+#define LIST_FOREACH_SAFE(var, head, field, tvar)			\
+	for ((var) = ((head)->lh_first);				\
+		(var) && ((tvar) = ((var)->field.le_next), 1);		\
+		(var) = (tvar))
+#endif
+
+/* To move the whole list from one head to another */
+#define LIST_MOVE_TO_NEW_HEAD(new, old, field) do {			\
+	(new)->lh_first = (old)->lh_first;				\
+	if (((new)->lh_first) != NULL)					\
+		(new)->lh_first->field.le_prev = &(new)->lh_first;	\
+} while (/*CONSTCOND*/0)
+
+#ifndef FALSE
+#define FALSE 0
+#endif
+#ifndef TRUE
+#define TRUE 1
+#endif
+
+#ifndef RTE_HLIST_INIT_VAL_DEFAULT
+/** Initialising value used when calculating hash. */
+#define RTE_HLIST_INIT_VAL_DEFAULT	0xFFFFFFFF
+#endif
+
+/* Macro to enable/disable run-time checking of function parameters */
+#if defined(RTE_LIBRTE_HASH_DEBUG)
+#define RETURN_IF_TRUE(cond, retval) do {				\
+	if (cond)							\
+		return retval;						\
+} while (0)
+
+#define RETURN_IF_TRUE_ERRNO(cond, retval, err) do {			\
+		if (cond) {						\
+			rte_errno = err;				\
+			return retval;					\
+		}							\
+	} while (0)
+#else
+#define RETURN_IF_TRUE(cond, retval)
+#define RETURN_IF_TRUE_ERRNO(cond, retval, err)
+#endif
+
+/** Maximum size of string for naming the hlist table. */
+#define RTE_HLIST_NAMESIZE			32
+
+/** The maximum number of entries in the hlist table that is supported. */
+#define RTE_HLIST_ENTRIES_MAX			(1 << 24)
+
+/** The maximum number of entries in each list that is supported. */
+#define RTE_HLIST_ENTRIES_PER_BUCKET_MAX	128
+
+/** Calculate the previous mask before splitting */
+#define HLIST_CALC_PREVIOUS_BUCKET_MASK(v, s)	((((v)+1) >> (s)) - 1)
+
+/** Flag bits for the data stored */
+#define HLIST_DATA_NEW_ENTRY			(0x80000000)
+#define HLIST_DATA_CUSTOM_EXTRA_DATA		(0x40000000)
+#define HLIST_DATA_INLINE			(0x20000000)
+#define HLIST_DATA_ALLOC_WITH_SIZE(s)		\
+					(0x10000000 | ((0x0fffffff) & (s)))
+#define HLIST_DATA_IS_ALLOCATED(d)		(!!((d)->flags & 0x10000000))
+#define HLIST_DATA_NOT_EXIST			(0x00000000)
+
+/** Signature of key that is stored internally. */
+typedef uint32_t hash_sig_t;
+
+/** Type of function that can be used for calculating the hash value. */
+typedef uint32_t (*rte_hlist_calc_fn)(const void *key, uint32_t key_len,
+					uint32_t init_val);
+
+/** Type of function that is used to free the data from customer. */
+typedef void (*rte_hlist_free_fn)(void *p);
+
+/** Parameters used when creating hash list table. */
+struct rte_hlist_params {
+	const char *name;		/**< Name of the hash table. */
+	uint32_t entries;		/**< Total number of entries. */
+	uint32_t entries_per_bucket;	/**< Number of entries in a bucket. */
+	/**< Length of the key to be hashed. */
+	uint32_t key_len;
+	int socket_id;			/**< Socket to allocate memory on. */
+	rte_hlist_calc_fn hash_func;	/**< The hash calculation function. */
+	/**< The function to free the custom data. */
+	rte_hlist_free_fn free_func;
+	uint32_t init_val;		/**< For initializing hash function. */
+};
+
+/** Data element structure for future use */
+struct rte_hlist_data_element {
+	/**< Union for data, pointer or inline */
+	union {
+		void *extra_data;
+		uint64_t v64;
+		uint32_t v32;
+		uint16_t v16;
+		uint8_t v8;
+	};
+	uint32_t sig;			/**< Calculated hash value. */
+	uint32_t flags;			/**< Flag bits of the data store. */
+};
+
+/** Node element structure on the LIST of the link */
+struct rte_hlist_node_entry {
+	LIST_ENTRY(rte_hlist_node_entry) next;	/**< Next element pointer. */
+	/**< Data element inside this node. */
+	struct rte_hlist_data_element d;
+	char key[];				/**< Copied and stored key. */
+};
+
+/** Head of all the nodes with the same hash value */
+struct rte_hlist_head_entry {
+	LIST_HEAD(, rte_hlist_node_entry) head;	/**< Head for each hash list. */
+	/**< Current items in the list. */
+	uint16_t entries_in_bucket;
+	/**< Shift number for extension */
+	uint16_t bucket_shift;
+};
+
+/** The hlist table structure. */
+struct rte_hlist_table {
+	char name[RTE_HLIST_NAMESIZE];	/**< Name of the hash. */
+	uint32_t entries;		/**< Total number of entries. */
+	uint32_t entries_per_bucket;	/**< Number of entries in a list. */
+	/**< Number of entries with data from customer. */
+	uint32_t custom_entries;
+	uint16_t key_len;		/**< Length of the key. */
+	/**< Shift number of the whole table. */
+	uint16_t bucket_shift;
+	/**< To find which list the key is in. */
+	uint32_t bucket_mask;
+	rte_hlist_calc_fn hash_func;	/**< The hash calculation function. */
+	/**< The function to free the custom data. */
+	rte_hlist_free_fn free_func;
+	uint32_t init_val;		/**< For initializing hash function. */
+	/**< Reserved for fast shrinking of the table. */
+	char *map;
+	/**< A flat and extendible table of all lists. */
+	struct rte_hlist_head_entry *t;
+};
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Calculate a hash value from a key.
+ * This operation is not multi-thread safe.
+ *
+ * @param h
+ *   Hlist table to look in.
+ * @param key
+ *   Key to calculate the hash value of.
+ * @return
+ *   - hash value
+ */
+__rte_experimental
+hash_sig_t
+rte_hlist_hash(const struct rte_hlist_table *h, const void *key);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add a key to an existing hash list without data, or return the existing
+ * element associated with the key. This operation is not multi-thread safe
+ * and should only be called from one thread.
+ *
+ * @param h
+ *   Hlist table to add the key to.
+ * @param key
+ *   Key to add to the hlist table.
+ * @return
+ *   - NULL pointer if failed to add the key, check the variable rte_errno for
+ *     more error information.
+ *   - A data element pointer with useful information for future use.
+ */
+__rte_experimental
+struct rte_hlist_data_element *
+rte_hlist_add_key(struct rte_hlist_table *h, const void *key);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add a key to an existing hash list with customer data, or return the existing
+ * element associated with the key (data will be ignored). This operation is not
+ * multi-thread safe and should only be called from one thread.
+ *
+ * @param h
+ *   Hlist table to add the key to.
+ * @param key
+ *   Key to add to the hlist table.
+ * @param data
+ *   Customer data pointer, and it is the caller's responsibility not to release
+ *   it if the data area will be accessed later.
+ * @return
+ *   - NULL pointer if failed to add the key, check the variable rte_errno for
+ *     more error information.
+ *   - A data element pointer with useful information for future use.
+ */
+__rte_experimental
+struct rte_hlist_data_element *
+rte_hlist_add_key_data(struct rte_hlist_table *h, const void *key, void *data);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add a key to an existing hash list with customer data, or return the existing
+ * element associated with the key (data will be ignored). This operation is not
+ * multi-thread safe and should only be called from one thread.
+ *
+ * @param h
+ *   Hlist table to add the key to.
+ * @param key
+ *   Key to add to the hlist table.
+ * @param data
+ *   Customer data pointer, and it is the caller's responsibility not to release
+ *   it if the data area will be accessed later.
+ * @param len
+ *   The length of the customer's data.
+ * @param cus
+ *   The allocation attribute of the data. If yes, it is the customer's
+ *   responsibility to hold/release the data, or else, a copy of the data will
+ *   be held and the data pointer is free to release or re-use.
+ * @return
+ *   - NULL pointer if failed to add the key, check the variable rte_errno for
+ *     more error information.
+ *   - A data element pointer with useful information for future use.
+ */
+__rte_experimental
+struct rte_hlist_data_element *
+rte_hlist_add_key_data_len(struct rte_hlist_table *h, const void *key,
+				void *data, uint16_t len, uint8_t cus);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add a key to an existing hash list with pre-calculated hash signature and
+ * customer data, or return the existing element associated with the key (data
+ * will be ignored). This operation is not multi-thread safe and should only
+ * be called from one thread.
+ *
+ * @param h
+ *   Hlist table to add the key to.
+ * @param key
+ *   Key to add to the hlist table.
+ * @param sig
+ *   Precomputed hash value for 'key'.
+ * @param data
+ *   Customer data pointer, and it is the caller's responsibility not to release
+ *   it if the data area will be accessed later.
+ * @return
+ *   - NULL pointer if failed to add the key, check the variable rte_errno for
+ *     more error information.
+ *   - A data element pointer with useful information for future use.
+ */
+__rte_experimental
+struct rte_hlist_data_element *
+rte_hlist_add_key_with_hash_data(struct rte_hlist_table *h, const void *key,
+				hash_sig_t sig, void *data);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Add a key to an existing hash list with pre-calculated hash signature and
+ * customer data, or return the existing element associated with the key (data
+ * will be ignored). This operation is not multi-thread safe and should only
+ * be called from one thread.
+ *
+ * @param h
+ *   Hlist table to add the key to.
+ * @param key
+ *   Key to add to the hlist table.
+ * @param sig
+ *   Precomputed hash value for 'key'.
+ * @param data
+ *   Customer data pointer.
+ * @param len
+ *   The length of the customer's data.
+ * @param cus
+ *   The allocation attribute of the data. If yes, it is the customer's
+ *   responsibility to hold/release the data, or else, a copy of the data will
+ *   be held and the data pointer is free to release or re-use.
+ * @return
+ *   - NULL pointer if failed to add the key, check the variable rte_errno for
+ *     more error information.
+ *   - A data element pointer with useful information for future use.
+ */
+__rte_experimental
+struct rte_hlist_data_element *
+rte_hlist_add_key_with_hash_data_len(struct rte_hlist_table *h, const void *key,
+			hash_sig_t sig, void *data, uint16_t len, uint8_t cus);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Remove a key from an existing hlist table. This operation is not multi-thread
+ * safe and should only be called from one thread.
+ *
+ * @param h
+ *   Hlist table to remove the key from.
+ * @param key
+ *   Key to remove from the hlist table.
+ * @return
+ *   - -EINVAL if the parameters are invalid.
+ *   - -ENOMEM if failed to extend the lists head.
+ *   - -ENOENT if the key is not found.
+ *   - -EBUSY if there is custom data but no custom free function.
+ *   - zero for success.
+ */
+__rte_experimental
+int
+rte_hlist_del_key(struct rte_hlist_table *h, const void *key);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Remove a key from an existing hlist table, and return the data pointer to the
+ * customer if any but without trying to free it. This operation is not
+ * multi-thread safe and should only be called from one thread.
+ *
+ * @param h
+ *   Hlist table to remove the key from.
+ * @param key
+ *   Key to remove from the hlist table.
+ * @param data
+ *   Output containing a pointer to the data.
+ * @return
+ *   - -EINVAL if the parameters are invalid.
+ *   - -ENOMEM if failed to extend the lists head.
+ *   - -ENOENT if the key is not found.
+ *   - zero for success.
+ */
+__rte_experimental
+int
+rte_hlist_del_key_return_data(struct rte_hlist_table *h, const void *key,
+				void **data);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Remove an entry from an existing hlist table, and return the data pointer to
+ * the customer if any but without trying to free it. User should ensure the
+ * integrity of the entry. This operation is not multi-thread safe and should
+ * only be called from one thread.
+ *
+ * @param h
+ *   Hlist table to remove the key from.
+ * @param de
+ *   Data element to be removed from the hlist table.
+ * @param data
+ *   Output containing a pointer to the data.
+ * @return
+ *   - -EINVAL if the parameters are invalid.
+ *   - zero for success.
+ */
+__rte_experimental
+int
+rte_hlist_del_entry_fast_return_data(struct rte_hlist_table *h,
+			struct rte_hlist_data_element *de, void **data);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Find a key in the hlist table.
+ * This operation is not multi-thread safe.
+ *
+ * @param h
+ *   Hlist table to look in.
+ * @param key
+ *   Key to find.
+ * @return
+ *   - NULL if failed to search the key in the hlist. Check the rte_errno for
+ *     more error information.
+ *     -EINVAL if the parameters are invalid.
+ *     -ENOMEM if failed to extend the lists head.
+ *     -ENOENT if the key is not found.
+ *   - A data element pointer with useful information for future use.
+ */
+__rte_experimental
+struct rte_hlist_data_element *
+rte_hlist_lookup(struct rte_hlist_table *h, const void *key);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Find a key in the hlist table.
+ * This operation is not multi-thread safe.
+ *
+ * @param h
+ *   Hlist table to look in.
+ * @param key
+ *   Key to find.
+ * @param sig
+ *   Precomputed hash value for 'key'.
+ * @return
+ *   - NULL if failed to search the key in the hlist. Check the rte_errno for
+ *     more error information.
+ *     -EINVAL if the parameters are invalid.
+ *     -ENOMEM if failed to extend the lists head.
+ *     -ENOENT if the key is not found.
+ *   - A data element pointer with useful information for future use.
+ */
+__rte_experimental
+struct rte_hlist_data_element *
+rte_hlist_lookup_with_hash(struct rte_hlist_table *h, const void *key,
+				hash_sig_t sig);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Check if the data entry is a new one.
+ * This operation is not multi-thread safe.
+ *
+ * @param d
+ *   Pointer to the data entry.
+ * @return
+ *   1 if the entry is new, 0 otherwise.
+ */
+__rte_experimental
+static inline uint32_t
+rte_hlist_entry_is_new(struct rte_hlist_data_element *d)
+{
+	return !!(d->flags & HLIST_DATA_NEW_ENTRY);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Append customer data to the data element.
+ * This operation is not multi-thread safe.
+ *
+ * @param h
+ *   Hlist table to look in.
+ * @param d
+ *   Pointer to the data entry.
+ * @param data
+ *   Address of the data to be appended.
+ * @return
+ *   - -EINVAL if the parameters are invalid.
+ *   - -EEXIST if there is already some data.
+ *   - 0 if succeed.
+ */
+__rte_experimental
+static inline int
+rte_hlist_entry_append_custom_data(struct rte_hlist_table *h,
+		struct rte_hlist_data_element *d, void *data)
+{
+	if ((h == NULL) || (d == NULL) || (data == NULL))
+		return -EINVAL;
+
+	if ((d->flags & (~HLIST_DATA_NEW_ENTRY)) != HLIST_DATA_NOT_EXIST)
+		return -EEXIST;
+
+	d->extra_data = data;
+	d->flags |= HLIST_DATA_CUSTOM_EXTRA_DATA;
+	h->custom_entries++;
+	return 0;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Create a new hlist table. This operation is not multi-thread safe.
+ *
+ * @param params
+ *	 Parameters used in creation of hlist table.
+ *
+ * @return
+ *	 Pointer to hlist table structure that is used in future hlist table
+ *	 operations, or NULL on error with error code set in rte_errno.
+ */
+__rte_experimental
+struct rte_hlist_table *
+rte_hlist_create(const struct rte_hlist_params *params);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Free all memory used by a hlist table. Note that this will forcibly try
+ * to release all the customer's data before it cleans up the table memory.
+ *
+ * @param h
+ *   Hlist table to deallocate.
+ * @return
+ *   - 0 if succeed.
+ *   - -EINVAL if the parameters are invalid.
+ *   - -EEXIST if the hlist doesn't exist.
+ */
+__rte_experimental
+int rte_hlist_free(struct rte_hlist_table *h);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Free all entries in the hash list table, and the customer's data could be
+ * handled by the callback function (optional).
+ *
+ * @param h
+ *   Hlist table to deallocate.
+ * @param fn
+ *   Callback function for the customer data handling.
+ * @return
+ *   - 0 if succeed.
+ *   - -EINVAL if the parameters are invalid.
+ *   - -EBUSY if the hlist doesn't exist.
+ */
+__rte_experimental
+int rte_hlist_clear_all_entries_with_cb(struct rte_hlist_table *h,
+					rte_hlist_free_fn fn);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_HLIST_H_ */
-- 
1.8.3.1



* Re: [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib
  2019-08-28  6:51 ` [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib Bing Zhao
@ 2019-08-28 11:50   ` Stephen Hemminger
  2019-08-28 11:51   ` Stephen Hemminger
  2019-08-28 11:53   ` Stephen Hemminger
  2 siblings, 0 replies; 10+ messages in thread
From: Stephen Hemminger @ 2019-08-28 11:50 UTC (permalink / raw)
  To: Bing Zhao
  Cc: yipeng1.wang, sameh.gobriel, bruce.richardson, pablo.de.lara.guarch, dev

On Wed, 28 Aug 2019 14:51:49 +0800
Bing Zhao <bingz@mellanox.com> wrote:

> +
> +/* To enable the deletion when iterating the list */
> +#ifndef LIST_FOREACH_SAFE
> +#define LIST_FOREACH_SAFE(var, head, field, tvar)			\
> +	for ((var) = ((head)->lh_first);				\
> +		(var) && ((tvar) = ((var)->field.le_next), 1);		\
> +		(var) = (tvar))
> +#endif
> +
> +/* To move the whole list from one head to another */
> +#define LIST_MOVE_TO_NEW_HEAD(new, old, field) do {			\
> +	(new)->lh_first = (old)->lh_first;				\
> +	if (((new)->lh_first) != NULL)					\
> +		(new)->lh_first->field.le_prev = &(new)->lh_first;	\
> +} while (/*CONSTCOND*/0)
> +

Why not use BSD style lists, rather than reinventing it here.


* Re: [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib
  2019-08-28  6:51 ` [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib Bing Zhao
  2019-08-28 11:50   ` Stephen Hemminger
@ 2019-08-28 11:51   ` Stephen Hemminger
  2019-08-28 11:53   ` Stephen Hemminger
  2 siblings, 0 replies; 10+ messages in thread
From: Stephen Hemminger @ 2019-08-28 11:51 UTC (permalink / raw)
  To: Bing Zhao
  Cc: yipeng1.wang, sameh.gobriel, bruce.richardson, pablo.de.lara.guarch, dev

On Wed, 28 Aug 2019 14:51:49 +0800
Bing Zhao <bingz@mellanox.com> wrote:

> +#ifndef FALSE
> +#define FALSE 0
> +#endif
> +#ifndef TRUE
> +#define TRUE 1
> +#endif

Please use stdbool rather than reinventing boolean type.
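
I.e. something along these lines (just an illustration):

#include <stdbool.h>
#include <string.h>

static bool
keys_equal(const void *a, const void *b, size_t len)
{
	return memcmp(a, b, len) == 0;	/* bool, no TRUE/FALSE macros */
}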


* Re: [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib
  2019-08-28  6:51 ` [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib Bing Zhao
  2019-08-28 11:50   ` Stephen Hemminger
  2019-08-28 11:51   ` Stephen Hemminger
@ 2019-08-28 11:53   ` Stephen Hemminger
  2 siblings, 0 replies; 10+ messages in thread
From: Stephen Hemminger @ 2019-08-28 11:53 UTC (permalink / raw)
  To: Bing Zhao
  Cc: yipeng1.wang, sameh.gobriel, bruce.richardson, pablo.de.lara.guarch, dev

On Wed, 28 Aug 2019 14:51:49 +0800
Bing Zhao <bingz@mellanox.com> wrote:

> +
> +/** Node element structure on the LIST of the link */
> +struct rte_hlist_node_entry {
> +	LIST_ENTRY(rte_hlist_node_entry) next;	/**< Next element pointer. */
> +	/**< Data element inside this node. */
> +	struct rte_hlist_data_element d;
> +	char key[];				/**< Copied and stored key. */
> +};
> +
> +/** Head of all the nodes with the same hash value */
> +struct rte_hlist_head_entry {
> +	LIST_HEAD(, rte_hlist_node_entry) head;	/**< Head for each hash list. */
> +	/**< Current items in the list. */
> +	uint16_t entries_in_bucket;
> +	/**< Shift number for extension */
> +	uint16_t bucket_shift;
> +};
> +
> +/** The hlist table structure. */
> +struct rte_hlist_table {
> +	char name[RTE_HLIST_NAMESIZE];	/**< Name of the hash. */
> +	uint32_t entries;		/**< Total number of entries. */
> +	uint32_t entries_per_bucket;	/**< Number of entries in a list. */
> +	/**< Number of entries with data from customer. */
> +	uint32_t custom_entries;
> +	uint16_t key_len;		/**< Length of the key. */
> +	/**< Shift number of the whole table. */
> +	uint16_t bucket_shift;
> +	/**< To find which list the key is in. */
> +	uint32_t bucket_mask;
> +	rte_hlist_calc_fn hash_func;	/**< The hash calculation function. */
> +	/**< The function to free the custom data. */
> +	rte_hlist_free_fn free_func;
> +	uint32_t init_val;		/**< For initializing hash function. */
> +	/**< Reserved for fast shrinking of the table. */
> +	char *map;

You probably should use void * for that.

> +	/**< A flat and extendible table of all lists. */
> +	struct rte_hlist_head_entry *t;
> +};
> +

Since API/ABI considerations are important, you will save yourself
a lot of pain if these structures can be made private and kept
only in rte_hlist.c.
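
For instance (a sketch only, not the actual patch code):

/* rte_hlist.h: expose only an opaque handle to applications */
struct rte_hlist_table;

struct rte_hlist_table *
rte_hlist_create(const struct rte_hlist_params *params);

/* rte_hlist.c: the full definition stays private to the library,
 * so its layout can change without breaking the ABI.
 */
struct rte_hlist_table {
	char name[RTE_HLIST_NAMESIZE];
	/* ... remaining fields ... */
};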


* Re: [dpdk-dev] [RFC] hash: introduce resizable hash list
  2019-08-28  6:51 [dpdk-dev] [RFC] hash: introduce resizable hash list Bing Zhao
  2019-08-28  6:51 ` [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib Bing Zhao
@ 2019-09-05 19:25 ` Wang, Yipeng1
  2019-09-12  5:41   ` Bing Zhao
  2023-06-12 16:44 ` Stephen Hemminger
  2 siblings, 1 reply; 10+ messages in thread
From: Wang, Yipeng1 @ 2019-09-05 19:25 UTC (permalink / raw)
  To: Bing Zhao, Gobriel, Sameh, Richardson, Bruce, De Lara Guarch, Pablo; +Cc: dev

>-----Original Message-----
>From: Bing Zhao [mailto:bingz@mellanox.com]
>Sent: Tuesday, August 27, 2019 11:52 PM
>To: Wang, Yipeng1 <yipeng1.wang@intel.com>; Gobriel, Sameh <sameh.gobriel@intel.com>; Richardson, Bruce
><bruce.richardson@intel.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>
>Cc: dev@dpdk.org; bingz@mellanox.com
>Subject: [RFC] hash: introduce resizable hash list
>
>The current hash library already provides two hash tables and
>several hash calculation functions.
>
>The FBK hash table is very lightweight, but the key length is
>limited to 4 bytes and the table size is fixed at startup.
>
>The cuckoo hash table is a great idea with a nice implementation in
>the library; it is efficient and flexible. After studying the code
>and some material from the internet, I find that table extension is
>a pain point (correct me if anything is wrong): the tables must be
>allocated in advance with a maximum size, so there is a chance that
>unused memory is wasted. Right now there is an extendable buckets
>table, but its size also seems to be fixed and the same as that of
>the primary table.
>Take flow information as an example: we may need to support as many
>as several million flows. In some cases only a few hundred flows
>will be created, but in others millions are needed, so the hash
>table must be created with room for millions of elements at startup.
>If one flow key consumes 64B, 1M flows already occupy more than
>64MB for the keys alone, and with the table fields included it
>exceeds one hundred MB of memory. Worse, if there are only 2M + a
>few elements, a table of 4M (or more, depending on the hash
>collision rate) must be created, which wastes memory.
>
>To handle this, a resizable hash list is introduced.
>The table can start with a small number of elements and grow
>dynamically at runtime. Meanwhile, an on-demand list splitting
>mechanism is used to avoid significant performance degradation:
>there is no need to re-hash and relocate all the existing elements
>in the table when it is extended.
>
>The brief design is as below:
>1. The table consists of a LIST header array and the LIST entries.
>   Each entry stores a pre-calculated hash signature, which decides
>   the header it is linked to: the signature is ANDed with a mask to
>   select the LSBs.
>2. The header array can start very small, and a configured maximum
>   length per list is used to decide whether a table extension is
>   needed.
>3. When the maximum is reached on any list, the header array size is
>   doubled. The entries linked to that list are then split into two
>   lists with the new mask (one more set bit, e.g. 0xfff -> 0x1fff).
>   A global shift number and a per-list local shift number are kept
>   for further checking.
>4. When another list is accessed, its shift number is compared with
>   the global one to check whether this list still needs splitting.
>   If so, there are two cases:
>   a. If the local shift number is non-zero, or only 1 less than the
>      global one, this list itself is the one to split.
>   b. If the gap is more than 1, the table has been extended more
>      than once. If the local shift is zero, a mechanism is used to
>      find the last unsplit list.
>   The list is then split into 2/4/8... lists, depending on the gap,
>   and all the entries are re-linked to the proper headers.
>So each time an hlist API is called, only one list is traversed, not
>all of them. And since a parameter caps the number of entries per
>list, the traversal time is predictable and will not cause a
>significant performance degradation.
>
>BR. Bing
>
>
>Bing Zhao (1):
>  rte_hash: introduce hash list into hash lib
>
> lib/librte_hash/Makefile             |   2 +
> lib/librte_hash/rte_hash_version.map |  15 +
> lib/librte_hash/rte_hlist.c          | 687 +++++++++++++++++++++++++++++++++++
> lib/librte_hash/rte_hlist.h          | 563 ++++++++++++++++++++++++++++
> 4 files changed, 1267 insertions(+)
> create mode 100644 lib/librte_hash/rte_hlist.c
> create mode 100644 lib/librte_hash/rte_hlist.h
>
>--
>1.8.3.1

[Wang, Yipeng] 

Hi, Bing,

Table resizing is very important and I think it is a critical feature to have in
DPDK's hash library. Thanks a lot for putting effort into this, and I hope we
could work out a solution based on your patch.
 
My major concern with the current implementation is the following:
 
Although we still have the fbk hash table implementation, I think the goal here
is to have users move to the main hash table implementation (the cuckoo
algorithm), since it is the better one in most use cases. Many optimizations
have been done on the current code base since the cuckoo hash was introduced;
we do not want to reinvent the wheel for things such as multi-threading
support, non-TSX machine support, SIMD optimizations, etc. Also, compared to a
linked list, the current bucketized table should provide better lookup
performance. Last but not least, we probably don't want fragmented hash table APIs.
 
I would suggest seeing if we could add the resizing feature to the current hash
table implementation first. For example, we could use the current bucketized
design to do the resizing rather than the new linked-list-based data structure.
We would need some modifications, such as having a local bucket_shift variable
in each bucket, but I think it is feasible; a rough sketch follows. We could
also turn off the cuckoo path/extendable linked list of the current
implementation when resizing is needed, so that the implementation would be
simpler. That way, we keep the current API and also reuse much of the existing
code.
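
For example, something like this (a rough sketch only; the names are
hypothetical and not the actual rte_hash structures):

#include <stdint.h>

struct resizable_bucket {
	/* ... the existing per-bucket fields (signatures, key slots) ... */
	uint16_t bucket_shift;	/* local shift of this bucket */
};

struct resizable_table {
	struct resizable_bucket *buckets;	/* doubled on resize */
	uint32_t bucket_mask;	/* selects a bucket from a hash value */
	uint16_t table_shift;	/* global shift of the whole table */
};

/* A bucket needs a lazy split when its local shift lags behind the
 * table-wide shift after the bucket array has been doubled.
 */
static inline int
bucket_needs_split(const struct resizable_table *t, uint32_t idx)
{
	return t->buckets[idx].bucket_shift < t->table_shift;
}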

Also, I wonder if you have a specific use case that benefits from the gradual
break-down approach, given that we could always stop the world and resize the
table at once?


* Re: [dpdk-dev] [RFC] hash: introduce resizable hash list
  2019-09-05 19:25 ` [dpdk-dev] [RFC] hash: introduce resizable hash list Wang, Yipeng1
@ 2019-09-12  5:41   ` Bing Zhao
  2019-09-21  1:16     ` Wang, Yipeng1
  0 siblings, 1 reply; 10+ messages in thread
From: Bing Zhao @ 2019-09-12  5:41 UTC (permalink / raw)
  To: Wang, Yipeng1, Gobriel, Sameh, Richardson, Bruce, De Lara Guarch, Pablo
  Cc: dev

Hi Yipeng,

> -----Original Message-----
> From: Wang, Yipeng1 <yipeng1.wang@intel.com>
> Sent: Friday, September 6, 2019 3:26 AM
> To: Bing Zhao <bingz@mellanox.com>; Gobriel, Sameh
> <sameh.gobriel@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Cc: dev@dpdk.org
> Subject: RE: [RFC] hash: introduce resizable hash list
> 
> >-----Original Message-----
> >From: Bing Zhao [mailto:bingz@mellanox.com]
> >Sent: Tuesday, August 27, 2019 11:52 PM
> >To: Wang, Yipeng1 <yipeng1.wang@intel.com>; Gobriel, Sameh
> ><sameh.gobriel@intel.com>; Richardson, Bruce
> ><bruce.richardson@intel.com>; De Lara Guarch, Pablo
> ><pablo.de.lara.guarch@intel.com>
> >Cc: dev@dpdk.org; bingz@mellanox.com
> >Subject: [RFC] hash: introduce resizable hash list
> >
> >The current hash library already provides two hash tables and
> >several hash calculation functions.
> >
> >The FBK hash table is very lightweight, but the key length is
> >limited to 4 bytes and the table size is fixed at startup.
> >
> >The cuckoo hash table is a great idea with a nice implementation in
> >the library; it is efficient and flexible. After studying the code
> >and some material from the internet, I find that table extension is
> >a pain point (correct me if anything is wrong): the tables must be
> >allocated in advance with a maximum size, so there is a chance that
> >unused memory is wasted. Right now there is an extendable buckets
> >table, but its size also seems to be fixed and the same as that of
> >the primary table.
> >Take flow information as an example: we may need to support as many
> >as several million flows. In some cases only a few hundred flows
> >will be created, but in others millions are needed, so the hash
> >table must be created with room for millions of elements at startup.
> >If one flow key consumes 64B, 1M flows already occupy more than
> >64MB for the keys alone, and with the table fields included it
> >exceeds one hundred MB of memory. Worse, if there are only 2M + a
> >few elements, a table of 4M (or more, depending on the hash
> >collision rate) must be created, which wastes memory.
> >
> >To handle this, a resizable hash list is introduced.
> >The table can start with a small number of elements and grow
> >dynamically at runtime. Meanwhile, an on-demand list splitting
> >mechanism is used to avoid significant performance degradation:
> >there is no need to re-hash and relocate all the existing elements
> >in the table when it is extended.
> >
> >The brief design is as below:
> >1. The table consists of a LIST header array and the LIST entries.
> >   Each entry stores a pre-calculated hash signature, which decides
> >   the header it is linked to: the signature is ANDed with a mask to
> >   select the LSBs.
> >2. The header array can start very small, and a configured maximum
> >   length per list is used to decide whether a table extension is
> >   needed.
> >3. When the maximum is reached on any list, the header array size is
> >   doubled. The entries linked to that list are then split into two
> >   lists with the new mask (one more set bit, e.g. 0xfff -> 0x1fff).
> >   A global shift number and a per-list local shift number are kept
> >   for further checking.
> >4. When another list is accessed, its shift number is compared with
> >   the global one to check whether this list still needs splitting.
> >   If so, there are two cases:
> >   a. If the local shift number is non-zero, or only 1 less than the
> >      global one, this list itself is the one to split.
> >   b. If the gap is more than 1, the table has been extended more
> >      than once. If the local shift is zero, a mechanism is used to
> >      find the last unsplit list.
> >   The list is then split into 2/4/8... lists, depending on the gap,
> >   and all the entries are re-linked to the proper headers.
> >So each time an hlist API is called, only one list is traversed, not
> >all of them. And since a parameter caps the number of entries per
> >list, the traversal time is predictable and will not cause a
> >significant performance degradation.
> >
> >BR. Bing
> >
> >
> >Bing Zhao (1):
> >  rte_hash: introduce hash list into hash lib
> >
> > lib/librte_hash/Makefile             |   2 +
> > lib/librte_hash/rte_hash_version.map |  15 +
> > lib/librte_hash/rte_hlist.c          | 687 +++++++++++++++++++++++++++++++++++
> > lib/librte_hash/rte_hlist.h          | 563 ++++++++++++++++++++++++++++
> > 4 files changed, 1267 insertions(+)
> > create mode 100644 lib/librte_hash/rte_hlist.c
> > create mode 100644 lib/librte_hash/rte_hlist.h
> >
> >--
> >1.8.3.1
> 
> [Wang, Yipeng]
> 
> Hi, Bing,
> 
> Table resizing is very important and I think it is a critical feature to have
> in DPDK's hash library. Thanks a lot for putting effort on this and I hope
> we could work out a solution based on your patch.
> 

Thanks. To my understanding, yes, it is very important. In particular, there
are cases where the maximum capacity is quite large but only a few
elements exist during one runtime life cycle.

> My major concern of the current implementation is as following:
> 
> Although we still have the fbk hash table implementation, I think the
> goal here is to have the users to move to the main hash table
> implementation (the cuckoo algorithm), since it is the better one in
> most use cases. There have been many optimizations done on the
> current code base since the cuckoo hash was introduced, and we do
> not want to reinvent the wheel for things such as multi-threading
> support, non-TSO machine support, SIMD optimizations, etc. Also,
> compared to a linked list, the current bucketized table should
> provide better lookup performance. Last but not least, we probably
> don't want fragmented hash table APIs.
> 

Yes, I see and I agree. I have done some basic study of the cuckoo hash,
and it is quite a nice implementation, from the paper to the engineering.
A lot of optimizations were done on both x86_64 and aarch64. A single
solution will be much easier and less problematic for the users and
maintainers.

> I would suggest seeing first whether we could add a resizing feature
> to the current hash table implementation.
> For example, we can use the current bucketized design to do the
> resizing rather than the new linked-list-based data structure.
> We do need some modifications, such as having a local bucket_shift
> variable per bucket, but I think it is feasible. We could also turn off
> the cuckoo path/extendable linked list of the current implementation
> when resizing is needed, so that the implementation would be
> simpler.
> This way, we keep the current API and also reuse much of the
> existing code.
> 

It would be great if resizing with the cuckoo hash could be realized.
I am not quite familiar with the current status of the research. Are there
already any papers or ideas from recent years on resizing the cuckoo
hash algorithm?

> Also, I wonder if you have a specific use case that benefits from the
> gradual break-down approach, since we could always stop the world
> and resize the table at once.

Do you mean the on-demand (split-on-access) approach? Theoretically, it
does not save total time; it may even cost more because of the calculation
and the extra checking. It is only meant to smooth the rate when there are
already a lot of elements in the table, e.g. 1M. Traversing 1M elements and
relocating them into the right positions would be very time consuming.
Especially since the memory is dynamically allocated and not contiguous,
the cache may also be a problem with the linked list. I tested with a simple
case and the rate seemed smooth (no big jitter); there were also some other
actions in the case, and the hash actions occupied about 15% of the CPU.
And this is only for the list case. I think when using cuckoo hash, maybe
there is no need for such a trick 😊
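
To illustrate why the per-access cost stays bounded, here is a simplified
reconstruction of such an on-demand split in C. The names (struct hlist_entry,
hlist_find_and_split(), ...) are illustrative only; the actual rte_hlist patch
differs in its details:

#include <stdint.h>
#include <sys/queue.h>

#define MASK(s) ((1u << (s)) - 1u)

/* Illustrative types, not the actual rte_hlist definitions. */
struct hlist_entry {
        LIST_ENTRY(hlist_entry) next;
        uint32_t sig;           /* pre-calculated hash signature */
};

struct hlist_head {
        LIST_HEAD(, hlist_entry) head;
        uint32_t shift;         /* mask width this list was last split under */
};

struct htable {
        uint32_t global_shift;  /* header array holds 1 << global_shift lists */
        struct hlist_head *heads;
        /* When the array doubles, the new upper-half heads start empty
         * with shift set to the previous global_shift. */
};

/*
 * Locate the list owning "sig" and bring it up to date with the global
 * mask. Each pass moves only the entries of one list, so the per-access
 * cost is bounded by the per-list length limit, not by the table size.
 */
static struct hlist_head *
hlist_find_and_split(struct htable *t, uint32_t sig)
{
        uint32_t idx = sig & MASK(t->global_shift);
        struct hlist_head *h;

        /* Walk back to the ancestor list that still holds the entries. */
        while (t->heads[idx].shift < t->global_shift &&
               (idx & MASK(t->heads[idx].shift)) != idx)
                idx &= MASK(t->heads[idx].shift);
        h = &t->heads[idx];

        /* Split one bit at a time: 2/4/8... lists depending on the gap. */
        while (h->shift < t->global_shift) {
                uint32_t bit = 1u << h->shift;
                struct hlist_head *sib = &t->heads[idx | bit];
                struct hlist_entry *e = LIST_FIRST(&h->head);

                while (e != NULL) {
                        struct hlist_entry *nx = LIST_NEXT(e, next);

                        if (e->sig & bit) {
                                LIST_REMOVE(e, next);
                                LIST_INSERT_HEAD(&sib->head, e, next);
                        }
                        e = nx;
                }
                h->shift++;
                sib->shift = h->shift;
                if (sig & bit) {        /* continue in the half we own */
                        idx |= bit;
                        h = sib;
                }
        }
        return h;
}

A lookup or insert would first call hlist_find_and_split(), so only the one
list actually being touched ever pays the split cost.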

Thanks a lot

BR. Bing

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [RFC] hash: introduce resizable hash list
  2019-09-12  5:41   ` Bing Zhao
@ 2019-09-21  1:16     ` Wang, Yipeng1
  0 siblings, 0 replies; 10+ messages in thread
From: Wang, Yipeng1 @ 2019-09-21  1:16 UTC (permalink / raw)
  To: Bing Zhao, Gobriel, Sameh, Richardson, Bruce, De Lara Guarch, Pablo; +Cc: dev

Hi Bing, I am sorry for the late reply:

>> >1.8.3.1
>>
>> [Wang, Yipeng]
>>
>> Hi, Bing,
>>
>> Table resizing is very important and I think it is a critical feature to have
>> in DPDK's hash library. Thanks a lot for putting effort on this and I hope
>> we could work out a solution based on your patch.
>>
>
>Thanks. To my understanding, yes, it is very important. In particular, there
>are cases where the maximum capacity is quite large but only a few
>elements exist during one runtime life cycle.
>
>> My major concern of the current implementation is as following:
>>
>> Although we still have the fbk hash table implementation, I think the
>> goal here is to have the users to move to the main hash table
>> implementation (the cuckoo algorithm), since it is the better one in
>> most use cases. There have been many optimizations done on the
>> current code base since the cuckoo hash was introduced, and we do
>> not want to reinvent the wheel for things such as multi-threading
>> support, non-TSO machine support, SIMD optimizations, etc. Also,
>> compared to a linked list, the current bucketized table should
>> provide better lookup performance. Last but not least, we probably
>> don't want fragmented hash table APIs.
>>
>
>Yes, I see and I agree. I have done some basic study of the cuckoo hash,
>and it is quite a nice implementation, from the paper to the engineering.
>A lot of optimizations were done on both x86_64 and aarch64. A single
>solution will be much easier and less problematic for the users and
>maintainers.
>
>> I would suggest seeing first whether we could add a resizing feature
>> to the current hash table implementation.
>> For example, we can use the current bucketized design to do the
>> resizing rather than the new linked-list-based data structure.
>> We do need some modifications, such as having a local bucket_shift
>> variable per bucket, but I think it is feasible. We could also turn off
>> the cuckoo path/extendable linked list of the current implementation
>> when resizing is needed, so that the implementation would be
>> simpler.
>> This way, we keep the current API and also reuse much of the
>> existing code.
>>
>
>It would be great if resizing with the cuckoo hash could be realized.
>I am not quite familiar with the current status of the research. Are there
>already any papers or ideas from recent years on resizing the cuckoo
>hash algorithm?
>
[Wang, Yipeng] 
I am not aware of new papers focusing on hash table resizing, but I encountered
one resizing algorithm in the paper "Elastic Sketch: Adaptive and Fast Network-wide
Measurements". If you look at section 3.4, the authors first duplicate the hash table
and then gradually remove the duplicates. But I think the duplication would still be
too costly.

I guess what I suggested is to disable the cuckoo path while we resize. Just treat
it as a regular hash table; then the resizing would be similar to your linked-list
algorithm, right? You would just redistribute the contents of the buckets rather
than breaking down the linked lists.
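
For what it's worth, that redistribution step might look like the sketch below,
reusing the hypothetical struct bucket / struct table (and the includes) from
the sketch earlier in the thread; again an illustration, not the rte_hash code:

/* One split step for the bucket at index "idx"; the caller repeats it
 * while the bucket's local_shift still lags behind the global one. */
static int
bucket_split(struct table *t, uint32_t idx)
{
        struct bucket *b = t->buckets[idx];
        struct bucket *sib;
        uint32_t bit, a;
        uint8_t i, j, k;

        if (b->local_shift == t->global_shift)
                return 0;               /* nothing pending on this bucket */

        sib = calloc(1, sizeof(*sib));
        if (sib == NULL)
                return -1;
        bit = 1u << b->local_shift;
        b->local_shift++;
        sib->local_shift = b->local_shift;

        /* Keep the entries whose next signature bit is 0, move the rest;
         * no cuckoo displacement happens while this is going on. */
        for (i = 0, j = 0, k = 0; i < b->num_entries; i++) {
                if (b->sig[i] & bit) {
                        sib->sig[k] = b->sig[i];
                        sib->data[k++] = b->data[i];
                } else {
                        b->sig[j] = b->sig[i];
                        b->data[j++] = b->data[i];
                }
        }
        b->num_entries = j;
        sib->num_entries = k;

        /* Repoint every alias slot whose new split bit is set. */
        for (a = (idx & (bit - 1u)) | bit;
             a < (1u << t->global_shift); a += bit << 1)
                t->buckets[a] = sib;
        return 0;
}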

>> Also, I wonder if you have a specific use case that benefits from the
>> gradual break-down approach, since we could always stop the world
>> and resize the table at once.
>
>Do you mean the on-demand (split-on-access) approach? Theoretically, it
>does not save total time; it may even cost more because of the calculation
>and the extra checking. It is only meant to smooth the rate when there are
>already a lot of elements in the table, e.g. 1M. Traversing 1M elements and
>relocating them into the right positions would be very time consuming.
>Especially since the memory is dynamically allocated and not contiguous,
>the cache may also be a problem with the linked list. I tested with a simple
>case and the rate seemed smooth (no big jitter); there were also some other
>actions in the case, and the hash actions occupied about 15% of the CPU.
>And this is only for the list case. I think when using cuckoo hash, maybe
>there is no need for such a trick 😊
[Wang, Yipeng] 
I guess I was thinking that if you have a specific use case,
we could design the algorithm with that use case in mind.

Thank you !
Yipeng

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [dpdk-dev] [RFC] hash: introduce resizable hash list
  2019-08-28  6:51 [dpdk-dev] [RFC] hash: introduce resizable hash list Bing Zhao
  2019-08-28  6:51 ` [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib Bing Zhao
  2019-09-05 19:25 ` [dpdk-dev] [RFC] hash: introduce resizable hash list Wang, Yipeng1
@ 2023-06-12 16:44 ` Stephen Hemminger
  2023-06-13  6:34   ` Bing Zhao
  2 siblings, 1 reply; 10+ messages in thread
From: Stephen Hemminger @ 2023-06-12 16:44 UTC (permalink / raw)
  To: Bing Zhao
  Cc: yipeng1.wang, sameh.gobriel, bruce.richardson, pablo.de.lara.guarch, dev

On Wed, 28 Aug 2019 14:51:48 +0800
Bing Zhao <bingz@mellanox.com> wrote:

> In the current hash library, there are already two hash tables and
> several hash calculation functions.
> 
> FBK hash table is very lightweight, but the key length is limited to
> 4 bytes and the size of the table is fixed during startup.
> 
> Cuckoo hash table is a quite great idea and nice implementation in
> the library, it is efficient and flexible. After some study of the
> code and information from internet, I find that the table extension
> is some pain point (correct me if anything wrong). It means that we
> need to allocate the tables in advance by defining a maximum size.
> So there is a chance that we waste some unused memory. Right now
> there is an extendable buckets table, but it seems that the number
> is also fixed and the same as the primary table.
> Take the flows information for example, we may support as many as
> several millions of flows. In some case, only several hundreds of
> flows will be created but in the other case, millions of flows are
> needed. So this means that we need to create the hash table to
> support millions of elements in advance during the startup time. If
> one key of flow information will consume 64B and 1M flows, so it
> will occupy more than one hundred MB memory (the table fields
> included). What is worse if there is only 2M + several elements, it
> needs to create a table of 4M (or more: depends on the hash
> collision rate) and it is some wasting of the memory.
> 
> In order to handle this, an resizable hash list is introduced.
> The table could start with a small number of the elements and be
> allocated dynamically during the runtime. In the meanwhile, an
> on-demand list splitting mechanism is used in order not to make a
> significant performance degradation. Then there is no need to
> re-hash and relocate all the existing elements in the table when the
> table is extended.
> 
> The brief design is as below:
> 1. The table is consisted of LIST header array and the LIST entries.
>    In each entry, a pre-calculated hash signature is stored and is
>    used to decide which header should it be linked to, by using
>    "AND" with the mask to select the LSBs of the signature.
> 2. The header array size could start with a very small number, and a
>    specified max number of each list is used to check if a table
>    extension is needed.
> 3. When the max number on any of list is reached, the header array
>    size will be doubled. Then each entries linked to this list will
>    be split into two lists with the new mask (one more bit 1 in
>    the mask, e.g. 0xfff -> 0x1fff). And a global shift number and
>    local shift number of each list is used for the further checking.
> 4. When some other list is being accessed, a comparison for the shift
>    numbers is used to check if the splitting of this list is needed.
>    If so, then there will be two conditions:
>    a. If the local shift number is only 1 less than global or
>       non-zero, then this list is the one that needs to be split.
>    b. If more than 1, then it means that the table is extended more
>       than once. And If the local shift is zero, a mechanism is used
>       to find the last unsplit list.
>    And then the list will be split into 2/4/8... lists depends on
>    the gap. All the entries then will linked to the proper header.
> So, each time when the hlist APIs are called, only one list will be
> traversed but not all the lists. And since there is parameter to set
> a max number of entries in a list. The traversal time is predictable
> and these will not cause a significant performance degradation.
> 
> BR. Bing
> 
> 
> Bing Zhao (1):
>   rte_hash: introduce hash list into hash lib
> 
>  lib/librte_hash/Makefile             |   2 +
>  lib/librte_hash/rte_hash_version.map |  15 +
>  lib/librte_hash/rte_hlist.c          | 687 +++++++++++++++++++++++++++++++++++
>  lib/librte_hash/rte_hlist.h          | 563 ++++++++++++++++++++++++++++
>  4 files changed, 1267 insertions(+)
>  create mode 100644 lib/librte_hash/rte_hlist.c
>  create mode 100644 lib/librte_hash/rte_hlist.h
> 

A resizeable hash table (with RCU?) would be good to have.
See the old article about this: https://lwn.net/Articles/573431/
But this patch got review feedback and no follow-up.
Marking it as Changes Requested; maybe someone can resuscitate it later.
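
For reference, the simplest RCU-friendly variant (copy-and-publish, not the
gradual "unzip" the LWN article describes) could be sketched with DPDK's QSBR
RCU as below. The table layout and helper names are assumptions for
illustration; only rte_rcu_qsbr_quiescent() and rte_rcu_qsbr_synchronize()
are real librte_rcu calls:

#include <stdint.h>
#include <stdlib.h>
#include <rte_rcu_qsbr.h>

/* Hypothetical table published to readers through a single pointer. */
struct flat_table {
        uint32_t mask;          /* size - 1 */
        void *slots[];
};

static struct flat_table *live; /* loaded by readers with acquire order */
static struct rte_rcu_qsbr *qs; /* QSBR variable the readers report to */

/* Readers would load the table once per lookup and report quiescent
 * states from their polling loops:
 *     struct flat_table *t = __atomic_load_n(&live, __ATOMIC_ACQUIRE);
 *     ... lookup in t ...
 *     rte_rcu_qsbr_quiescent(qs, thread_id);
 */

static int
table_resize(uint32_t new_size)
{
        struct flat_table *old = live;
        struct flat_table *nt = calloc(1, sizeof(*nt) +
                                       new_size * sizeof(void *));

        if (nt == NULL)
                return -1;
        nt->mask = new_size - 1;
        /* ... re-insert every entry of "old" into "nt" ... */

        /* Publish the new table; readers move over at their own pace. */
        __atomic_store_n(&live, nt, __ATOMIC_RELEASE);

        /* Wait until every registered reader thread has passed through
         * a quiescent state, after which nobody can still hold "old". */
        rte_rcu_qsbr_synchronize(qs, RTE_QSBR_THRID_INVALID);
        free(old);
        return 0;
}

Readers are never blocked here; the writer pays the full copy, which is
exactly the stop-the-world cost the gradual approaches above try to avoid.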


^ permalink raw reply	[flat|nested] 10+ messages in thread

* RE: [dpdk-dev] [RFC] hash: introduce resizable hash list
  2023-06-12 16:44 ` Stephen Hemminger
@ 2023-06-13  6:34   ` Bing Zhao
  0 siblings, 0 replies; 10+ messages in thread
From: Bing Zhao @ 2023-06-13  6:34 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch, dev, NBU-Contact-Thomas Monjalon (EXTERNAL)

Hi Stephen,

> -----Original Message-----
> From: Stephen Hemminger <stephen@networkplumber.org>
> Sent: Tuesday, June 13, 2023 12:44 AM
> To: Bing Zhao <bingz@mellanox.com>
> Cc: yipeng1.wang@intel.com; sameh.gobriel@intel.com;
> bruce.richardson@intel.com; pablo.de.lara.guarch@intel.com; dev@dpdk.org
> Subject: Re: [dpdk-dev] [RFC] hash: introduce resizable hash list
> 
> On Wed, 28 Aug 2019 14:51:48 +0800
> Bing Zhao <bingz@mellanox.com> wrote:
> 
> > In the current hash library, there are already two hash tables and
> > several hash calculation functions.
> >
> > FBK hash table is very lightweight, but the key length is limited to
> > 4 bytes and the size of the table is fixed during startup.
> >
> > Cuckoo hash table is a quite great idea and nice implementation in the
> > library, it is efficient and flexible. After some study of the code
> > and information from internet, I find that the table extension is some
> > pain point (correct me if anything wrong). It means that we need to
> > allocate the tables in advance by defining a maximum size.
> > So there is a chance that we waste some unused memory. Right now there
> > is an extendable buckets table, but it seems that the number is also
> > fixed and the same as the primary table.
> > Take the flows information for example, we may support as many as
> > several millions of flows. In some case, only several hundreds of
> > flows will be created but in the other case, millions of flows are
> > needed. So this means that we need to create the hash table to support
> > millions of elements in advance during the startup time. If one key of
> > flow information will consume 64B and 1M flows, so it will occupy more
> > than one hundred MB memory (the table fields included). What is worse
> > if there is only 2M + several elements, it needs to create a table of
> > 4M (or more: depends on the hash collision rate) and it is some
> > wasting of the memory.
> >
> > In order to handle this, an resizable hash list is introduced.
> > The table could start with a small number of the elements and be
> > allocated dynamically during the runtime. In the meanwhile, an
> > on-demand list splitting mechanism is used in order not to make a
> > significant performance degradation. Then there is no need to re-hash
> > and relocate all the existing elements in the table when the table is
> > extended.
> >
> > The brief design is as below:
> > 1. The table is consisted of LIST header array and the LIST entries.
> >    In each entry, a pre-calculated hash signature is stored and is
> >    used to decide which header should it be linked to, by using
> >    "AND" with the mask to select the LSBs of the signature.
> > 2. The header array size could start with a very small number, and a
> >    specified max number of each list is used to check if a table
> >    extension is needed.
> > 3. When the max number on any of list is reached, the header array
> >    size will be doubled. Then each entries linked to this list will
> >    be split into two lists with the new mask (one more bit 1 in
> >    the mask, e.g. 0xfff -> 0x1fff). And a global shift number and
> >    local shift number of each list is used for the further checking.
> > 4. When some other list is being accessed, a comparison for the shift
> >    numbers is used to check if the splitting of this list is needed.
> >    If so, then there will be two conditions:
> >    a. If the local shift number is only 1 less than global or
> >       non-zero, then this list is the one that needs to be split.
> >    b. If more than 1, then it means that the table is extended more
> >       than once. And If the local shift is zero, a mechanism is used
> >       to find the last unsplit list.
> >    And then the list will be split into 2/4/8... lists depends on
> >    the gap. All the entries then will linked to the proper header.
> > So, each time when the hlist APIs are called, only one list will be
> > traversed but not all the lists. And since there is parameter to set a
> > max number of entries in a list. The traversal time is predictable and
> > these will not cause a significant performance degradation.
> >
> > BR. Bing
> >
> >
> > Bing Zhao (1):
> >   rte_hash: introduce hash list into hash lib
> >
> >  lib/librte_hash/Makefile             |   2 +
> >  lib/librte_hash/rte_hash_version.map |  15 +
> >  lib/librte_hash/rte_hlist.c          | 687 +++++++++++++++++++++++++++++++++++
> >  lib/librte_hash/rte_hlist.h          | 563 ++++++++++++++++++++++++++++
> >  4 files changed, 1267 insertions(+)
> >  create mode 100644 lib/librte_hash/rte_hlist.c
> >  create mode 100644 lib/librte_hash/rte_hlist.h
> >
> 
> A resizeable hash table (with RCU?) would be good to have.
> See the old article about this: https://lwn.net/Articles/573431/ But this patch
> got review feedback and no follow-up.
> Marking it as Changes Requested; maybe someone can resuscitate it later.

To my understanding, a resizable hash table may be needed by applications in real life. Sometimes the maximum number of entries is not known at startup, and it is wasteful to allocate a huge amount of memory (probably there would even be some insertion failures).
The extendable hash list would be the simplest way to go. While I am not a researcher, maybe there is some more advanced approach in the papers that we could translate into code. Any ideas?
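
As a concrete illustration of the sizing pain point with the current API
(the parameter values below are only an example):

#include <rte_hash.h>
#include <rte_jhash.h>

/* With the current API the capacity has to be chosen at creation time,
 * so an application expecting "maybe a few hundred, maybe millions" of
 * flows has to size for the worst case up front. */
static struct rte_hash *
create_flow_table(void)
{
        struct rte_hash_parameters params = {
                .name = "flow_table",
                .entries = 1 << 20,     /* worst case: ~1M flows */
                .key_len = 64,          /* e.g. a 64B flow key */
                .hash_func = rte_jhash,
                .hash_func_init_val = 0,
                .socket_id = 0,
        };

        /* Memory for all ~1M entries is reserved now, even if only a
         * few hundred flows ever show up in this run. */
        return rte_hash_create(&params);
}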

BR. Bing


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2023-06-14  9:04 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-08-28  6:51 [dpdk-dev] [RFC] hash: introduce resizable hash list Bing Zhao
2019-08-28  6:51 ` [dpdk-dev] [RFC] rte_hash: introduce hash list into hash lib Bing Zhao
2019-08-28 11:50   ` Stephen Hemminger
2019-08-28 11:51   ` Stephen Hemminger
2019-08-28 11:53   ` Stephen Hemminger
2019-09-05 19:25 ` [dpdk-dev] [RFC] hash: introduce resizable hash list Wang, Yipeng1
2019-09-12  5:41   ` Bing Zhao
2019-09-21  1:16     ` Wang, Yipeng1
2023-06-12 16:44 ` Stephen Hemminger
2023-06-13  6:34   ` Bing Zhao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).