DPDK patches and discussions
 help / color / mirror / Atom feed
* [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash
@ 2018-10-26  5:37 Honnappa Nagarahalli
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 1/5] hash: separate multi-writer from rw-concurrency Honnappa Nagarahalli
                   ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-26  5:37 UTC (permalink / raw)
  To: bruce.richardson, pablo.de.lara.guarch
  Cc: dev, yipeng1.wang, honnappa.nagarahalli, dharmik.thakkar, gavin.hu, nd

    Currently, reader-writer concurrency problems in rte_hash are
    addressed using reader-writer locks. Use of reader-writer locks
    results in following issues:
    
    1) In many of the use cases for the hash table, writer threads
       are running on control plane. If the writer is preempted while
       holding the lock, it will block the readers for an extended period
       resulting in packet drops. This problem seems to apply for platforms
       with transactional memory support as well because of the algorithm
       used for rte_rwlock_write_lock_tm:
    
       static inline void
       rte_rwlock_write_lock_tm(rte_rwlock_t *rwl)
       {
            if (likely(rte_try_tm(&rwl->cnt)))
                    return;
            rte_rwlock_write_lock(rwl);
       }
    
       i.e. there is a posibility of using rte_rwlock_write_lock in
       failure cases.
    2) Reader-writer lock based solution does not address the following
       issue.
       rte_hash_lookup_xxx APIs return the index of the element in
       the key store. Application(reader) can use that index to reference
       other data structures in its scope. Because of this, the
       index should not be freed till the application completes
       using the index.
    3) Since writer blocks all the readers, the hash lookup
       rate comes down significantly when there is activity on the writer.
       This happens even for unrelated entries. Performance numbers
       given below clearly indicate this.
    
    Lock-free solution is required to solve these problems. This patch
    series adds the lock-free capabilities in the following steps:

    1) Add support to not free the key-store index upon calling
       rte_hash_del_xxx APIs. This solves the issue in 2).
    
    2) Correct the alignment for the key store entry to prep for
       memory ordering.

    3) Add memory ordering to prevent race conditions when a new key
       is added to the table.
    
    4) Reader-writer concurrency issue, caused by moving the keys
       to their alternate locations during key insert, is solved
       by introducing an atomic global counter indicating a change
       in table.
    
    5) This solution also has to solve the issue of readers using
       key store element even after the key is deleted from
       control plane.
       To solve this issue, the hash_del_key_xxx APIs do not free
       the key store element when lock-free algorithm is enabled.
       The key store element has to be freed using the newly introduced
       rte_hash_free_key_with_position API. It needs to be called once
       all the readers have stopped using the key store element. How this
       is determined is outside the scope of this patch (RCU is one such
       mechanism that the application can use).
    
    6) Finally, a lock free reader-writer concurrency flag is added
       to enable this feature at run time.
    
    Performance numbers can be got from the additional test case added
    as part of this patch.

    v6->v7
    1) Rebased with latest master. V7 does not depend on any other
       patch.
    2) Corrected rte_hash_version.map file (Thomas)
    3) Changed 'Cc' in commit messages to stable@dpdk.org (Thomas)

    v5->v6
    1) Added lock-free RW concurrency tests to autotest_data.py and
       meson test (Bruce)
    2) Fixed Yipeng's email ID
    3) Added Bruce's Acked-by

    v4->v5
    1) Rebased with patch v8 of extendable hash bucket feature
       (http://patchwork.dpdk.org/patch/47196/)
    2) Changed 'success' to 'hit' and 'fail' to 'miss' in read/write
       concurrenct lock free test code (Yipeng)

    v3-v4
    1) Merged 4/7, 5/7 and 6/7 into 4/5
    2) Changed RTE_HASH_EXTRA_FLAGS_RECYCLE_ON_DEL to
       RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL (Yipeng)
    3) Changed the commit log for the patch
       "hash: correct key store element alignment" (Yipeng)
    4) Changed the comment for rte_hash_add_key_data API (Yipeng)
    5) Added bulk lookup for lock-free performance test case (Yipeng)
    5) Reduced the number of keys to 4M in the tests (Yipeng)

    v2->v3
    1) Rebased on top of:
       http://patchwork.dpdk.org/cover/45611/
       http://patchwork.dpdk.org/project/dpdk/list/?series=1822
    2) Added comments to RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF
       to indicate multi writer support (Yipeng)
    3) Updated the comments for rte_hash_add_key_data_xxx APIs
       to free the 'data' only after the readers have completed
       using 'data' (Yipeng)
    4) Extendable tables are not supported when lock free algorithm
       is requested.

    v1->v2
    1) Separate multi-writer capability from rw concurrency
    2) Add do not recycle on delete feature (Yipeng)
    3) Add Arm copyright
    4) Add test case to test lock-free algorithm and multi-writer
       test case (Yipeng)
    5) Additional API documentation to indicate RCU usage (Yipeng)
    6) Additional documentation on rte_hash_reset API (Yipeng)
    7) Allocate memory for the global counter and avoid API
       changes (Yipeng)

Dharmik Thakkar (1):
  test/hash: read-write lock-free concurrency test

Honnappa Nagarahalli (4):
  hash: separate multi-writer from rw-concurrency
  hash: support do not free on delete
  hash: fix key store element alignment
  hash: add lock-free read-write concurrency

 lib/librte_hash/rte_cuckoo_hash.c    |  520 +++++++++++----
 lib/librte_hash/rte_cuckoo_hash.h    |   21 +-
 lib/librte_hash/rte_hash.h           |   77 ++-
 lib/librte_hash/rte_hash_version.map |    7 +
 test/test/Makefile                   |    1 +
 test/test/autotest_data.py           |    6 +
 test/test/meson.build                |    2 +
 test/test/test_hash.c                |  140 +++-
 test/test/test_hash_readwrite.c      |    6 +-
 test/test/test_hash_readwrite_lf.c   | 1220 ++++++++++++++++++++++++++++++++++
 10 files changed, 1855 insertions(+), 145 deletions(-)
 create mode 100644 test/test/test_hash_readwrite_lf.c

-- 
2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v7 1/5] hash: separate multi-writer from rw-concurrency
  2018-10-26  5:37 [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Honnappa Nagarahalli
@ 2018-10-26  5:37 ` Honnappa Nagarahalli
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 2/5] hash: support do not free on delete Honnappa Nagarahalli
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-26  5:37 UTC (permalink / raw)
  To: bruce.richardson, pablo.de.lara.guarch
  Cc: dev, yipeng1.wang, honnappa.nagarahalli, dharmik.thakkar,
	gavin.hu, nd, stable

RW concurrency is required with single writer and multiple reader
usecase as well. Hence, multi-writer should not be enabled by default when
RW concurrency is enabled.

Fixes: f2e3001b53ec ("hash: support read/write concurrency")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Yipeng Wang <yipeng1.wang@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 49 +++++++++++++++++++++------------------
 lib/librte_hash/rte_cuckoo_hash.h |  8 +++++--
 test/test/test_hash_readwrite.c   |  6 +++--
 3 files changed, 37 insertions(+), 26 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 9a48934..3539a10 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -136,9 +136,10 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	char ext_ring_name[RTE_RING_NAMESIZE];
 	unsigned num_key_slots;
 	unsigned i;
-	unsigned int hw_trans_mem_support = 0, multi_writer_support = 0;
+	unsigned int hw_trans_mem_support = 0, use_local_cache = 0;
 	unsigned int ext_table_support = 0;
 	unsigned int readwrite_concur_support = 0;
+	unsigned int writer_takes_lock = 0;
 
 	rte_hash_function default_hash_func = (rte_hash_function)rte_jhash;
 
@@ -162,19 +163,21 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT)
 		hw_trans_mem_support = 1;
 
-	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD)
-		multi_writer_support = 1;
+	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD) {
+		use_local_cache = 1;
+		writer_takes_lock = 1;
+	}
 
 	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY) {
 		readwrite_concur_support = 1;
-		multi_writer_support = 1;
+		writer_takes_lock = 1;
 	}
 
 	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_EXT_TABLE)
 		ext_table_support = 1;
 
 	/* Store all keys and leave the first entry as a dummy entry for lookup_bulk */
-	if (multi_writer_support)
+	if (use_local_cache)
 		/*
 		 * Increase number of slots by total number of indices
 		 * that can be stored in the lcore caches
@@ -322,7 +325,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	h->cmp_jump_table_idx = KEY_OTHER_BYTES;
 #endif
 
-	if (multi_writer_support) {
+	if (use_local_cache) {
 		h->local_free_slots = rte_zmalloc_socket(NULL,
 				sizeof(struct lcore_cache) * RTE_MAX_LCORE,
 				RTE_CACHE_LINE_SIZE, params->socket_id);
@@ -352,9 +355,10 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	h->key_store = k;
 	h->free_slots = r;
 	h->hw_trans_mem_support = hw_trans_mem_support;
-	h->multi_writer_support = multi_writer_support;
+	h->use_local_cache = use_local_cache;
 	h->readwrite_concur_support = readwrite_concur_support;
 	h->ext_table_support = ext_table_support;
+	h->writer_takes_lock = writer_takes_lock;
 
 #if defined(RTE_ARCH_X86)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
@@ -363,10 +367,11 @@ rte_hash_create(const struct rte_hash_parameters *params)
 #endif
 		h->sig_cmp_fn = RTE_HASH_COMPARE_SCALAR;
 
-	/* Turn on multi-writer only with explicit flag from user and TM
-	 * support.
+	/* Writer threads need to take the lock when:
+	 * 1) RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY is enabled OR
+	 * 2) RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD is enabled
 	 */
-	if (h->multi_writer_support) {
+	if (h->writer_takes_lock) {
 		h->readwrite_lock = rte_malloc(NULL, sizeof(rte_rwlock_t),
 						RTE_CACHE_LINE_SIZE);
 		if (h->readwrite_lock == NULL)
@@ -425,10 +430,10 @@ rte_hash_free(struct rte_hash *h)
 
 	rte_rwlock_write_unlock(RTE_EAL_TAILQ_RWLOCK);
 
-	if (h->multi_writer_support) {
+	if (h->use_local_cache)
 		rte_free(h->local_free_slots);
+	if (h->writer_takes_lock)
 		rte_free(h->readwrite_lock);
-	}
 	rte_ring_free(h->free_slots);
 	rte_ring_free(h->free_ext_bkts);
 	rte_free(h->key_store);
@@ -454,7 +459,7 @@ rte_hash_count(const struct rte_hash *h)
 	if (h == NULL)
 		return -EINVAL;
 
-	if (h->multi_writer_support) {
+	if (h->use_local_cache) {
 		tot_ring_cnt = h->entries + (RTE_MAX_LCORE - 1) *
 					(LCORE_CACHE_SIZE - 1);
 		for (i = 0; i < RTE_MAX_LCORE; i++)
@@ -473,9 +478,9 @@ rte_hash_count(const struct rte_hash *h)
 static inline void
 __hash_rw_writer_lock(const struct rte_hash *h)
 {
-	if (h->multi_writer_support && h->hw_trans_mem_support)
+	if (h->writer_takes_lock && h->hw_trans_mem_support)
 		rte_rwlock_write_lock_tm(h->readwrite_lock);
-	else if (h->multi_writer_support)
+	else if (h->writer_takes_lock)
 		rte_rwlock_write_lock(h->readwrite_lock);
 }
 
@@ -491,9 +496,9 @@ __hash_rw_reader_lock(const struct rte_hash *h)
 static inline void
 __hash_rw_writer_unlock(const struct rte_hash *h)
 {
-	if (h->multi_writer_support && h->hw_trans_mem_support)
+	if (h->writer_takes_lock && h->hw_trans_mem_support)
 		rte_rwlock_write_unlock_tm(h->readwrite_lock);
-	else if (h->multi_writer_support)
+	else if (h->writer_takes_lock)
 		rte_rwlock_write_unlock(h->readwrite_lock);
 }
 
@@ -532,7 +537,7 @@ rte_hash_reset(struct rte_hash *h)
 	}
 
 	/* Repopulate the free slots ring. Entry zero is reserved for key misses */
-	if (h->multi_writer_support)
+	if (h->use_local_cache)
 		tot_ring_cnt = h->entries + (RTE_MAX_LCORE - 1) *
 					(LCORE_CACHE_SIZE - 1);
 	else
@@ -548,7 +553,7 @@ rte_hash_reset(struct rte_hash *h)
 						(void *)((uintptr_t) i));
 	}
 
-	if (h->multi_writer_support) {
+	if (h->use_local_cache) {
 		/* Reset local caches per lcore */
 		for (i = 0; i < RTE_MAX_LCORE; i++)
 			h->local_free_slots[i].len = 0;
@@ -566,7 +571,7 @@ enqueue_slot_back(const struct rte_hash *h,
 		struct lcore_cache *cached_free_slots,
 		void *slot_id)
 {
-	if (h->multi_writer_support) {
+	if (h->use_local_cache) {
 		cached_free_slots->objs[cached_free_slots->len] = slot_id;
 		cached_free_slots->len++;
 	} else
@@ -848,7 +853,7 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 	__hash_rw_writer_unlock(h);
 
 	/* Did not find a match, so get a new slot for storing the new key */
-	if (h->multi_writer_support) {
+	if (h->use_local_cache) {
 		lcore_id = rte_lcore_id();
 		cached_free_slots = &h->local_free_slots[lcore_id];
 		/* Try to get a free slot from the local cache */
@@ -1117,7 +1122,7 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 	struct lcore_cache *cached_free_slots;
 
 	bkt->sig_current[i] = NULL_SIGNATURE;
-	if (h->multi_writer_support) {
+	if (h->use_local_cache) {
 		lcore_id = rte_lcore_id();
 		cached_free_slots = &h->local_free_slots[lcore_id];
 		/* Cache full, need to free it. */
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 7753cd8..8c522ac 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -161,11 +161,15 @@ struct rte_hash {
 	/**< Length of hash key. */
 	uint8_t hw_trans_mem_support;
 	/**< If hardware transactional memory is used. */
-	uint8_t multi_writer_support;
-	/**< If multi-writer support is enabled. */
+	uint8_t use_local_cache;
+	/**< If multi-writer support is enabled, use local cache
+	 * to allocate key-store slots.
+	 */
 	uint8_t readwrite_concur_support;
 	/**< If read-write concurrency support is enabled */
 	uint8_t ext_table_support;     /**< Enable extendable bucket table */
+	uint8_t writer_takes_lock;
+	/**< Indicates if the writer threads need to take lock */
 	rte_hash_function hash_func;    /**< Function used to calculate hash. */
 	uint32_t hash_func_init_val;    /**< Init value used by hash_func. */
 	rte_hash_cmp_eq_t rte_hash_custom_cmp_eq;
diff --git a/test/test/test_hash_readwrite.c b/test/test/test_hash_readwrite.c
index 2a4f7b9..a8fadd0 100644
--- a/test/test/test_hash_readwrite.c
+++ b/test/test/test_hash_readwrite.c
@@ -122,10 +122,12 @@ init_params(int use_htm, int use_jhash)
 	if (use_htm)
 		hash_params.extra_flag =
 			RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT |
-			RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY;
+			RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY |
+			RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD;
 	else
 		hash_params.extra_flag =
-			RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY;
+			RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY |
+			RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD;
 
 	hash_params.name = "tests";
 
-- 
2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v7 2/5] hash: support do not free on delete
  2018-10-26  5:37 [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Honnappa Nagarahalli
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 1/5] hash: separate multi-writer from rw-concurrency Honnappa Nagarahalli
@ 2018-10-26  5:37 ` Honnappa Nagarahalli
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 3/5] hash: fix key store element alignment Honnappa Nagarahalli
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-26  5:37 UTC (permalink / raw)
  To: bruce.richardson, pablo.de.lara.guarch
  Cc: dev, yipeng1.wang, honnappa.nagarahalli, dharmik.thakkar, gavin.hu, nd

rte_hash_lookup_xxx APIs return the index of slot in
the key store. Application(reader) can use that index to reference
other data structures in its scope. Because of this, the
index should not be freed till the application completes
using the index.
RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL is introduced to support this.
When this flag is enabled rte_hash_del_xxx APIs do not free the
key-store index/internal memory associated with the deleted
entry. The new API rte_hash_free_key_with_position should be called
to free the key-store index/internal memory after calling
rte_hash_del_xxx APIs.

Suggested-by: Yipeng Wang <yipeng1.wang@intel.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Yipeng Wang <yipeng1.wang@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c    |  50 ++++++++++++-
 lib/librte_hash/rte_cuckoo_hash.h    |   6 ++
 lib/librte_hash/rte_hash.h           |  40 ++++++++++
 lib/librte_hash/rte_hash_version.map |   7 ++
 test/test/test_hash.c                | 140 ++++++++++++++++++++++++++++++++++-
 5 files changed, 239 insertions(+), 4 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 3539a10..e087393 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -140,6 +140,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	unsigned int ext_table_support = 0;
 	unsigned int readwrite_concur_support = 0;
 	unsigned int writer_takes_lock = 0;
+	unsigned int no_free_on_del = 0;
 
 	rte_hash_function default_hash_func = (rte_hash_function)rte_jhash;
 
@@ -176,6 +177,9 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_EXT_TABLE)
 		ext_table_support = 1;
 
+	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL)
+		no_free_on_del = 1;
+
 	/* Store all keys and leave the first entry as a dummy entry for lookup_bulk */
 	if (use_local_cache)
 		/*
@@ -359,6 +363,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	h->readwrite_concur_support = readwrite_concur_support;
 	h->ext_table_support = ext_table_support;
 	h->writer_takes_lock = writer_takes_lock;
+	h->no_free_on_del = no_free_on_del;
 
 #if defined(RTE_ARCH_X86)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
@@ -1121,7 +1126,6 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 	unsigned lcore_id, n_slots;
 	struct lcore_cache *cached_free_slots;
 
-	bkt->sig_current[i] = NULL_SIGNATURE;
 	if (h->use_local_cache) {
 		lcore_id = rte_lcore_id();
 		cached_free_slots = &h->local_free_slots[lcore_id];
@@ -1183,7 +1187,12 @@ search_and_remove(const struct rte_hash *h, const void *key,
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
-				remove_entry(h, bkt, i);
+				bkt->sig_current[i] = NULL_SIGNATURE;
+				/* Free the key store index if
+				 * no_free_on_del is disabled.
+				 */
+				if (!h->no_free_on_del)
+					remove_entry(h, bkt, i);
 
 				/* Return index where key is stored,
 				 * subtracting the first dummy index
@@ -1301,6 +1310,43 @@ rte_hash_get_key_with_position(const struct rte_hash *h, const int32_t position,
 	return 0;
 }
 
+int __rte_experimental
+rte_hash_free_key_with_position(const struct rte_hash *h,
+				const int32_t position)
+{
+	RETURN_IF_TRUE(((h == NULL) || (position == EMPTY_SLOT)), -EINVAL);
+
+	unsigned int lcore_id, n_slots;
+	struct lcore_cache *cached_free_slots;
+	const int32_t total_entries = h->num_buckets * RTE_HASH_BUCKET_ENTRIES;
+
+	/* Out of bounds */
+	if (position >= total_entries)
+		return -EINVAL;
+
+	if (h->use_local_cache) {
+		lcore_id = rte_lcore_id();
+		cached_free_slots = &h->local_free_slots[lcore_id];
+		/* Cache full, need to free it. */
+		if (cached_free_slots->len == LCORE_CACHE_SIZE) {
+			/* Need to enqueue the free slots in global ring. */
+			n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
+						cached_free_slots->objs,
+						LCORE_CACHE_SIZE, NULL);
+			cached_free_slots->len -= n_slots;
+		}
+		/* Put index of new free slot in cache. */
+		cached_free_slots->objs[cached_free_slots->len] =
+					(void *)((uintptr_t)position);
+		cached_free_slots->len++;
+	} else {
+		rte_ring_sp_enqueue(h->free_slots,
+				(void *)((uintptr_t)position));
+	}
+
+	return 0;
+}
+
 static inline void
 compare_signatures(uint32_t *prim_hash_matches, uint32_t *sec_hash_matches,
 			const struct rte_hash_bucket *prim_bkt,
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 8c522ac..ff95181 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -168,6 +168,12 @@ struct rte_hash {
 	uint8_t readwrite_concur_support;
 	/**< If read-write concurrency support is enabled */
 	uint8_t ext_table_support;     /**< Enable extendable bucket table */
+	uint8_t no_free_on_del;
+	/**< If key index should be freed on calling rte_hash_del_xxx APIs.
+	 * If this is set, rte_hash_free_key_with_position must be called to
+	 * free the key index associated with the deleted entry.
+	 * This flag is enabled by default.
+	 */
 	uint8_t writer_takes_lock;
 	/**< Indicates if the writer threads need to take lock */
 	rte_hash_function hash_func;    /**< Function used to calculate hash. */
diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index 6ace64e..dfa542b 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -14,6 +14,8 @@
 #include <stdint.h>
 #include <stddef.h>
 
+#include <rte_compat.h>
+
 #ifdef __cplusplus
 extern "C" {
 #endif
@@ -40,6 +42,11 @@ extern "C" {
 /** Flag to indicate the extendabe bucket table feature should be used */
 #define RTE_HASH_EXTRA_FLAGS_EXT_TABLE 0x08
 
+/** Flag to disable freeing of key index on hash delete.
+ * Refer to rte_hash_del_xxx APIs for more details.
+ */
+#define RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL 0x10
+
 /**
  * The type of hash value of a key.
  * It should be a value of at least 32bit with fully random pattern.
@@ -236,6 +243,10 @@ rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key, hash_sig_t
  * and should only be called from one thread by default.
  * Thread safety can be enabled by setting flag during
  * table creation.
+ * If RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL is enabled,
+ * the key index returned by rte_hash_add_key_xxx APIs will not be
+ * freed by this API. rte_hash_free_key_with_position API must be called
+ * additionally to free the index associated with the key.
  *
  * @param h
  *   Hash table to remove the key from.
@@ -257,6 +268,10 @@ rte_hash_del_key(const struct rte_hash *h, const void *key);
  * and should only be called from one thread by default.
  * Thread safety can be enabled by setting flag during
  * table creation.
+ * If RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL is enabled,
+ * the key index returned by rte_hash_add_key_xxx APIs will not be
+ * freed by this API. rte_hash_free_key_with_position API must be called
+ * additionally to free the index associated with the key.
  *
  * @param h
  *   Hash table to remove the key from.
@@ -296,6 +311,31 @@ rte_hash_get_key_with_position(const struct rte_hash *h, const int32_t position,
 			       void **key);
 
 /**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Free a hash key in the hash table given the position
+ * of the key. This operation is not multi-thread safe and should
+ * only be called from one thread by default. Thread safety
+ * can be enabled by setting flag during table creation.
+ * If RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL is enabled,
+ * this API must be called, with the key index returned by rte_hash_add_key_xxx
+ * APIs, after the key is deleted using rte_hash_del_key_xxx APIs.
+ * This API does not validate if the key is already freed.
+ *
+ * @param h
+ *   Hash table to free the key from.
+ * @param position
+ *   Position returned when the key was deleted.
+ * @return
+ *   - 0 if freed successfully
+ *   - -EINVAL if the parameters are invalid.
+ */
+int __rte_experimental
+rte_hash_free_key_with_position(const struct rte_hash *h,
+				const int32_t position);
+
+/**
  * Find a key-value pair in the hash table.
  * This operation is multi-thread safe with regarding to other lookup threads.
  * Read-write concurrency can be enabled by setting flag during
diff --git a/lib/librte_hash/rte_hash_version.map b/lib/librte_hash/rte_hash_version.map
index e216ac8..734ae28 100644
--- a/lib/librte_hash/rte_hash_version.map
+++ b/lib/librte_hash/rte_hash_version.map
@@ -53,3 +53,10 @@ DPDK_18.08 {
 	rte_hash_count;
 
 } DPDK_16.07;
+
+EXPERIMENTAL {
+	global:
+
+	rte_hash_free_key_with_position;
+
+};
diff --git a/test/test/test_hash.c b/test/test/test_hash.c
index 815c734..6d06eb2 100644
--- a/test/test/test_hash.c
+++ b/test/test/test_hash.c
@@ -260,6 +260,13 @@ static void run_hash_func_tests(void)
  *	- lookup (hit)
  *	- delete
  *	- lookup (miss)
+ *
+ * Repeat the test case when 'free on delete' is disabled.
+ *	- add
+ *	- lookup (hit)
+ *	- delete
+ *	- lookup (miss)
+ *	- free
  */
 static int test_add_delete(void)
 {
@@ -295,10 +302,12 @@ static int test_add_delete(void)
 
 	/* repeat test with precomputed hash functions */
 	hash_sig_t hash_value;
-	int pos1, expectedPos1;
+	int pos1, expectedPos1, delPos1;
 
+	ut_params.extra_flag = RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL;
 	handle = rte_hash_create(&ut_params);
 	RETURN_IF_ERROR(handle == NULL, "hash creation failed");
+	ut_params.extra_flag = 0;
 
 	hash_value = rte_hash_hash(handle, &keys[0]);
 	pos1 = rte_hash_add_key_with_hash(handle, &keys[0], hash_value);
@@ -315,12 +324,18 @@ static int test_add_delete(void)
 	print_key_info("Del", &keys[0], pos1);
 	RETURN_IF_ERROR(pos1 != expectedPos1,
 			"failed to delete key (pos1=%d)", pos1);
+	delPos1 = pos1;
 
 	pos1 = rte_hash_lookup_with_hash(handle, &keys[0], hash_value);
 	print_key_info("Lkp", &keys[0], pos1);
 	RETURN_IF_ERROR(pos1 != -ENOENT,
 			"fail: found key after deleting! (pos1=%d)", pos1);
 
+	pos1 = rte_hash_free_key_with_position(handle, delPos1);
+	print_key_info("Free", &keys[0], delPos1);
+	RETURN_IF_ERROR(pos1 != 0,
+			"failed to free key (pos1=%d)", delPos1);
+
 	rte_hash_free(handle);
 
 	return 0;
@@ -391,6 +406,84 @@ static int test_add_update_delete(void)
 }
 
 /*
+ * Sequence of operations for a single key with 'disable free on del' set:
+ *	- delete: miss
+ *	- add
+ *	- lookup: hit
+ *	- add: update
+ *	- lookup: hit (updated data)
+ *	- delete: hit
+ *	- delete: miss
+ *	- lookup: miss
+ *	- free: hit
+ *	- lookup: miss
+ */
+static int test_add_update_delete_free(void)
+{
+	struct rte_hash *handle;
+	int pos0, expectedPos0, delPos0, result;
+
+	ut_params.name = "test2";
+	ut_params.extra_flag = RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL;
+	handle = rte_hash_create(&ut_params);
+	RETURN_IF_ERROR(handle == NULL, "hash creation failed");
+	ut_params.extra_flag = 0;
+
+	pos0 = rte_hash_del_key(handle, &keys[0]);
+	print_key_info("Del", &keys[0], pos0);
+	RETURN_IF_ERROR(pos0 != -ENOENT,
+			"fail: found non-existent key (pos0=%d)", pos0);
+
+	pos0 = rte_hash_add_key(handle, &keys[0]);
+	print_key_info("Add", &keys[0], pos0);
+	RETURN_IF_ERROR(pos0 < 0, "failed to add key (pos0=%d)", pos0);
+	expectedPos0 = pos0;
+
+	pos0 = rte_hash_lookup(handle, &keys[0]);
+	print_key_info("Lkp", &keys[0], pos0);
+	RETURN_IF_ERROR(pos0 != expectedPos0,
+			"failed to find key (pos0=%d)", pos0);
+
+	pos0 = rte_hash_add_key(handle, &keys[0]);
+	print_key_info("Add", &keys[0], pos0);
+	RETURN_IF_ERROR(pos0 != expectedPos0,
+			"failed to re-add key (pos0=%d)", pos0);
+
+	pos0 = rte_hash_lookup(handle, &keys[0]);
+	print_key_info("Lkp", &keys[0], pos0);
+	RETURN_IF_ERROR(pos0 != expectedPos0,
+			"failed to find key (pos0=%d)", pos0);
+
+	delPos0 = rte_hash_del_key(handle, &keys[0]);
+	print_key_info("Del", &keys[0], delPos0);
+	RETURN_IF_ERROR(delPos0 != expectedPos0,
+			"failed to delete key (pos0=%d)", delPos0);
+
+	pos0 = rte_hash_del_key(handle, &keys[0]);
+	print_key_info("Del", &keys[0], pos0);
+	RETURN_IF_ERROR(pos0 != -ENOENT,
+			"fail: deleted already deleted key (pos0=%d)", pos0);
+
+	pos0 = rte_hash_lookup(handle, &keys[0]);
+	print_key_info("Lkp", &keys[0], pos0);
+	RETURN_IF_ERROR(pos0 != -ENOENT,
+			"fail: found key after deleting! (pos0=%d)", pos0);
+
+	result = rte_hash_free_key_with_position(handle, delPos0);
+	print_key_info("Free", &keys[0], delPos0);
+	RETURN_IF_ERROR(result != 0,
+			"failed to free key (pos1=%d)", delPos0);
+
+	pos0 = rte_hash_lookup(handle, &keys[0]);
+	print_key_info("Lkp", &keys[0], pos0);
+	RETURN_IF_ERROR(pos0 != -ENOENT,
+			"fail: found key after deleting! (pos0=%d)", pos0);
+
+	rte_hash_free(handle);
+	return 0;
+}
+
+/*
  * Sequence of operations for retrieving a key with its position
  *
  *  - create table
@@ -399,11 +492,20 @@ static int test_add_update_delete(void)
  *  - delete key
  *  - try to get the deleted key: miss
  *
+ * Repeat the test case when 'free on delete' is disabled.
+ *  - create table
+ *  - add key
+ *  - get the key with its position: hit
+ *  - delete key
+ *  - try to get the deleted key: hit
+ *  - free key
+ *  - try to get the deleted key: miss
+ *
  */
 static int test_hash_get_key_with_position(void)
 {
 	struct rte_hash *handle = NULL;
-	int pos, expectedPos, result;
+	int pos, expectedPos, delPos, result;
 	void *key;
 
 	ut_params.name = "hash_get_key_w_pos";
@@ -427,6 +529,38 @@ static int test_hash_get_key_with_position(void)
 	RETURN_IF_ERROR(result != -ENOENT, "non valid key retrieved");
 
 	rte_hash_free(handle);
+
+	ut_params.name = "hash_get_key_w_pos";
+	ut_params.extra_flag = RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL;
+	handle = rte_hash_create(&ut_params);
+	RETURN_IF_ERROR(handle == NULL, "hash creation failed");
+	ut_params.extra_flag = 0;
+
+	pos = rte_hash_add_key(handle, &keys[0]);
+	print_key_info("Add", &keys[0], pos);
+	RETURN_IF_ERROR(pos < 0, "failed to add key (pos0=%d)", pos);
+	expectedPos = pos;
+
+	result = rte_hash_get_key_with_position(handle, pos, &key);
+	RETURN_IF_ERROR(result != 0, "error retrieving a key");
+
+	delPos = rte_hash_del_key(handle, &keys[0]);
+	print_key_info("Del", &keys[0], delPos);
+	RETURN_IF_ERROR(delPos != expectedPos,
+			"failed to delete key (pos0=%d)", delPos);
+
+	result = rte_hash_get_key_with_position(handle, delPos, &key);
+	RETURN_IF_ERROR(result != -ENOENT, "non valid key retrieved");
+
+	result = rte_hash_free_key_with_position(handle, delPos);
+	print_key_info("Free", &keys[0], delPos);
+	RETURN_IF_ERROR(result != 0,
+			"failed to free key (pos1=%d)", delPos);
+
+	result = rte_hash_get_key_with_position(handle, delPos, &key);
+	RETURN_IF_ERROR(result != -ENOENT, "non valid key retrieved");
+
+	rte_hash_free(handle);
 	return 0;
 }
 
@@ -1609,6 +1743,8 @@ test_hash(void)
 		return -1;
 	if (test_add_update_delete() < 0)
 		return -1;
+	if (test_add_update_delete_free() < 0)
+		return -1;
 	if (test_five_keys() < 0)
 		return -1;
 	if (test_full_bucket() < 0)
-- 
2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v7 3/5] hash: fix key store element alignment
  2018-10-26  5:37 [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Honnappa Nagarahalli
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 1/5] hash: separate multi-writer from rw-concurrency Honnappa Nagarahalli
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 2/5] hash: support do not free on delete Honnappa Nagarahalli
@ 2018-10-26  5:37 ` Honnappa Nagarahalli
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency Honnappa Nagarahalli
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-26  5:37 UTC (permalink / raw)
  To: bruce.richardson, pablo.de.lara.guarch
  Cc: dev, yipeng1.wang, honnappa.nagarahalli, dharmik.thakkar,
	gavin.hu, nd, stable

Fix the key store array element alignment such that every array
element is aligned on KEY_ALIGNMENT boundary. This is required to
make 'pdata' in 'struct rte_hash_key' align on its natural boundary
for atomic load/store.

Fixes: 473d1bebce43 ("hash: allow to store data in hash table")
Cc: stable@dpdk.org

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Yipeng Wang <yipeng1.wang@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 4 +++-
 lib/librte_hash/rte_cuckoo_hash.h | 2 +-
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index e087393..d79ba68 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -279,7 +279,9 @@ rte_hash_create(const struct rte_hash_parameters *params)
 			rte_ring_sp_enqueue(r_ext, (void *)((uintptr_t) i));
 	}
 
-	const uint32_t key_entry_size = sizeof(struct rte_hash_key) + params->key_len;
+	const uint32_t key_entry_size =
+		RTE_ALIGN(sizeof(struct rte_hash_key) + params->key_len,
+			  KEY_ALIGNMENT);
 	const uint64_t key_tbl_size = (uint64_t) key_entry_size * num_key_slots;
 
 	k = rte_zmalloc_socket(NULL, key_tbl_size,
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index ff95181..601b2ce 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -123,7 +123,7 @@ struct rte_hash_key {
 	};
 	/* Variable key size */
 	char key[0];
-} __attribute__((aligned(KEY_ALIGNMENT)));
+};
 
 /* All different signature compare functions */
 enum rte_hash_sig_compare_function {
-- 
2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-10-26  5:37 [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Honnappa Nagarahalli
                   ` (2 preceding siblings ...)
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 3/5] hash: fix key store element alignment Honnappa Nagarahalli
@ 2018-10-26  5:37 ` Honnappa Nagarahalli
  2018-11-03 11:52   ` Jerin Jacob
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 5/5] test/hash: read-write lock-free concurrency test Honnappa Nagarahalli
  2018-10-26 10:52 ` [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Thomas Monjalon
  5 siblings, 1 reply; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-26  5:37 UTC (permalink / raw)
  To: bruce.richardson, pablo.de.lara.guarch
  Cc: dev, yipeng1.wang, honnappa.nagarahalli, dharmik.thakkar, gavin.hu, nd

Add lock-free read-write concurrency. This is achieved by the
following changes.

1) Add memory ordering to avoid race conditions. The only race
condition that can occur is -  using the key store element
before the key write is completed. Hence, while inserting the element
the release memory order is used. Any other race condition is caught
by the key comparison. Memory orderings are added only where needed.
For ex: reads in the writer's context do not need memory ordering
as there is a single writer.

key_idx in the bucket entry and pdata in the key store element are
used for synchronisation. key_idx is used to release an inserted
entry in the bucket to the reader. Use of pdata for synchronisation
is required due to updation of an existing entry where-in only
the pdata is updated without updating key_idx.

2) Reader-writer concurrency issue, caused by moving the keys
to their alternative locations during key insert, is solved
by introducing a global counter(tbl_chng_cnt) indicating a
change in table.

3) Add the flag to enable reader-writer concurrency during
run time.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ola Liljedahl <ola.liljedahl@arm.com>
Reviewed-by: Steve Capper <steve.capper@arm.com>
Reviewed-by: Yipeng Wang <yipeng1.wang@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 417 ++++++++++++++++++++++++++++----------
 lib/librte_hash/rte_cuckoo_hash.h |   5 +
 lib/librte_hash/rte_hash.h        |  47 ++++-
 3 files changed, 351 insertions(+), 118 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index d79ba68..0648a22 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2010-2016 Intel Corporation
+ * Copyright(c) 2018 Arm Limited
  */
 
 #include <string.h>
@@ -141,6 +142,8 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	unsigned int readwrite_concur_support = 0;
 	unsigned int writer_takes_lock = 0;
 	unsigned int no_free_on_del = 0;
+	uint32_t *tbl_chng_cnt = NULL;
+	unsigned int readwrite_concur_lf_support = 0;
 
 	rte_hash_function default_hash_func = (rte_hash_function)rte_jhash;
 
@@ -160,6 +163,24 @@ rte_hash_create(const struct rte_hash_parameters *params)
 		return NULL;
 	}
 
+	/* Validate correct usage of extra options */
+	if ((params->extra_flag & RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY) &&
+	    (params->extra_flag & RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF)) {
+		rte_errno = EINVAL;
+		RTE_LOG(ERR, HASH, "rte_hash_create: choose rw concurrency or "
+			"rw concurrency lock free\n");
+		return NULL;
+	}
+
+	if ((params->extra_flag & RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF) &&
+	    (params->extra_flag & RTE_HASH_EXTRA_FLAGS_EXT_TABLE)) {
+		rte_errno = EINVAL;
+		RTE_LOG(ERR, HASH, "rte_hash_create: extendable bucket "
+			"feature not supported with rw concurrency "
+			"lock free\n");
+		return NULL;
+	}
+
 	/* Check extra flags field to check extra options. */
 	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT)
 		hw_trans_mem_support = 1;
@@ -180,6 +201,12 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL)
 		no_free_on_del = 1;
 
+	if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF) {
+		readwrite_concur_lf_support = 1;
+		/* Enable not freeing internal memory/index on delete */
+		no_free_on_del = 1;
+	}
+
 	/* Store all keys and leave the first entry as a dummy entry for lookup_bulk */
 	if (use_local_cache)
 		/*
@@ -292,6 +319,14 @@ rte_hash_create(const struct rte_hash_parameters *params)
 		goto err_unlock;
 	}
 
+	tbl_chng_cnt = rte_zmalloc_socket(NULL, sizeof(uint32_t),
+			RTE_CACHE_LINE_SIZE, params->socket_id);
+
+	if (tbl_chng_cnt == NULL) {
+		RTE_LOG(ERR, HASH, "memory allocation failed\n");
+		goto err_unlock;
+	}
+
 /*
  * If x86 architecture is used, select appropriate compare function,
  * which may use x86 intrinsics, otherwise use memcmp
@@ -360,12 +395,15 @@ rte_hash_create(const struct rte_hash_parameters *params)
 		default_hash_func : params->hash_func;
 	h->key_store = k;
 	h->free_slots = r;
+	h->tbl_chng_cnt = tbl_chng_cnt;
+	*h->tbl_chng_cnt = 0;
 	h->hw_trans_mem_support = hw_trans_mem_support;
 	h->use_local_cache = use_local_cache;
 	h->readwrite_concur_support = readwrite_concur_support;
 	h->ext_table_support = ext_table_support;
 	h->writer_takes_lock = writer_takes_lock;
 	h->no_free_on_del = no_free_on_del;
+	h->readwrite_concur_lf_support = readwrite_concur_lf_support;
 
 #if defined(RTE_ARCH_X86)
 	if (rte_cpu_get_flag_enabled(RTE_CPUFLAG_SSE2))
@@ -406,6 +444,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	rte_free(buckets);
 	rte_free(buckets_ext);
 	rte_free(k);
+	rte_free(tbl_chng_cnt);
 	return NULL;
 }
 
@@ -446,6 +485,7 @@ rte_hash_free(struct rte_hash *h)
 	rte_free(h->key_store);
 	rte_free(h->buckets);
 	rte_free(h->buckets_ext);
+	rte_free(h->tbl_chng_cnt);
 	rte_free(h);
 	rte_free(te);
 }
@@ -530,6 +570,7 @@ rte_hash_reset(struct rte_hash *h)
 	__hash_rw_writer_lock(h);
 	memset(h->buckets, 0, h->num_buckets * sizeof(struct rte_hash_bucket));
 	memset(h->key_store, 0, h->key_entry_size * (h->entries + 1));
+	*h->tbl_chng_cnt = 0;
 
 	/* clear the free ring */
 	while (rte_ring_dequeue(h->free_slots, &ptr) == 0)
@@ -585,7 +626,9 @@ enqueue_slot_back(const struct rte_hash *h,
 		rte_ring_sp_enqueue(h->free_slots, slot_id);
 }
 
-/* Search a key from bucket and update its data */
+/* Search a key from bucket and update its data.
+ * Writer holds the lock before calling this.
+ */
 static inline int32_t
 search_and_update(const struct rte_hash *h, void *data, const void *key,
 	struct rte_hash_bucket *bkt, uint16_t sig)
@@ -598,8 +641,13 @@ search_and_update(const struct rte_hash *h, void *data, const void *key,
 			k = (struct rte_hash_key *) ((char *)keys +
 					bkt->key_idx[i] * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
-				/* Update data */
-				k->pdata = data;
+				/* 'pdata' acts as the synchronization point
+				 * when an existing hash entry is updated.
+				 * Key is not updated in this case.
+				 */
+				__atomic_store_n(&k->pdata,
+					data,
+					__ATOMIC_RELEASE);
 				/*
 				 * Return index where key is stored,
 				 * subtracting the first dummy index
@@ -655,7 +703,15 @@ rte_hash_cuckoo_insert_mw(const struct rte_hash *h,
 		/* Check if slot is available */
 		if (likely(prim_bkt->key_idx[i] == EMPTY_SLOT)) {
 			prim_bkt->sig_current[i] = sig;
-			prim_bkt->key_idx[i] = new_idx;
+			/* Key can be of arbitrary length, so it is
+			 * not possible to store it atomically.
+			 * Hence the new key element's memory stores
+			 * (key as well as data) should be complete
+			 * before it is referenced.
+			 */
+			__atomic_store_n(&prim_bkt->key_idx[i],
+					 new_idx,
+					 __ATOMIC_RELEASE);
 			break;
 		}
 	}
@@ -728,27 +784,66 @@ rte_hash_cuckoo_move_insert_mw(const struct rte_hash *h,
 		if (unlikely(&h->buckets[prev_alt_bkt_idx]
 				!= curr_bkt)) {
 			/* revert it to empty, otherwise duplicated keys */
-			curr_bkt->key_idx[curr_slot] = EMPTY_SLOT;
+			__atomic_store_n(&curr_bkt->key_idx[curr_slot],
+				EMPTY_SLOT,
+				__ATOMIC_RELEASE);
 			__hash_rw_writer_unlock(h);
 			return -1;
 		}
 
+		if (h->readwrite_concur_lf_support) {
+			/* Inform the previous move. The current move need
+			 * not be informed now as the current bucket entry
+			 * is present in both primary and secondary.
+			 * Since there is one writer, load acquires on
+			 * tbl_chng_cnt are not required.
+			 */
+			__atomic_store_n(h->tbl_chng_cnt,
+					 *h->tbl_chng_cnt + 1,
+					 __ATOMIC_RELEASE);
+			/* The stores to sig_alt and sig_current should not
+			 * move above the store to tbl_chng_cnt.
+			 */
+			__atomic_thread_fence(__ATOMIC_RELEASE);
+		}
+
 		/* Need to swap current/alt sig to allow later
 		 * Cuckoo insert to move elements back to its
 		 * primary bucket if available
 		 */
 		curr_bkt->sig_current[curr_slot] =
 			prev_bkt->sig_current[prev_slot];
-		curr_bkt->key_idx[curr_slot] =
-			prev_bkt->key_idx[prev_slot];
+		/* Release the updated bucket entry */
+		__atomic_store_n(&curr_bkt->key_idx[curr_slot],
+			prev_bkt->key_idx[prev_slot],
+			__ATOMIC_RELEASE);
 
 		curr_slot = prev_slot;
 		curr_node = prev_node;
 		curr_bkt = curr_node->bkt;
 	}
 
+	if (h->readwrite_concur_lf_support) {
+		/* Inform the previous move. The current move need
+		 * not be informed now as the current bucket entry
+		 * is present in both primary and secondary.
+		 * Since there is one writer, load acquires on
+		 * tbl_chng_cnt are not required.
+		 */
+		__atomic_store_n(h->tbl_chng_cnt,
+				 *h->tbl_chng_cnt + 1,
+				 __ATOMIC_RELEASE);
+		/* The stores to sig_alt and sig_current should not
+		 * move above the store to tbl_chng_cnt.
+		 */
+		__atomic_thread_fence(__ATOMIC_RELEASE);
+	}
+
 	curr_bkt->sig_current[curr_slot] = sig;
-	curr_bkt->key_idx[curr_slot] = new_idx;
+	/* Release the new bucket entry */
+	__atomic_store_n(&curr_bkt->key_idx[curr_slot],
+			 new_idx,
+			 __ATOMIC_RELEASE);
 
 	__hash_rw_writer_unlock(h);
 
@@ -889,8 +984,15 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 	new_idx = (uint32_t)((uintptr_t) slot_id);
 	/* Copy key */
 	rte_memcpy(new_k->key, key, h->key_len);
-	new_k->pdata = data;
-
+	/* Key can be of arbitrary length, so it is not possible to store
+	 * it atomically. Hence the new key element's memory stores
+	 * (key as well as data) should be complete before it is referenced.
+	 * 'pdata' acts as the synchronization point when an existing hash
+	 * entry is updated.
+	 */
+	__atomic_store_n(&new_k->pdata,
+		data,
+		__ATOMIC_RELEASE);
 
 	/* Find an empty slot and insert */
 	ret = rte_hash_cuckoo_insert_mw(h, prim_bkt, sec_bkt, key, data,
@@ -1034,21 +1136,27 @@ search_one_bucket(const struct rte_hash *h, const void *key, uint16_t sig,
 			void **data, const struct rte_hash_bucket *bkt)
 {
 	int i;
+	uint32_t key_idx;
+	void *pdata;
 	struct rte_hash_key *k, *keys = h->key_store;
 
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->sig_current[i] == sig &&
-				bkt->key_idx[i] != EMPTY_SLOT) {
+		key_idx = __atomic_load_n(&bkt->key_idx[i],
+					  __ATOMIC_ACQUIRE);
+		if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
-					bkt->key_idx[i] * h->key_entry_size);
+					key_idx * h->key_entry_size);
+			pdata = __atomic_load_n(&k->pdata,
+					__ATOMIC_ACQUIRE);
+
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
 				if (data != NULL)
-					*data = k->pdata;
+					*data = pdata;
 				/*
 				 * Return index where key is stored,
 				 * subtracting the first dummy index
 				 */
-				return bkt->key_idx[i] - 1;
+				return key_idx - 1;
 			}
 		}
 	}
@@ -1061,34 +1169,62 @@ __rte_hash_lookup_with_hash(const struct rte_hash *h, const void *key,
 {
 	uint32_t prim_bucket_idx, sec_bucket_idx;
 	struct rte_hash_bucket *bkt, *cur_bkt;
+	uint32_t cnt_b, cnt_a;
 	int ret;
 	uint16_t short_sig;
 
 	short_sig = get_short_sig(sig);
 	prim_bucket_idx = get_prim_bucket_index(h, sig);
 	sec_bucket_idx = get_alt_bucket_index(h, prim_bucket_idx, short_sig);
-	bkt = &h->buckets[prim_bucket_idx];
 
 	__hash_rw_reader_lock(h);
 
-	/* Check if key is in primary location */
-	ret = search_one_bucket(h, key, short_sig, data, bkt);
-	if (ret != -1) {
-		__hash_rw_reader_unlock(h);
-		return ret;
-	}
-	/* Calculate secondary hash */
-	bkt = &h->buckets[sec_bucket_idx];
+	do {
+		/* Load the table change counter before the lookup
+		 * starts. Acquire semantics will make sure that
+		 * loads in search_one_bucket are not hoisted.
+		 */
+		cnt_b = __atomic_load_n(h->tbl_chng_cnt,
+				__ATOMIC_ACQUIRE);
 
-	/* Check if key is in secondary location */
-	FOR_EACH_BUCKET(cur_bkt, bkt) {
-		ret = search_one_bucket(h, key, short_sig, data, cur_bkt);
+		/* Check if key is in primary location */
+		bkt = &h->buckets[prim_bucket_idx];
+		ret = search_one_bucket(h, key, short_sig, data, bkt);
 		if (ret != -1) {
 			__hash_rw_reader_unlock(h);
 			return ret;
 		}
-	}
+		/* Calculate secondary hash */
+		bkt = &h->buckets[sec_bucket_idx];
+
+		/* Check if key is in secondary location */
+		FOR_EACH_BUCKET(cur_bkt, bkt) {
+			ret = search_one_bucket(h, key, short_sig,
+						data, cur_bkt);
+			if (ret != -1) {
+				__hash_rw_reader_unlock(h);
+				return ret;
+			}
+		}
+
+		/* The loads of sig_current in search_one_bucket
+		 * should not move below the load from tbl_chng_cnt.
+		 */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+		/* Re-read the table change counter to check if the
+		 * table has changed during search. If yes, re-do
+		 * the search.
+		 * This load should not get hoisted. The load
+		 * acquires on cnt_b, key index in primary bucket
+		 * and key index in secondary bucket will make sure
+		 * that it does not get hoisted.
+		 */
+		cnt_a = __atomic_load_n(h->tbl_chng_cnt,
+					__ATOMIC_ACQUIRE);
+	} while (cnt_b != cnt_a);
+
 	__hash_rw_reader_unlock(h);
+
 	return -ENOENT;
 }
 
@@ -1173,21 +1309,25 @@ __rte_hash_compact_ll(struct rte_hash_bucket *cur_bkt, int pos) {
 	}
 }
 
-/* Search one bucket and remove the matched key */
+/* Search one bucket and remove the matched key.
+ * Writer is expected to hold the lock while calling this
+ * function.
+ */
 static inline int32_t
 search_and_remove(const struct rte_hash *h, const void *key,
 			struct rte_hash_bucket *bkt, uint16_t sig, int *pos)
 {
 	struct rte_hash_key *k, *keys = h->key_store;
 	unsigned int i;
-	int32_t ret;
+	uint32_t key_idx;
 
 	/* Check if key is in bucket */
 	for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-		if (bkt->sig_current[i] == sig &&
-				bkt->key_idx[i] != EMPTY_SLOT) {
+		key_idx = __atomic_load_n(&bkt->key_idx[i],
+					  __ATOMIC_ACQUIRE);
+		if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
 			k = (struct rte_hash_key *) ((char *)keys +
-					bkt->key_idx[i] * h->key_entry_size);
+					key_idx * h->key_entry_size);
 			if (rte_hash_cmp_eq(key, k->key, h) == 0) {
 				bkt->sig_current[i] = NULL_SIGNATURE;
 				/* Free the key store index if
@@ -1196,13 +1336,16 @@ search_and_remove(const struct rte_hash *h, const void *key,
 				if (!h->no_free_on_del)
 					remove_entry(h, bkt, i);
 
-				/* Return index where key is stored,
+				__atomic_store_n(&bkt->key_idx[i],
+						 EMPTY_SLOT,
+						 __ATOMIC_RELEASE);
+
+				*pos = i;
+				/*
+				 * Return index where key is stored,
 				 * subtracting the first dummy index
 				 */
-				ret = bkt->key_idx[i] - 1;
-				bkt->key_idx[i] = EMPTY_SLOT;
-				*pos = i;
-				return ret;
+				return key_idx - 1;
 			}
 		}
 	}
@@ -1402,6 +1545,8 @@ __rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
 	uint32_t prim_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
 	uint32_t sec_hitmask[RTE_HASH_LOOKUP_BULK_MAX] = {0};
 	struct rte_hash_bucket *cur_bkt, *next_bkt;
+	void *pdata[RTE_HASH_LOOKUP_BULK_MAX];
+	uint32_t cnt_b, cnt_a;
 
 	/* Prefetch first keys */
 	for (i = 0; i < PREFETCH_OFFSET && i < num_keys; i++)
@@ -1443,91 +1588,138 @@ __rte_hash_lookup_bulk(const struct rte_hash *h, const void **keys,
 	}
 
 	__hash_rw_reader_lock(h);
-	/* Compare signatures and prefetch key slot of first hit */
-	for (i = 0; i < num_keys; i++) {
-		compare_signatures(&prim_hitmask[i], &sec_hitmask[i],
+	do {
+		/* Load the table change counter before the lookup
+		 * starts. Acquire semantics will make sure that
+		 * loads in compare_signatures are not hoisted.
+		 */
+		cnt_b = __atomic_load_n(h->tbl_chng_cnt,
+					__ATOMIC_ACQUIRE);
+
+		/* Compare signatures and prefetch key slot of first hit */
+		for (i = 0; i < num_keys; i++) {
+			compare_signatures(&prim_hitmask[i], &sec_hitmask[i],
 				primary_bkt[i], secondary_bkt[i],
 				sig[i], h->sig_cmp_fn);
 
-		if (prim_hitmask[i]) {
-			uint32_t first_hit =
-					__builtin_ctzl(prim_hitmask[i]) >> 1;
-			uint32_t key_idx = primary_bkt[i]->key_idx[first_hit];
-			const struct rte_hash_key *key_slot =
-				(const struct rte_hash_key *)(
-				(const char *)h->key_store +
-				key_idx * h->key_entry_size);
-			rte_prefetch0(key_slot);
-			continue;
-		}
-
-		if (sec_hitmask[i]) {
-			uint32_t first_hit =
-					__builtin_ctzl(sec_hitmask[i]) >> 1;
-			uint32_t key_idx = secondary_bkt[i]->key_idx[first_hit];
-			const struct rte_hash_key *key_slot =
-				(const struct rte_hash_key *)(
-				(const char *)h->key_store +
-				key_idx * h->key_entry_size);
-			rte_prefetch0(key_slot);
-		}
-	}
-
-	/* Compare keys, first hits in primary first */
-	for (i = 0; i < num_keys; i++) {
-		positions[i] = -ENOENT;
-		while (prim_hitmask[i]) {
-			uint32_t hit_index =
-					__builtin_ctzl(prim_hitmask[i]) >> 1;
-
-			uint32_t key_idx = primary_bkt[i]->key_idx[hit_index];
-			const struct rte_hash_key *key_slot =
-				(const struct rte_hash_key *)(
-				(const char *)h->key_store +
-				key_idx * h->key_entry_size);
-			/*
-			 * If key index is 0, do not compare key,
-			 * as it is checking the dummy slot
-			 */
-			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
-				if (data != NULL)
-					data[i] = key_slot->pdata;
+			if (prim_hitmask[i]) {
+				uint32_t first_hit =
+						__builtin_ctzl(prim_hitmask[i])
+						>> 1;
+				uint32_t key_idx =
+					primary_bkt[i]->key_idx[first_hit];
+				const struct rte_hash_key *key_slot =
+					(const struct rte_hash_key *)(
+					(const char *)h->key_store +
+					key_idx * h->key_entry_size);
+				rte_prefetch0(key_slot);
+				continue;
+			}
 
-				hits |= 1ULL << i;
-				positions[i] = key_idx - 1;
-				goto next_key;
+			if (sec_hitmask[i]) {
+				uint32_t first_hit =
+						__builtin_ctzl(sec_hitmask[i])
+						>> 1;
+				uint32_t key_idx =
+					secondary_bkt[i]->key_idx[first_hit];
+				const struct rte_hash_key *key_slot =
+					(const struct rte_hash_key *)(
+					(const char *)h->key_store +
+					key_idx * h->key_entry_size);
+				rte_prefetch0(key_slot);
 			}
-			prim_hitmask[i] &= ~(3ULL << (hit_index << 1));
 		}
 
-		while (sec_hitmask[i]) {
-			uint32_t hit_index =
-					__builtin_ctzl(sec_hitmask[i]) >> 1;
-
-			uint32_t key_idx = secondary_bkt[i]->key_idx[hit_index];
-			const struct rte_hash_key *key_slot =
-				(const struct rte_hash_key *)(
-				(const char *)h->key_store +
-				key_idx * h->key_entry_size);
-			/*
-			 * If key index is 0, do not compare key,
-			 * as it is checking the dummy slot
-			 */
+		/* Compare keys, first hits in primary first */
+		for (i = 0; i < num_keys; i++) {
+			positions[i] = -ENOENT;
+			while (prim_hitmask[i]) {
+				uint32_t hit_index =
+						__builtin_ctzl(prim_hitmask[i])
+						>> 1;
+				uint32_t key_idx =
+				__atomic_load_n(
+					&primary_bkt[i]->key_idx[hit_index],
+					__ATOMIC_ACQUIRE);
+				const struct rte_hash_key *key_slot =
+					(const struct rte_hash_key *)(
+					(const char *)h->key_store +
+					key_idx * h->key_entry_size);
+
+				if (key_idx != EMPTY_SLOT)
+					pdata[i] = __atomic_load_n(
+							&key_slot->pdata,
+							__ATOMIC_ACQUIRE);
+				/*
+				 * If key index is 0, do not compare key,
+				 * as it is checking the dummy slot
+				 */
+				if (!!key_idx &
+					!rte_hash_cmp_eq(
+						key_slot->key, keys[i], h)) {
+					if (data != NULL)
+						data[i] = pdata[i];
+
+					hits |= 1ULL << i;
+					positions[i] = key_idx - 1;
+					goto next_key;
+				}
+				prim_hitmask[i] &= ~(3ULL << (hit_index << 1));
+			}
 
-			if (!!key_idx & !rte_hash_cmp_eq(key_slot->key, keys[i], h)) {
-				if (data != NULL)
-					data[i] = key_slot->pdata;
+			while (sec_hitmask[i]) {
+				uint32_t hit_index =
+						__builtin_ctzl(sec_hitmask[i])
+						>> 1;
+				uint32_t key_idx =
+				__atomic_load_n(
+					&secondary_bkt[i]->key_idx[hit_index],
+					__ATOMIC_ACQUIRE);
+				const struct rte_hash_key *key_slot =
+					(const struct rte_hash_key *)(
+					(const char *)h->key_store +
+					key_idx * h->key_entry_size);
+
+				if (key_idx != EMPTY_SLOT)
+					pdata[i] = __atomic_load_n(
+							&key_slot->pdata,
+							__ATOMIC_ACQUIRE);
+				/*
+				 * If key index is 0, do not compare key,
+				 * as it is checking the dummy slot
+				 */
 
-				hits |= 1ULL << i;
-				positions[i] = key_idx - 1;
-				goto next_key;
+				if (!!key_idx &
+					!rte_hash_cmp_eq(
+						key_slot->key, keys[i], h)) {
+					if (data != NULL)
+						data[i] = pdata[i];
+
+					hits |= 1ULL << i;
+					positions[i] = key_idx - 1;
+					goto next_key;
+				}
+				sec_hitmask[i] &= ~(3ULL << (hit_index << 1));
 			}
-			sec_hitmask[i] &= ~(3ULL << (hit_index << 1));
+next_key:
+			continue;
 		}
 
-next_key:
-		continue;
-	}
+		/* The loads of sig_current in compare_signatures
+		 * should not move below the load from tbl_chng_cnt.
+		 */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+		/* Re-read the table change counter to check if the
+		 * table has changed during search. If yes, re-do
+		 * the search.
+		 * This load should not get hoisted. The load
+		 * acquires on cnt_b, primary key index and secondary
+		 * key index will make sure that it does not get
+		 * hoisted.
+		 */
+		cnt_a = __atomic_load_n(h->tbl_chng_cnt,
+					__ATOMIC_ACQUIRE);
+	} while (cnt_b != cnt_a);
 
 	/* all found, do not need to go through ext bkt */
 	if ((hits == ((1ULL << num_keys) - 1)) || !h->ext_table_support) {
@@ -1612,7 +1804,8 @@ rte_hash_iterate(const struct rte_hash *h, const void **key, void **data, uint32
 	idx = *next % RTE_HASH_BUCKET_ENTRIES;
 
 	/* If current position is empty, go to the next one */
-	while ((position = h->buckets[bucket_idx].key_idx[idx]) == EMPTY_SLOT) {
+	while ((position = __atomic_load_n(&h->buckets[bucket_idx].key_idx[idx],
+					__ATOMIC_ACQUIRE)) == EMPTY_SLOT) {
 		(*next)++;
 		/* End of table */
 		if (*next == total_entries_main)
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index 601b2ce..5dfbbc4 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2016 Intel Corporation
+ * Copyright(c) 2018 Arm Limited
  */
 
 /* rte_cuckoo_hash.h
@@ -174,6 +175,8 @@ struct rte_hash {
 	 * free the key index associated with the deleted entry.
 	 * This flag is enabled by default.
 	 */
+	uint8_t readwrite_concur_lf_support;
+	/**< If read-write concurrency lock free support is enabled */
 	uint8_t writer_takes_lock;
 	/**< Indicates if the writer threads need to take lock */
 	rte_hash_function hash_func;    /**< Function used to calculate hash. */
@@ -196,6 +199,8 @@ struct rte_hash {
 	rte_rwlock_t *readwrite_lock; /**< Read-write lock thread-safety. */
 	struct rte_hash_bucket *buckets_ext; /**< Extra buckets array */
 	struct rte_ring *free_ext_bkts; /**< Ring of indexes of free buckets */
+	uint32_t *tbl_chng_cnt;
+	/**< Indicates if the hash table changed from last read. */
 } __rte_cache_aligned;
 
 struct queue_node {
diff --git a/lib/librte_hash/rte_hash.h b/lib/librte_hash/rte_hash.h
index dfa542b..c93d1a1 100644
--- a/lib/librte_hash/rte_hash.h
+++ b/lib/librte_hash/rte_hash.h
@@ -44,9 +44,18 @@ extern "C" {
 
 /** Flag to disable freeing of key index on hash delete.
  * Refer to rte_hash_del_xxx APIs for more details.
+ * This is enabled by default when RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF
+ * is enabled.
  */
 #define RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL 0x10
 
+/** Flag to support lock free reader writer concurrency. Both single writer
+ * and multi writer use cases are supported.
+ * Currently, extendable bucket table feature is not supported with
+ * this feature.
+ */
+#define RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF 0x20
+
 /**
  * The type of hash value of a key.
  * It should be a value of at least 32bit with fully random pattern.
@@ -132,7 +141,12 @@ void
 rte_hash_free(struct rte_hash *h);
 
 /**
- * Reset all hash structure, by zeroing all entries
+ * Reset all hash structure, by zeroing all entries.
+ * When RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF is enabled,
+ * it is application's responsibility to make sure that
+ * none of the readers are referencing the hash table
+ * while calling this API.
+ *
  * @param h
  *   Hash table to reset
  */
@@ -156,6 +170,11 @@ rte_hash_count(const struct rte_hash *h);
  * and should only be called from one thread by default.
  * Thread safety can be enabled by setting flag during
  * table creation.
+ * If the key exists already in the table, this API updates its value
+ * with 'data' passed in this API. It is the responsibility of
+ * the application to manage any memory associated with the old value.
+ * The readers might still be using the old value even after this API
+ * has returned.
  *
  * @param h
  *   Hash table to add the key to.
@@ -178,6 +197,11 @@ rte_hash_add_key_data(const struct rte_hash *h, const void *key, void *data);
  * and should only be called from one thread by default.
  * Thread safety can be enabled by setting flag during
  * table creation.
+ * If the key exists already in the table, this API updates its value
+ * with 'data' passed in this API. It is the responsibility of
+ * the application to manage any memory associated with the old value.
+ * The readers might still be using the old value even after this API
+ * has returned.
  *
  * @param h
  *   Hash table to add the key to.
@@ -243,10 +267,14 @@ rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key, hash_sig_t
  * and should only be called from one thread by default.
  * Thread safety can be enabled by setting flag during
  * table creation.
- * If RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL is enabled,
+ * If RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL or
+ * RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF is enabled,
  * the key index returned by rte_hash_add_key_xxx APIs will not be
  * freed by this API. rte_hash_free_key_with_position API must be called
  * additionally to free the index associated with the key.
+ * rte_hash_free_key_with_position API should be called after all
+ * the readers have stopped referencing the entry corresponding to
+ * this key. RCU mechanisms could be used to determine such a state.
  *
  * @param h
  *   Hash table to remove the key from.
@@ -268,10 +296,14 @@ rte_hash_del_key(const struct rte_hash *h, const void *key);
  * and should only be called from one thread by default.
  * Thread safety can be enabled by setting flag during
  * table creation.
- * If RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL is enabled,
+ * If RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL or
+ * RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF is enabled,
  * the key index returned by rte_hash_add_key_xxx APIs will not be
  * freed by this API. rte_hash_free_key_with_position API must be called
  * additionally to free the index associated with the key.
+ * rte_hash_free_key_with_position API should be called after all
+ * the readers have stopped referencing the entry corresponding to
+ * this key. RCU mechanisms could be used to determine such a state.
  *
  * @param h
  *   Hash table to remove the key from.
@@ -318,9 +350,12 @@ rte_hash_get_key_with_position(const struct rte_hash *h, const int32_t position,
  * of the key. This operation is not multi-thread safe and should
  * only be called from one thread by default. Thread safety
  * can be enabled by setting flag during table creation.
- * If RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL is enabled,
- * this API must be called, with the key index returned by rte_hash_add_key_xxx
- * APIs, after the key is deleted using rte_hash_del_key_xxx APIs.
+ * If RTE_HASH_EXTRA_FLAGS_NO_FREE_ON_DEL or
+ * RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF is enabled,
+ * the key index returned by rte_hash_del_key_xxx APIs must be freed
+ * using this API. This API should be called after all the readers
+ * have stopped referencing the entry corresponding to this key.
+ * RCU mechanisms could be used to determine such a state.
  * This API does not validate if the key is already freed.
  *
  * @param h
-- 
2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [dpdk-dev] [PATCH v7 5/5] test/hash: read-write lock-free concurrency test
  2018-10-26  5:37 [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Honnappa Nagarahalli
                   ` (3 preceding siblings ...)
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency Honnappa Nagarahalli
@ 2018-10-26  5:37 ` Honnappa Nagarahalli
  2018-10-26 10:52 ` [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Thomas Monjalon
  5 siblings, 0 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-10-26  5:37 UTC (permalink / raw)
  To: bruce.richardson, pablo.de.lara.guarch
  Cc: dev, yipeng1.wang, honnappa.nagarahalli, dharmik.thakkar, gavin.hu, nd

From: Dharmik Thakkar <dharmik.thakkar@arm.com>

Unit tests to check for hash lookup and bulk-lookup perf
with lock-free enabled and with lock-free disabled.
Unit tests performed with readers running in parallel with writers.

Tests include:

- hash lookup on existing keys with:
  - hash add causing NO key-shifts of existing keys in the table

- hash lookup on existing keys likely to be on shift-path with:
  - hash add causing key-shifts of existing keys in the table

- hash lookup on existing keys NOT likely to be on shift-path with:
  - hash add causing key-shifts of existing keys in the table

- hash lookup on non-existing keys with:
  - hash add causing NO key-shifts of existing keys in the table
  - hash add causing key-shifts of existing keys in the table

- hash lookup on keys likely to be on shift-path with:
  - multiple writers causing key-shifts of existing keys in the table

Signed-off-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Yipeng Wang <yipeng1.wang@intel.com>
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
---
 test/test/Makefile                 |    1 +
 test/test/autotest_data.py         |    6 +
 test/test/meson.build              |    2 +
 test/test/test_hash_readwrite_lf.c | 1220 ++++++++++++++++++++++++++++++++++++
 4 files changed, 1229 insertions(+)
 create mode 100644 test/test/test_hash_readwrite_lf.c

diff --git a/test/test/Makefile b/test/test/Makefile
index 8c347e4..1b5c070 100644
--- a/test/test/Makefile
+++ b/test/test/Makefile
@@ -116,6 +116,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_functions.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_scaling.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_multiwriter.c
 SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_readwrite.c
+SRCS-$(CONFIG_RTE_LIBRTE_HASH) += test_hash_readwrite_lf.c
 
 SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_lpm.c
 SRCS-$(CONFIG_RTE_LIBRTE_LPM) += test_lpm_perf.c
diff --git a/test/test/autotest_data.py b/test/test/autotest_data.py
index 51f8e16..4eae588 100644
--- a/test/test/autotest_data.py
+++ b/test/test/autotest_data.py
@@ -580,6 +580,12 @@
         "Report":  None,
     },
     {
+        "Name":    "Hash read-write lock-free concurrency autotest",
+        "Command": "hash_readwrite_lf_autotest",
+        "Func":    default_autotest,
+        "Report":  None,
+    },
+    {
         "Name":       "Power autotest",
         "Command":    "power_autotest",
         "Func":       default_autotest,
diff --git a/test/test/meson.build b/test/test/meson.build
index 90d6607..faef5a4 100644
--- a/test/test/meson.build
+++ b/test/test/meson.build
@@ -46,6 +46,7 @@ test_sources = files('commands.c',
 	'test_hash_multiwriter.c',
 	'test_hash_readwrite.c',
 	'test_hash_perf.c',
+	'test_hash_readwrite_lf.c',
 	'test_hash_scaling.c',
 	'test_interrupts.c',
 	'test_kni.c',
@@ -174,6 +175,7 @@ test_names = [
 	'hash_functions_autotest',
 	'hash_multiwriter_autotest',
 	'hash_perf_autotest',
+	'hash_readwrite_lf_autotest',
 	'interrupt_autotest',
 	'kni_autotest',
 	'kvargs_autotest',
diff --git a/test/test/test_hash_readwrite_lf.c b/test/test/test_hash_readwrite_lf.c
new file mode 100644
index 0000000..cbfd932
--- /dev/null
+++ b/test/test/test_hash_readwrite_lf.c
@@ -0,0 +1,1220 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2018 Arm Limited
+ */
+
+#include <inttypes.h>
+#include <locale.h>
+
+#include <rte_cycles.h>
+#include <rte_hash.h>
+#include <rte_hash_crc.h>
+#include <rte_jhash.h>
+#include <rte_launch.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+#include <rte_spinlock.h>
+
+#include "test.h"
+
+#ifndef RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF
+#define RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF 0
+#endif
+
+#define BULK_LOOKUP_SIZE 32
+
+#define RUN_WITH_HTM_DISABLED 0
+
+#if (RUN_WITH_HTM_DISABLED)
+
+#define TOTAL_ENTRY (5*1024)
+#define TOTAL_INSERT (5*1024)
+
+#else
+
+#define TOTAL_ENTRY (4*1024*1024)
+#define TOTAL_INSERT (4*1024*1024)
+
+#endif
+
+#define READ_FAIL 1
+#define READ_PASS_NO_KEY_SHIFTS 2
+#define READ_PASS_SHIFT_PATH 4
+#define READ_PASS_NON_SHIFT_PATH 8
+#define BULK_LOOKUP 16
+#define NUM_TEST 3
+unsigned int rwc_core_cnt[NUM_TEST] = {1, 2, 4};
+
+struct rwc_perf {
+	uint32_t w_no_ks_r_hit[2][NUM_TEST];
+	uint32_t w_no_ks_r_miss[2][NUM_TEST];
+	uint32_t w_ks_r_hit_nsp[2][NUM_TEST];
+	uint32_t w_ks_r_hit_sp[2][NUM_TEST];
+	uint32_t w_ks_r_miss[2][NUM_TEST];
+	uint32_t multi_rw[NUM_TEST - 1][2][NUM_TEST];
+};
+
+static struct rwc_perf rwc_lf_results, rwc_non_lf_results;
+
+struct {
+	uint32_t *keys;
+	uint32_t *keys_no_ks;
+	uint32_t *keys_ks;
+	uint32_t *keys_absent;
+	uint32_t *keys_shift_path;
+	uint32_t *keys_non_shift_path;
+	uint32_t count_keys_no_ks;
+	uint32_t count_keys_ks;
+	uint32_t count_keys_absent;
+	uint32_t count_keys_shift_path;
+	uint32_t count_keys_non_shift_path;
+	uint32_t single_insert;
+	struct rte_hash *h;
+} tbl_rwc_test_param;
+
+static rte_atomic64_t gread_cycles;
+static rte_atomic64_t greads;
+
+static volatile uint8_t writer_done;
+static volatile uint8_t multi_writer_done[4];
+
+uint16_t enabled_core_ids[RTE_MAX_LCORE];
+
+uint8_t *scanned_bkts;
+
+static inline int
+get_enabled_cores_list(void)
+{
+	uint32_t i = 0;
+	uint16_t core_id;
+	uint32_t max_cores = rte_lcore_count();
+	for (core_id = 0; core_id < RTE_MAX_LCORE && i < max_cores; core_id++) {
+		if (rte_lcore_is_enabled(core_id)) {
+			enabled_core_ids[i] = core_id;
+			i++;
+		}
+	}
+
+	if (i != max_cores) {
+		printf("Number of enabled cores in list is different from "
+			"number given by rte_lcore_count()\n");
+		return -1;
+	}
+	return 0;
+}
+
+static inline int
+check_bucket(uint32_t bkt_idx, uint32_t key)
+{
+	uint32_t iter;
+	uint32_t prev_iter;
+	uint32_t diff;
+	uint32_t count = 0;
+	const void *next_key;
+	void *next_data;
+
+	/* Temporary bucket to hold the keys */
+	uint32_t keys_in_bkt[8];
+
+	iter = bkt_idx * 8;
+	prev_iter = iter;
+	while (rte_hash_iterate(tbl_rwc_test_param.h,
+			&next_key, &next_data, &iter) >= 0) {
+
+		/* Check for duplicate entries */
+		if (*(const uint32_t *)next_key == key)
+			return 1;
+
+		/* Identify if there is any free entry in the bucket */
+		diff = iter - prev_iter;
+		if (diff > 1)
+			break;
+
+		prev_iter = iter;
+		keys_in_bkt[count] = *(const uint32_t *)next_key;
+		count++;
+
+		/* All entries in the bucket are occupied */
+		if (count == 8) {
+
+			/*
+			 * Check if bucket was not scanned before, to avoid
+			 * duplicate keys.
+			 */
+			if (scanned_bkts[bkt_idx] == 0) {
+				/*
+				 * Since this bucket (pointed to by bkt_idx) is
+				 * full, it is likely that key(s) in this
+				 * bucket will be on the shift path, when
+				 * collision occurs. Thus, add it to
+				 * keys_shift_path.
+				 */
+				memcpy(tbl_rwc_test_param.keys_shift_path +
+					tbl_rwc_test_param.count_keys_shift_path
+					, keys_in_bkt, 32);
+				tbl_rwc_test_param.count_keys_shift_path += 8;
+				scanned_bkts[bkt_idx] = 1;
+			}
+			return -1;
+		}
+	}
+	return 0;
+}
+
+static int
+generate_keys(void)
+{
+	uint32_t *keys = NULL;
+	uint32_t *keys_no_ks = NULL;
+	uint32_t *keys_ks = NULL;
+	uint32_t *keys_absent = NULL;
+	uint32_t *keys_non_shift_path = NULL;
+	uint32_t *found = NULL;
+	uint32_t count_keys_no_ks = 0;
+	uint32_t count_keys_ks = 0;
+	uint32_t i;
+
+	/*
+	 * keys will consist of a) keys whose addition to the hash table
+	 * will result in shifting of the existing keys to their alternate
+	 * locations b) keys whose addition to the hash table will not result
+	 * in shifting of the existing keys.
+	 */
+	keys = rte_malloc(NULL, sizeof(uint32_t) * TOTAL_INSERT, 0);
+	if (keys == NULL) {
+		printf("RTE_MALLOC failed\n");
+		goto err;
+	}
+
+	/*
+	 * keys_no_ks (no key-shifts): Subset of 'keys' - consists of keys  that
+	 * will NOT result in shifting of the existing keys to their alternate
+	 * locations. Roughly around 900K keys.
+	 */
+	keys_no_ks = rte_malloc(NULL, sizeof(uint32_t) * TOTAL_INSERT, 0);
+	if (keys_no_ks == NULL) {
+		printf("RTE_MALLOC failed\n");
+		goto err;
+	}
+
+	/*
+	 * keys_ks (key-shifts): Subset of 'keys' - consists of keys that will
+	 * result in shifting of the existing keys to their alternate locations.
+	 * Roughly around 146K keys. There might be repeating keys. More code is
+	 * required to filter out these keys which will complicate the test case
+	 */
+	keys_ks = rte_malloc(NULL, sizeof(uint32_t) * TOTAL_INSERT, 0);
+	if (keys_ks == NULL) {
+		printf("RTE_MALLOC failed\n");
+		goto err;
+	}
+
+	/* Used to identify keys not inserted in the hash table */
+	found = rte_zmalloc(NULL, sizeof(uint32_t) * TOTAL_INSERT, 0);
+	if (found == NULL) {
+		printf("RTE_MALLOC failed\n");
+		goto err;
+	}
+
+	/*
+	 * This consist of keys not inserted to the hash table.
+	 * Used to test perf of lookup on keys that do not exist in the table.
+	 */
+	keys_absent = rte_malloc(NULL, sizeof(uint32_t) * TOTAL_INSERT, 0);
+	if (keys_absent == NULL) {
+		printf("RTE_MALLOC failed\n");
+		goto err;
+	}
+
+	/*
+	 * This consist of keys which are likely to be on the shift
+	 * path (i.e. being moved to alternate location), when collision occurs
+	 * on addition of a key to an already full primary bucket.
+	 * Used to test perf of lookup on keys that are on the shift path.
+	 */
+	tbl_rwc_test_param.keys_shift_path = rte_malloc(NULL, sizeof(uint32_t) *
+							TOTAL_INSERT, 0);
+	if (tbl_rwc_test_param.keys_shift_path == NULL) {
+		printf("RTE_MALLOC failed\n");
+		goto err;
+	}
+
+	/*
+	 * This consist of keys which are never on the shift
+	 * path (i.e. being moved to alternate location), when collision occurs
+	 * on addition of a key to an already full primary bucket.
+	 * Used to test perf of lookup on keys that are not on the shift path.
+	 */
+	keys_non_shift_path = rte_malloc(NULL, sizeof(uint32_t) * TOTAL_INSERT,
+					 0);
+	if (keys_non_shift_path == NULL) {
+		printf("RTE_MALLOC failed\n");
+		goto err;
+	}
+
+
+	hash_sig_t sig;
+	uint32_t prim_bucket_idx;
+	int ret;
+	uint32_t num_buckets;
+	uint32_t bucket_bitmask;
+	num_buckets  = rte_align32pow2(TOTAL_ENTRY) / 8;
+	bucket_bitmask = num_buckets - 1;
+
+	/*
+	 * Used to mark bkts in which at least one key was shifted to its
+	 * alternate location
+	 */
+	scanned_bkts = rte_malloc(NULL, sizeof(uint8_t) * num_buckets, 0);
+	if (scanned_bkts == NULL) {
+		printf("RTE_MALLOC failed\n");
+		goto err;
+	}
+
+	tbl_rwc_test_param.keys = keys;
+	tbl_rwc_test_param.keys_no_ks = keys_no_ks;
+	tbl_rwc_test_param.keys_ks = keys_ks;
+	tbl_rwc_test_param.keys_absent = keys_absent;
+	tbl_rwc_test_param.keys_non_shift_path = keys_non_shift_path;
+	/* Generate keys by adding previous two keys, neglect overflow */
+	printf("Generating keys...\n");
+	keys[0] = 0;
+	keys[1] = 1;
+	for (i = 2; i < TOTAL_INSERT; i++)
+		keys[i] = keys[i-1] + keys[i-2];
+
+	/* Segregate keys into keys_no_ks and keys_ks */
+	for (i = 0; i < TOTAL_INSERT; i++) {
+		/* Check if primary bucket has space.*/
+		sig = rte_hash_hash(tbl_rwc_test_param.h,
+					tbl_rwc_test_param.keys+i);
+		prim_bucket_idx = sig & bucket_bitmask;
+		ret = check_bucket(prim_bucket_idx, keys[i]);
+		if (ret < 0) {
+			/*
+			 * Primary bucket is full, this key will result in
+			 * shifting of the keys to their alternate locations.
+			 */
+			keys_ks[count_keys_ks] = keys[i];
+			count_keys_ks++;
+		} else if (ret == 0) {
+			/*
+			 * Primary bucket has space, this key will not result in
+			 * shifting of the keys. Hence, add key to the table.
+			 */
+			ret = rte_hash_add_key_data(tbl_rwc_test_param.h,
+							keys+i,
+							(void *)((uintptr_t)i));
+			if (ret < 0) {
+				printf("writer failed %"PRIu32"\n", i);
+				break;
+			}
+			keys_no_ks[count_keys_no_ks] = keys[i];
+			count_keys_no_ks++;
+		}
+	}
+
+	for (i = 0; i < count_keys_no_ks; i++) {
+		/*
+		 * Identify keys in keys_no_ks with value less than
+		 * 4M (HTM enabled) OR 5K (HTM disabled)
+		 */
+		if (keys_no_ks[i] < TOTAL_INSERT)
+			found[keys_no_ks[i]]++;
+	}
+
+	for (i = 0; i < count_keys_ks; i++) {
+		/*
+		 * Identify keys in keys_ks with value less than
+		 * 4M (HTM enabled) OR 5K (HTM disabled)
+		 */
+		if (keys_ks[i] < TOTAL_INSERT)
+			found[keys_ks[i]]++;
+	}
+
+	uint32_t count_keys_absent = 0;
+	for (i = 0; i < TOTAL_INSERT; i++) {
+		/*
+		 * Identify missing keys between 0 and
+		 * 4M (HTM enabled) OR 5K (HTM disabled)
+		 */
+		if (found[i] == 0)
+			keys_absent[count_keys_absent++] = i;
+	}
+
+	/* Find keys that will not be on the shift path */
+	uint32_t iter;
+	const void *next_key;
+	void *next_data;
+	uint32_t count = 0;
+	for (i = 0; i < num_buckets; i++) {
+		/* Check bucket for no keys shifted to alternate locations */
+		if (scanned_bkts[i] == 0) {
+			iter = i * 8;
+			while (rte_hash_iterate(tbl_rwc_test_param.h,
+				&next_key, &next_data, &iter) >= 0) {
+
+				/* Check if key belongs to the current bucket */
+				if (i >= (iter-1)/8)
+					keys_non_shift_path[count++]
+						= *(const uint32_t *)next_key;
+				else
+					break;
+			}
+		}
+	}
+
+	tbl_rwc_test_param.count_keys_no_ks = count_keys_no_ks;
+	tbl_rwc_test_param.count_keys_ks = count_keys_ks;
+	tbl_rwc_test_param.count_keys_absent = count_keys_absent;
+	tbl_rwc_test_param.count_keys_non_shift_path = count;
+
+	printf("\nCount of keys NOT causing shifting of existing keys to "
+	"alternate location: %d\n", tbl_rwc_test_param.count_keys_no_ks);
+	printf("\nCount of keys causing shifting of existing keys to alternate "
+		"locations: %d\n\n", tbl_rwc_test_param.count_keys_ks);
+	printf("Count of absent keys that will never be added to the hash "
+		"table: %d\n\n", tbl_rwc_test_param.count_keys_absent);
+	printf("Count of keys likely to be on the shift path: %d\n\n",
+	       tbl_rwc_test_param.count_keys_shift_path);
+	printf("Count of keys not likely to be on the shift path: %d\n\n",
+	       tbl_rwc_test_param.count_keys_non_shift_path);
+
+	rte_free(found);
+	rte_hash_free(tbl_rwc_test_param.h);
+	return 0;
+
+err:
+	rte_free(keys);
+	rte_free(keys_no_ks);
+	rte_free(keys_ks);
+	rte_free(keys_absent);
+	rte_free(found);
+	rte_free(tbl_rwc_test_param.keys_shift_path);
+	rte_free(scanned_bkts);
+	return -1;
+}
+
+static int
+init_params(int rwc_lf, int use_jhash, int htm)
+{
+	struct rte_hash *handle;
+
+	struct rte_hash_parameters hash_params = {
+		.entries = TOTAL_ENTRY,
+		.key_len = sizeof(uint32_t),
+		.hash_func_init_val = 0,
+		.socket_id = rte_socket_id(),
+	};
+
+	if (use_jhash)
+		hash_params.hash_func = rte_jhash;
+	else
+		hash_params.hash_func = rte_hash_crc;
+
+	if (rwc_lf)
+		hash_params.extra_flag =
+			RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF |
+			RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD;
+	else if (htm)
+		hash_params.extra_flag =
+			RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT |
+			RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY |
+			RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD;
+	else
+		hash_params.extra_flag =
+			RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY |
+			RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD;
+
+	hash_params.name = "tests";
+
+	handle = rte_hash_create(&hash_params);
+	if (handle == NULL) {
+		printf("hash creation failed");
+		return -1;
+	}
+
+	tbl_rwc_test_param.h = handle;
+	return 0;
+}
+
+static int
+test_rwc_reader(__attribute__((unused)) void *arg)
+{
+	uint32_t i, j;
+	int ret;
+	uint64_t begin, cycles;
+	uint32_t loop_cnt = 0;
+	uint8_t read_type = (uint8_t)((uintptr_t)arg);
+	uint32_t read_cnt;
+	uint32_t *keys;
+	uint32_t extra_keys;
+	int32_t *pos;
+	void *temp_a[BULK_LOOKUP_SIZE];
+
+	/* Used to identify keys not inserted in the hash table */
+	pos = rte_zmalloc(NULL, sizeof(uint32_t) * BULK_LOOKUP_SIZE, 0);
+	if (pos == NULL) {
+		printf("RTE_MALLOC failed\n");
+		return -1;
+	}
+
+	if (read_type & READ_FAIL) {
+		keys = tbl_rwc_test_param.keys_absent;
+		read_cnt = tbl_rwc_test_param.count_keys_absent;
+	} else if (read_type & READ_PASS_NO_KEY_SHIFTS) {
+		keys = tbl_rwc_test_param.keys_no_ks;
+		read_cnt = tbl_rwc_test_param.count_keys_no_ks;
+	} else if (read_type & READ_PASS_SHIFT_PATH) {
+		keys = tbl_rwc_test_param.keys_shift_path;
+		read_cnt = tbl_rwc_test_param.count_keys_shift_path;
+	} else {
+		keys = tbl_rwc_test_param.keys_non_shift_path;
+		read_cnt = tbl_rwc_test_param.count_keys_non_shift_path;
+	}
+
+	extra_keys = read_cnt & (BULK_LOOKUP_SIZE - 1);
+
+	begin = rte_rdtsc_precise();
+	do {
+		if (read_type & BULK_LOOKUP) {
+			for (i = 0; i < (read_cnt - extra_keys);
+			     i += BULK_LOOKUP_SIZE) {
+				/* Array of  pointer to the list of keys */
+				for (j = 0; j < BULK_LOOKUP_SIZE; j++)
+					temp_a[j] = keys + i + j;
+
+				rte_hash_lookup_bulk(tbl_rwc_test_param.h,
+						   (const void **)
+						   ((uintptr_t)temp_a),
+						   BULK_LOOKUP_SIZE, pos);
+				/* Validate lookup result */
+				for (j = 0; j < BULK_LOOKUP_SIZE; j++)
+					if ((read_type & READ_FAIL &&
+					     pos[j] != -ENOENT) ||
+					    (!(read_type & READ_FAIL) &&
+					     pos[j] == -ENOENT)) {
+						printf("lookup failed!"
+						       "%"PRIu32"\n",
+						       keys[i + j]);
+						return -1;
+					}
+			}
+			for (j = 0; j < extra_keys; j++)
+				temp_a[j] = keys + i + j;
+
+			rte_hash_lookup_bulk(tbl_rwc_test_param.h,
+					   (const void **)
+					   ((uintptr_t)temp_a),
+					   extra_keys, pos);
+			for (j = 0; j < extra_keys; j++)
+				if ((read_type & READ_FAIL &&
+				     pos[j] != -ENOENT) ||
+				    (!(read_type & READ_FAIL) &&
+				     pos[j] == -ENOENT)) {
+					printf("lookup failed! %"PRIu32"\n",
+					       keys[i + j]);
+					return -1;
+				}
+		} else {
+			for (i = 0; i < read_cnt; i++) {
+				ret = rte_hash_lookup
+					(tbl_rwc_test_param.h, keys + i);
+				if (((read_type & READ_FAIL) &&
+				     (ret != -ENOENT)) ||
+				    (!(read_type & READ_FAIL) &&
+					ret == -ENOENT)) {
+					printf("lookup failed! %"PRIu32"\n",
+					       keys[i]);
+					return -1;
+				}
+			}
+		}
+		loop_cnt++;
+	} while (!writer_done);
+
+	cycles = rte_rdtsc_precise() - begin;
+	rte_atomic64_add(&gread_cycles, cycles);
+	rte_atomic64_add(&greads, read_cnt*loop_cnt);
+	return 0;
+}
+
+static int
+write_keys(uint8_t key_shift)
+{
+	uint32_t i;
+	int ret;
+	uint32_t key_cnt;
+	uint32_t *keys;
+	if (key_shift) {
+		key_cnt = tbl_rwc_test_param.count_keys_ks;
+		keys = tbl_rwc_test_param.keys_ks;
+	} else {
+		key_cnt = tbl_rwc_test_param.count_keys_no_ks;
+		keys = tbl_rwc_test_param.keys_no_ks;
+	}
+	for (i = 0; i < key_cnt; i++) {
+		ret = rte_hash_add_key(tbl_rwc_test_param.h, keys + i);
+		if (!key_shift && ret < 0) {
+			printf("writer failed %"PRIu32"\n", i);
+			return -1;
+		}
+	}
+	return 0;
+}
+
+static int
+test_rwc_multi_writer(__attribute__((unused)) void *arg)
+{
+	uint32_t i, offset;
+	uint32_t pos_core = (uint32_t)((uintptr_t)arg);
+	offset = pos_core * tbl_rwc_test_param.single_insert;
+	for (i = offset; i < offset + tbl_rwc_test_param.single_insert; i++)
+		rte_hash_add_key(tbl_rwc_test_param.h,
+				 tbl_rwc_test_param.keys_ks + i);
+	multi_writer_done[pos_core] = 1;
+	return 0;
+}
+
+/*
+ * Test lookup perf:
+ * Reader(s) lookup keys present in the table.
+ */
+static int
+test_hash_add_no_ks_lookup_hit(struct rwc_perf *rwc_perf_results, int rwc_lf,
+				int htm)
+{
+	unsigned int n, m;
+	uint64_t i;
+	int use_jhash = 0;
+	uint8_t key_shift = 0;
+	uint8_t read_type = READ_PASS_NO_KEY_SHIFTS;
+
+	rte_atomic64_init(&greads);
+	rte_atomic64_init(&gread_cycles);
+
+	if (init_params(rwc_lf, use_jhash, htm) != 0)
+		goto err;
+	printf("\nTest: Hash add - no key-shifts, read - hit\n");
+	for (m = 0; m < 2; m++) {
+		if (m == 1) {
+			printf("\n** With bulk-lookup **\n");
+			read_type |= BULK_LOOKUP;
+		}
+		for (n = 0; n < NUM_TEST; n++) {
+			unsigned int tot_lcore = rte_lcore_count();
+			if (tot_lcore < rwc_core_cnt[n] + 1)
+				goto finish;
+
+			printf("\nNumber of readers: %u\n", rwc_core_cnt[n]);
+
+			rte_atomic64_clear(&greads);
+			rte_atomic64_clear(&gread_cycles);
+
+			rte_hash_reset(tbl_rwc_test_param.h);
+			writer_done = 0;
+			if (write_keys(key_shift) < 0)
+				goto err;
+			writer_done = 1;
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				rte_eal_remote_launch(test_rwc_reader,
+						(void *)(uintptr_t)read_type,
+							enabled_core_ids[i]);
+			rte_eal_mp_wait_lcore();
+
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				if (lcore_config[i].ret < 0)
+					goto err;
+
+			unsigned long long cycles_per_lookup =
+				rte_atomic64_read(&gread_cycles) /
+				rte_atomic64_read(&greads);
+			rwc_perf_results->w_no_ks_r_hit[m][n]
+						= cycles_per_lookup;
+			printf("Cycles per lookup: %llu\n", cycles_per_lookup);
+		}
+	}
+
+finish:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return 0;
+
+err:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return -1;
+}
+
+/*
+ * Test lookup perf:
+ * Reader(s) lookup keys absent in the table while
+ * 'Main' thread adds with no key-shifts.
+ */
+static int
+test_hash_add_no_ks_lookup_miss(struct rwc_perf *rwc_perf_results, int rwc_lf,
+				int htm)
+{
+	unsigned int n, m;
+	uint64_t i;
+	int use_jhash = 0;
+	uint8_t key_shift = 0;
+	uint8_t read_type = READ_FAIL;
+	int ret;
+
+	rte_atomic64_init(&greads);
+	rte_atomic64_init(&gread_cycles);
+
+	if (init_params(rwc_lf, use_jhash, htm) != 0)
+		goto err;
+	printf("\nTest: Hash add - no key-shifts, Hash lookup - miss\n");
+	for (m = 0; m < 2; m++) {
+		if (m == 1) {
+			printf("\n** With bulk-lookup **\n");
+			read_type |= BULK_LOOKUP;
+		}
+		for (n = 0; n < NUM_TEST; n++) {
+			unsigned int tot_lcore = rte_lcore_count();
+			if (tot_lcore < rwc_core_cnt[n] + 1)
+				goto finish;
+
+			printf("\nNumber of readers: %u\n", rwc_core_cnt[n]);
+
+			rte_atomic64_clear(&greads);
+			rte_atomic64_clear(&gread_cycles);
+
+			rte_hash_reset(tbl_rwc_test_param.h);
+			writer_done = 0;
+
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				rte_eal_remote_launch(test_rwc_reader,
+						(void *)(uintptr_t)read_type,
+							enabled_core_ids[i]);
+			ret = write_keys(key_shift);
+			writer_done = 1;
+			rte_eal_mp_wait_lcore();
+
+			if (ret < 0)
+				goto err;
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				if (lcore_config[i].ret < 0)
+					goto err;
+
+			unsigned long long cycles_per_lookup =
+				rte_atomic64_read(&gread_cycles) /
+				rte_atomic64_read(&greads);
+			rwc_perf_results->w_no_ks_r_miss[m][n]
+						= cycles_per_lookup;
+			printf("Cycles per lookup: %llu\n", cycles_per_lookup);
+		}
+	}
+
+finish:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return 0;
+
+err:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return -1;
+}
+
+/*
+ * Test lookup perf:
+ * Reader(s) lookup keys present in the table and not likely to be on the
+ * shift path  while 'Main' thread adds keys causing key-shifts.
+ */
+static int
+test_hash_add_ks_lookup_hit_non_sp(struct rwc_perf *rwc_perf_results,
+				    int rwc_lf, int htm)
+{
+	unsigned int n, m;
+	uint64_t i;
+	int use_jhash = 0;
+	int ret;
+	uint8_t key_shift;
+	uint8_t read_type = READ_PASS_NON_SHIFT_PATH;
+
+	rte_atomic64_init(&greads);
+	rte_atomic64_init(&gread_cycles);
+
+	if (init_params(rwc_lf, use_jhash, htm) != 0)
+		goto err;
+	printf("\nTest: Hash add - key shift, Hash lookup - hit"
+	       " (non-shift-path)\n");
+	for (m = 0; m < 2; m++) {
+		if (m == 1) {
+			printf("\n** With bulk-lookup **\n");
+			read_type |= BULK_LOOKUP;
+		}
+		for (n = 0; n < NUM_TEST; n++) {
+			unsigned int tot_lcore = rte_lcore_count();
+			if (tot_lcore < rwc_core_cnt[n] + 1)
+				goto finish;
+
+			printf("\nNumber of readers: %u\n", rwc_core_cnt[n]);
+
+			rte_atomic64_clear(&greads);
+			rte_atomic64_clear(&gread_cycles);
+
+			rte_hash_reset(tbl_rwc_test_param.h);
+			writer_done = 0;
+			key_shift = 0;
+			if (write_keys(key_shift) < 0)
+				goto err;
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				rte_eal_remote_launch(test_rwc_reader,
+						(void *)(uintptr_t)read_type,
+							enabled_core_ids[i]);
+			key_shift = 1;
+			ret = write_keys(key_shift);
+			writer_done = 1;
+			rte_eal_mp_wait_lcore();
+
+			if (ret < 0)
+				goto err;
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				if (lcore_config[i].ret < 0)
+					goto err;
+
+			unsigned long long cycles_per_lookup =
+				rte_atomic64_read(&gread_cycles) /
+				rte_atomic64_read(&greads);
+			rwc_perf_results->w_ks_r_hit_nsp[m][n]
+						= cycles_per_lookup;
+			printf("Cycles per lookup: %llu\n", cycles_per_lookup);
+		}
+	}
+
+finish:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return 0;
+
+err:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return -1;
+}
+
+/*
+ * Test lookup perf:
+ * Reader(s) lookup keys present in the table and likely on the shift-path while
+ * 'Main' thread adds keys causing key-shifts.
+ */
+static int
+test_hash_add_ks_lookup_hit_sp(struct rwc_perf *rwc_perf_results, int rwc_lf,
+				int htm)
+{
+	unsigned int n, m;
+	uint64_t i;
+	int use_jhash = 0;
+	int ret;
+	uint8_t key_shift;
+	uint8_t read_type = READ_PASS_SHIFT_PATH;
+
+	rte_atomic64_init(&greads);
+	rte_atomic64_init(&gread_cycles);
+
+	if (init_params(rwc_lf, use_jhash, htm) != 0)
+		goto err;
+	printf("\nTest: Hash add - key shift, Hash lookup - hit (shift-path)"
+	       "\n");
+
+	for (m = 0; m < 2; m++) {
+		if (m == 1) {
+			printf("\n** With bulk-lookup **\n");
+			read_type |= BULK_LOOKUP;
+		}
+		for (n = 0; n < NUM_TEST; n++) {
+			unsigned int tot_lcore = rte_lcore_count();
+			if (tot_lcore < rwc_core_cnt[n])
+				goto finish;
+
+			printf("\nNumber of readers: %u\n", rwc_core_cnt[n]);
+			rte_atomic64_clear(&greads);
+			rte_atomic64_clear(&gread_cycles);
+
+			rte_hash_reset(tbl_rwc_test_param.h);
+			writer_done = 0;
+			key_shift = 0;
+			if (write_keys(key_shift) < 0)
+				goto err;
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				rte_eal_remote_launch(test_rwc_reader,
+						(void *)(uintptr_t)read_type,
+						enabled_core_ids[i]);
+			key_shift = 1;
+			ret = write_keys(key_shift);
+			writer_done = 1;
+			rte_eal_mp_wait_lcore();
+
+			if (ret < 0)
+				goto err;
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				if (lcore_config[i].ret < 0)
+					goto err;
+
+			unsigned long long cycles_per_lookup =
+				rte_atomic64_read(&gread_cycles) /
+				rte_atomic64_read(&greads);
+			rwc_perf_results->w_ks_r_hit_sp[m][n]
+						= cycles_per_lookup;
+			printf("Cycles per lookup: %llu\n", cycles_per_lookup);
+		}
+	}
+
+finish:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return 0;
+
+err:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return -1;
+}
+
+/*
+ * Test lookup perf:
+ * Reader(s) lookup keys absent in the table while
+ * 'Main' thread adds keys causing key-shifts.
+ */
+static int
+test_hash_add_ks_lookup_miss(struct rwc_perf *rwc_perf_results, int rwc_lf, int
+			     htm)
+{
+	unsigned int n, m;
+	uint64_t i;
+	int use_jhash = 0;
+	int ret;
+	uint8_t key_shift;
+	uint8_t read_type = READ_FAIL;
+
+	rte_atomic64_init(&greads);
+	rte_atomic64_init(&gread_cycles);
+
+	if (init_params(rwc_lf, use_jhash, htm) != 0)
+		goto err;
+	printf("\nTest: Hash add - key shift, Hash lookup - miss\n");
+	for (m = 0; m < 2; m++) {
+		if (m == 1) {
+			printf("\n** With bulk-lookup **\n");
+			read_type |= BULK_LOOKUP;
+		}
+		for (n = 0; n < NUM_TEST; n++) {
+			unsigned int tot_lcore = rte_lcore_count();
+			if (tot_lcore < rwc_core_cnt[n] + 1)
+				goto finish;
+
+			printf("\nNumber of readers: %u\n", rwc_core_cnt[n]);
+
+			rte_atomic64_clear(&greads);
+			rte_atomic64_clear(&gread_cycles);
+
+			rte_hash_reset(tbl_rwc_test_param.h);
+			writer_done = 0;
+			key_shift = 0;
+			if (write_keys(key_shift) < 0)
+				goto err;
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				rte_eal_remote_launch(test_rwc_reader,
+						(void *)(uintptr_t)read_type,
+							enabled_core_ids[i]);
+			key_shift = 1;
+			ret = write_keys(key_shift);
+			writer_done = 1;
+			rte_eal_mp_wait_lcore();
+
+			if (ret < 0)
+				goto err;
+			for (i = 1; i <= rwc_core_cnt[n]; i++)
+				if (lcore_config[i].ret < 0)
+					goto err;
+
+			unsigned long long cycles_per_lookup =
+				rte_atomic64_read(&gread_cycles) /
+				rte_atomic64_read(&greads);
+			rwc_perf_results->w_ks_r_miss[m][n] = cycles_per_lookup;
+			printf("Cycles per lookup: %llu\n", cycles_per_lookup);
+		}
+	}
+
+finish:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return 0;
+
+err:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return -1;
+}
+
+/*
+ * Test lookup perf for multi-writer:
+ * Reader(s) lookup keys present in the table and likely on the shift-path while
+ * Writers add keys causing key-shiftsi.
+ * Writers are running in parallel, on different data plane cores.
+ */
+static int
+test_hash_multi_add_lookup(struct rwc_perf *rwc_perf_results, int rwc_lf,
+			   int htm)
+{
+	unsigned int n, m, k;
+	uint64_t i;
+	int use_jhash = 0;
+	uint8_t key_shift;
+	uint8_t read_type = READ_PASS_SHIFT_PATH;
+
+	rte_atomic64_init(&greads);
+	rte_atomic64_init(&gread_cycles);
+
+	if (init_params(rwc_lf, use_jhash, htm) != 0)
+		goto err;
+	printf("\nTest: Multi-add-lookup\n");
+	uint8_t pos_core;
+	for (m = 1; m < NUM_TEST; m++) {
+		/* Calculate keys added by each writer */
+		tbl_rwc_test_param.single_insert =
+			tbl_rwc_test_param.count_keys_ks / rwc_core_cnt[m];
+		for (k = 0; k < 2; k++) {
+			if (k == 1) {
+				printf("\n** With bulk-lookup **\n");
+				read_type |= BULK_LOOKUP;
+			}
+			for (n = 0; n < NUM_TEST; n++) {
+				unsigned int tot_lcore	= rte_lcore_count();
+				if (tot_lcore < (rwc_core_cnt[n] +
+				     rwc_core_cnt[m] + 1))
+					goto finish;
+
+				printf("\nNumber of writers: %u",
+				       rwc_core_cnt[m]);
+				printf("\nNumber of readers: %u\n",
+				       rwc_core_cnt[n]);
+
+				rte_atomic64_clear(&greads);
+				rte_atomic64_clear(&gread_cycles);
+
+				rte_hash_reset(tbl_rwc_test_param.h);
+				writer_done = 0;
+				for (i = 0; i < 4; i++)
+					multi_writer_done[i] = 0;
+				key_shift = 0;
+				if (write_keys(key_shift) < 0)
+					goto err;
+
+				/* Launch reader(s) */
+				for (i = 1; i <= rwc_core_cnt[n]; i++)
+					rte_eal_remote_launch(test_rwc_reader,
+						(void *)(uintptr_t)read_type,
+						enabled_core_ids[i]);
+				key_shift = 1;
+				pos_core = 0;
+
+				/* Launch writers */
+				for (; i <= rwc_core_cnt[m]
+				     + rwc_core_cnt[n];	i++) {
+					rte_eal_remote_launch
+						(test_rwc_multi_writer,
+						(void *)(uintptr_t)pos_core,
+						enabled_core_ids[i]);
+					pos_core++;
+				}
+
+				/* Wait for writers to complete */
+				for (i = 0; i < rwc_core_cnt[m]; i++)
+					while
+						(multi_writer_done[i] == 0);
+				writer_done = 1;
+
+				rte_eal_mp_wait_lcore();
+
+				for (i = 1; i <= rwc_core_cnt[n]; i++)
+					if (lcore_config[i].ret < 0)
+						goto err;
+
+				unsigned long long cycles_per_lookup =
+					rte_atomic64_read(&gread_cycles)
+					/ rte_atomic64_read(&greads);
+				rwc_perf_results->multi_rw[m][k][n]
+					= cycles_per_lookup;
+				printf("Cycles per lookup: %llu\n",
+				       cycles_per_lookup);
+			}
+		}
+	}
+
+finish:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return 0;
+
+err:
+	rte_hash_free(tbl_rwc_test_param.h);
+	return -1;
+}
+
+static int
+test_hash_readwrite_lf_main(void)
+{
+	/*
+	 * Variables used to choose different tests.
+	 * rwc_lf indicates if read-write concurrency lock-free support is
+	 * enabled.
+	 * htm indicates if Hardware transactional memory support is enabled.
+	 */
+	int rwc_lf = 0;
+	int htm;
+	int use_jhash = 0;
+	if (rte_lcore_count() == 1) {
+		printf("More than one lcore is required "
+			"to do read write lock-free concurrency test\n");
+		return -1;
+	}
+
+	setlocale(LC_NUMERIC, "");
+
+	if (rte_tm_supported())
+		htm = 1;
+	else
+		htm = 0;
+
+	if (init_params(rwc_lf, use_jhash, htm) != 0)
+		return -1;
+	if (generate_keys() != 0)
+		return -1;
+	if (get_enabled_cores_list() != 0)
+		return -1;
+
+	if (RTE_HASH_EXTRA_FLAGS_RW_CONCURRENCY_LF) {
+		rwc_lf = 1;
+		printf("Test lookup with read-write concurrency lock free support"
+		       " enabled\n");
+		if (test_hash_add_no_ks_lookup_hit(&rwc_lf_results, rwc_lf,
+							htm) < 0)
+			return -1;
+		if (test_hash_add_no_ks_lookup_miss(&rwc_lf_results, rwc_lf,
+							htm) < 0)
+			return -1;
+		if (test_hash_add_ks_lookup_hit_non_sp(&rwc_lf_results, rwc_lf,
+							htm) < 0)
+			return -1;
+		if (test_hash_add_ks_lookup_hit_sp(&rwc_lf_results, rwc_lf,
+							htm) < 0)
+			return -1;
+		if (test_hash_add_ks_lookup_miss(&rwc_lf_results, rwc_lf, htm)
+							< 0)
+			return -1;
+		if (test_hash_multi_add_lookup(&rwc_lf_results, rwc_lf, htm)
+							< 0)
+			return -1;
+	}
+	printf("\nTest lookup with read-write concurrency lock free support"
+	       " disabled\n");
+	rwc_lf = 0;
+	if (!htm) {
+		printf("With HTM Disabled\n");
+		if (!RUN_WITH_HTM_DISABLED) {
+			printf("Enable RUN_WITH_HTM_DISABLED to test with"
+			       " lock-free disabled");
+			goto results;
+		}
+	} else
+		printf("With HTM Enabled\n");
+	if (test_hash_add_no_ks_lookup_hit(&rwc_non_lf_results, rwc_lf, htm)
+						< 0)
+		return -1;
+	if (test_hash_add_no_ks_lookup_miss(&rwc_non_lf_results, rwc_lf, htm)
+						< 0)
+		return -1;
+	if (test_hash_add_ks_lookup_hit_non_sp(&rwc_non_lf_results, rwc_lf,
+						htm) < 0)
+		return -1;
+	if (test_hash_add_ks_lookup_hit_sp(&rwc_non_lf_results, rwc_lf, htm)
+						< 0)
+		return -1;
+	if (test_hash_add_ks_lookup_miss(&rwc_non_lf_results, rwc_lf, htm) < 0)
+		return -1;
+	if (test_hash_multi_add_lookup(&rwc_non_lf_results, rwc_lf, htm) < 0)
+		return -1;
+results:
+	printf("\n\t\t\t\t\t\t********** Results summary **********\n\n");
+	int i, j, k;
+	for (j = 0; j < 2; j++) {
+		if (j == 1)
+			printf("\n\t\t\t\t\t#######********** Bulk Lookup "
+			       "**********#######\n\n");
+		printf("_______\t\t_______\t\t_________\t___\t\t_________\t\t"
+			"\t\t\t\t_________________\n");
+		printf("Writers\t\tReaders\t\tLock-free\tHTM\t\tTest-case\t\t\t"
+		       "\t\t\tCycles per lookup\n");
+		printf("_______\t\t_______\t\t_________\t___\t\t_________\t\t\t"
+		       "\t\t\t_________________\n");
+		for (i = 0; i < NUM_TEST; i++) {
+			printf("%u\t\t%u\t\t", 1, rwc_core_cnt[i]);
+			printf("Enabled\t\t");
+			printf("N/A\t\t");
+			printf("Hash add - no key-shifts, lookup - hit\t\t\t\t"
+				"%u\n\t\t\t\t\t\t\t\t",
+				rwc_lf_results.w_no_ks_r_hit[j][i]);
+			printf("Hash add - no key-shifts, lookup - miss\t\t\t\t"
+				"%u\n\t\t\t\t\t\t\t\t",
+				rwc_lf_results.w_no_ks_r_miss[j][i]);
+			printf("Hash add - key-shifts, lookup - hit"
+			       "(non-shift-path)\t\t%u\n\t\t\t\t\t\t\t\t",
+			       rwc_lf_results.w_ks_r_hit_nsp[j][i]);
+			printf("Hash add - key-shifts, lookup - hit "
+			       "(shift-path)\t\t%u\n\t\t\t\t\t\t\t\t",
+			       rwc_lf_results.w_ks_r_hit_sp[j][i]);
+			printf("Hash add - key-shifts, Hash lookup miss\t\t\t\t"
+				"%u\n\n\t\t\t\t",
+				rwc_lf_results.w_ks_r_miss[j][i]);
+
+			printf("Disabled\t");
+			if (htm)
+				printf("Enabled\t\t");
+			else
+				printf("Disabled\t");
+			printf("Hash add - no key-shifts, lookup - hit\t\t\t\t"
+				"%u\n\t\t\t\t\t\t\t\t",
+				rwc_non_lf_results.w_no_ks_r_hit[j][i]);
+			printf("Hash add - no key-shifts, lookup - miss\t\t\t\t"
+				"%u\n\t\t\t\t\t\t\t\t",
+				rwc_non_lf_results.w_no_ks_r_miss[j][i]);
+			printf("Hash add - key-shifts, lookup - hit "
+			       "(non-shift-path)\t\t%u\n\t\t\t\t\t\t\t\t",
+			       rwc_non_lf_results.w_ks_r_hit_nsp[j][i]);
+			printf("Hash add - key-shifts, lookup - hit "
+			       "(shift-path)\t\t%u\n\t\t\t\t\t\t\t\t",
+			       rwc_non_lf_results.w_ks_r_hit_sp[j][i]);
+			printf("Hash add - key-shifts, Hash lookup miss\t\t\t\t"
+			       "%u\n", rwc_non_lf_results.w_ks_r_miss[j][i]);
+
+			printf("_______\t\t_______\t\t_________\t___\t\t"
+			       "_________\t\t\t\t\t\t_________________\n");
+		}
+
+		for (i = 1; i < NUM_TEST; i++) {
+			for (k = 0; k < NUM_TEST; k++) {
+				printf("%u", rwc_core_cnt[i]);
+				printf("\t\t%u\t\t", rwc_core_cnt[k]);
+				printf("Enabled\t\t");
+				printf("N/A\t\t");
+				printf("Multi-add-lookup\t\t\t\t\t\t%u\n\n\t\t"
+				       "\t\t",
+				       rwc_lf_results.multi_rw[i][j][k]);
+				printf("Disabled\t");
+				if (htm)
+					printf("Enabled\t\t");
+				else
+					printf("Disabled\t");
+				printf("Multi-add-lookup\t\t\t\t\t\t%u\n",
+				       rwc_non_lf_results.multi_rw[i][j][k]);
+
+				printf("_______\t\t_______\t\t_________\t___"
+				       "\t\t_________\t\t\t\t\t\t"
+				       "_________________\n");
+			}
+		}
+	}
+	rte_free(tbl_rwc_test_param.keys);
+	rte_free(tbl_rwc_test_param.keys_no_ks);
+	rte_free(tbl_rwc_test_param.keys_ks);
+	rte_free(tbl_rwc_test_param.keys_absent);
+	rte_free(tbl_rwc_test_param.keys_shift_path);
+	rte_free(scanned_bkts);
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(hash_readwrite_lf_autotest, test_hash_readwrite_lf_main);
-- 
2.7.4

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash
  2018-10-26  5:37 [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Honnappa Nagarahalli
                   ` (4 preceding siblings ...)
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 5/5] test/hash: read-write lock-free concurrency test Honnappa Nagarahalli
@ 2018-10-26 10:52 ` Thomas Monjalon
  5 siblings, 0 replies; 19+ messages in thread
From: Thomas Monjalon @ 2018-10-26 10:52 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: dev, bruce.richardson, pablo.de.lara.guarch, yipeng1.wang,
	dharmik.thakkar, gavin.hu, nd

> Dharmik Thakkar (1):
>   test/hash: read-write lock-free concurrency test
> 
> Honnappa Nagarahalli (4):
>   hash: separate multi-writer from rw-concurrency
>   hash: support do not free on delete
>   hash: fix key store element alignment
>   hash: add lock-free read-write concurrency

Applied with doc squashed, thanks

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency Honnappa Nagarahalli
@ 2018-11-03 11:52   ` Jerin Jacob
  2018-11-03 15:40     ` Jerin Jacob
  0 siblings, 1 reply; 19+ messages in thread
From: Jerin Jacob @ 2018-11-03 11:52 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: bruce.richardson, pablo.de.lara.guarch, dev, yipeng1.wang,
	dharmik.thakkar, gavin.hu, nd, thomas, ferruh.yigit,
	hemant.agrawal

-----Original Message-----
> Date: Fri, 26 Oct 2018 00:37:32 -0500
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> To: bruce.richardson@intel.com, pablo.de.lara.guarch@intel.com
> CC: dev@dpdk.org, yipeng1.wang@intel.com, honnappa.nagarahalli@arm.com,
>  dharmik.thakkar@arm.com, gavin.hu@arm.com, nd@arm.com
> Subject: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write
>  concurrency
> X-Mailer: git-send-email 2.7.4
> 
> 
> Add lock-free read-write concurrency. This is achieved by the
> following changes.
> 
> 1) Add memory ordering to avoid race conditions. The only race
> condition that can occur is -  using the key store element
> before the key write is completed. Hence, while inserting the element
> the release memory order is used. Any other race condition is caught
> by the key comparison. Memory orderings are added only where needed.
> For ex: reads in the writer's context do not need memory ordering
> as there is a single writer.
> 
> key_idx in the bucket entry and pdata in the key store element are
> used for synchronisation. key_idx is used to release an inserted
> entry in the bucket to the reader. Use of pdata for synchronisation
> is required due to updation of an existing entry where-in only
> the pdata is updated without updating key_idx.
> 
> 2) Reader-writer concurrency issue, caused by moving the keys
> to their alternative locations during key insert, is solved
> by introducing a global counter(tbl_chng_cnt) indicating a
> change in table.
> 
> 3) Add the flag to enable reader-writer concurrency during
> run time.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Hi Honnappa,

This patch is causing _~24%_ performance regression on mpps/core with 64B
packet with l3fwd in EM mode with octeontx.

Example command to reproduce with 2 core+2 port l3fwd in hash mode(-E)

# l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"

Observations:
1) When hash lookup is _success_ then regression is only 3%. Which is kind of
make sense because additional new atomic instructions

What I meant by lookup is _success_ is: 
Configuring traffic gen like below to match lookup as defined 
ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c

dest.ip      port0    201.0.0.0
src.ip       port0    200.20.0.1
dest.port    port0    102
src.port     port0    12

dest.ip      port1    101.0.0.0
src.ip       port1    100.10.0.1
dest.port    port1    101
src.port     port1    11

tx.type      IPv4+TCP



2) When hash lookup _fails_ the per core mpps regression comes around 24% with 64B packet size.

What I meant by lookup is _failure_ is: 
Configuring traffic gen not to hit the 5 tuples defined in 
ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c


3) perf top _without_ this patch
  37.30%  l3fwd         [.] em_main_loop
  22.40%  l3fwd         [.] rte_hash_lookup
  13.05%  l3fwd         [.] nicvf_recv_pkts_cksum
   9.70%  l3fwd         [.] nicvf_xmit_pkts
   6.18%  l3fwd         [.] ipv4_hash_crc
   4.77%  l3fwd         [.] nicvf_fill_rbdr
   4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
   1.16%  libc-2.28.so  [.] memcpy
   0.47%  l3fwd         [.] common_ring_mp_enqueue
   0.44%  l3fwd         [.] common_ring_mc_dequeue
   0.03%  l3fwd         [.] strerror_r@plt

4) perf top with this patch

  47.41%  l3fwd         [.] rte_hash_lookup
  23.55%  l3fwd         [.] em_main_loop
   9.53%  l3fwd         [.] nicvf_recv_pkts_cksum
   6.95%  l3fwd         [.] nicvf_xmit_pkts
   4.63%  l3fwd         [.] ipv4_hash_crc
   3.30%  l3fwd         [.] nicvf_fill_rbdr
   3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
   0.76%  libc-2.28.so  [.] memcpy
   0.30%  l3fwd         [.] common_ring_mp_enqueue
   0.25%  l3fwd         [.] common_ring_mc_dequeue
   0.04%  l3fwd         [.] strerror_r@plt


5) Based on assembly, most of the cycles spends in rte_hash_lookup
around  key_idx = __atomic_load_n(&bkt->key_idx[i](whose LDAR)
and "if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"


6) Since this patch is big and does 3 things are mentioned above,
it is difficult to pin point what is causing the exact issue.

But, my primary analysis shows the item (1)(adding the atomic barriers).
But I need to spend more cycles to find out the exact causes.

The use case like lwfwd in hash mode, where writer does not update
stuff in fastpath(aka insert op) will be impact with this patch.

7) Have you checked the l3fwd lookup failure use case in your environment?
if so, please share your observation and if not, could you please check it?

8) IMO, Such performance regression is not acceptable for l3fwd use case
where hash insert op will be done in slowpath.

9) Does any else facing this problem?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-03 11:52   ` Jerin Jacob
@ 2018-11-03 15:40     ` Jerin Jacob
  2018-11-06  6:07       ` Honnappa Nagarahalli
  0 siblings, 1 reply; 19+ messages in thread
From: Jerin Jacob @ 2018-11-03 15:40 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: bruce.richardson, pablo.de.lara.guarch, dev, yipeng1.wang,
	dharmik.thakkar, gavin.hu, nd, thomas, ferruh.yigit,
	hemant.agrawal, chaozhu

-----Original Message-----
> Date: Sat, 3 Nov 2018 11:52:54 +0000
> From: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> To: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> CC: "bruce.richardson@intel.com" <bruce.richardson@intel.com>,
>  "pablo.de.lara.guarch@intel.com" <pablo.de.lara.guarch@intel.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "yipeng1.wang@intel.com"
>  <yipeng1.wang@intel.com>, "dharmik.thakkar@arm.com"
>  <dharmik.thakkar@arm.com>, "gavin.hu@arm.com" <gavin.hu@arm.com>,
>  "nd@arm.com" <nd@arm.com>, "thomas@monjalon.net" <thomas@monjalon.net>,
>  "ferruh.yigit@intel.com" <ferruh.yigit@intel.com>,
>  "hemant.agrawal@nxp.com" <hemant.agrawal@nxp.com>
> Subject: Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write
>  concurrency
> 
> -----Original Message-----
> > Date: Fri, 26 Oct 2018 00:37:32 -0500
> > From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > To: bruce.richardson@intel.com, pablo.de.lara.guarch@intel.com
> > CC: dev@dpdk.org, yipeng1.wang@intel.com, honnappa.nagarahalli@arm.com,
> >  dharmik.thakkar@arm.com, gavin.hu@arm.com, nd@arm.com
> > Subject: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write
> >  concurrency
> > X-Mailer: git-send-email 2.7.4
> >
> >
> > Add lock-free read-write concurrency. This is achieved by the
> > following changes.
> >
> > 1) Add memory ordering to avoid race conditions. The only race
> > condition that can occur is -  using the key store element
> > before the key write is completed. Hence, while inserting the element
> > the release memory order is used. Any other race condition is caught
> > by the key comparison. Memory orderings are added only where needed.
> > For ex: reads in the writer's context do not need memory ordering
> > as there is a single writer.
> >
> > key_idx in the bucket entry and pdata in the key store element are
> > used for synchronisation. key_idx is used to release an inserted
> > entry in the bucket to the reader. Use of pdata for synchronisation
> > is required due to updation of an existing entry where-in only
> > the pdata is updated without updating key_idx.
> >
> > 2) Reader-writer concurrency issue, caused by moving the keys
> > to their alternative locations during key insert, is solved
> > by introducing a global counter(tbl_chng_cnt) indicating a
> > change in table.
> >
> > 3) Add the flag to enable reader-writer concurrency during
> > run time.
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> 
> Hi Honnappa,
> 
> This patch is causing _~24%_ performance regression on mpps/core with 64B
> packet with l3fwd in EM mode with octeontx.
> 
> Example command to reproduce with 2 core+2 port l3fwd in hash mode(-E)
> 
> # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"
> 
> Observations:
> 1) When hash lookup is _success_ then regression is only 3%. Which is kind of
> make sense because additional new atomic instructions
> 
> What I meant by lookup is _success_ is:
> Configuring traffic gen like below to match lookup as defined
> ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> 
> dest.ip      port0    201.0.0.0
> src.ip       port0    200.20.0.1
> dest.port    port0    102
> src.port     port0    12
> 
> dest.ip      port1    101.0.0.0
> src.ip       port1    100.10.0.1
> dest.port    port1    101
> src.port     port1    11
> 
> tx.type      IPv4+TCP
> 
> 
> 
> 2) When hash lookup _fails_ the per core mpps regression comes around 24% with 64B packet size.
> 
> What I meant by lookup is _failure_ is:
> Configuring traffic gen not to hit the 5 tuples defined in
> ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> 
> 
> 3) perf top _without_ this patch
>   37.30%  l3fwd         [.] em_main_loop
>   22.40%  l3fwd         [.] rte_hash_lookup
>   13.05%  l3fwd         [.] nicvf_recv_pkts_cksum
>    9.70%  l3fwd         [.] nicvf_xmit_pkts
>    6.18%  l3fwd         [.] ipv4_hash_crc
>    4.77%  l3fwd         [.] nicvf_fill_rbdr
>    4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
>    1.16%  libc-2.28.so  [.] memcpy
>    0.47%  l3fwd         [.] common_ring_mp_enqueue
>    0.44%  l3fwd         [.] common_ring_mc_dequeue
>    0.03%  l3fwd         [.] strerror_r@plt
> 
> 4) perf top with this patch
> 
>   47.41%  l3fwd         [.] rte_hash_lookup
>   23.55%  l3fwd         [.] em_main_loop
>    9.53%  l3fwd         [.] nicvf_recv_pkts_cksum
>    6.95%  l3fwd         [.] nicvf_xmit_pkts
>    4.63%  l3fwd         [.] ipv4_hash_crc
>    3.30%  l3fwd         [.] nicvf_fill_rbdr
>    3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
>    0.76%  libc-2.28.so  [.] memcpy
>    0.30%  l3fwd         [.] common_ring_mp_enqueue
>    0.25%  l3fwd         [.] common_ring_mc_dequeue
>    0.04%  l3fwd         [.] strerror_r@plt
> 
> 
> 5) Based on assembly, most of the cycles spends in rte_hash_lookup
> around  key_idx = __atomic_load_n(&bkt->key_idx[i](whose LDAR)
> and "if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"
> 
> 
> 6) Since this patch is big and does 3 things are mentioned above,
> it is difficult to pin point what is causing the exact issue.
> 
> But, my primary analysis shows the item (1)(adding the atomic barriers).
> But I need to spend more cycles to find out the exact causes.


+ Adding POWERPC maintainer as mostly POWERPC also impacted on this patch.
Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just mov instruction
on x86, so x86 may not be much impacted.

I analyzed it further, it a plain LD vs __atomic_load_n(__ATOMIC_ACQUIRE) issue.

The outer __rte_hash_lookup_with_hash has only 2ish __atomic_load_n
operation which causing only around 1% regression.

But since this patch has "two" __atomic_load_n in each
search_one_bucket() and in the worst case it is looping around 16 time.s
i.e "32 LDAR per packet" explains why 24% drop in lookup miss cases
and ~3% drop in lookup success case.

So this patch's regression will be based on how many cycles an LDAR
takes on given ARMv8 platform and on how many issue(s) it can issue LDAR
instructions at given point of time.

IMO, This scheme won't work. I think, we are introducing such
performance critical feature, we need to put under function pointer scheme so that
if an application does not need such feature it can use plain loads.

Already we have a lot of flags in the hash library to define the runtime
behavior, I think, it makes sense to select function pointer based
on such flags and have a performance effective solution
based on application requirements.

Just to prove the above root cause analysis, the following patch can fix
the performance issue. I know, it is NOT correct in the context of
this patch. Just pasting in case, someone want to see the cost of LD vs
__atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.


On a different note, I think, it makes sense to use RCU based structure
in these case to avoid performance issue. liburcu has a good hash
library for such cases. (very less write and more read cases)

/Jerin

@@ -1135,27 +1134,21 @@ search_one_bucket(const struct rte_hash *h,
const void *key, uint16_t sig,
                        void **data, const struct rte_hash_bucket *bkt)
 {
        int i;
-       uint32_t key_idx;
-       void *pdata;
        struct rte_hash_key *k, *keys = h->key_store;
 
        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-               key_idx = __atomic_load_n(&bkt->key_idx[i],
-                                         __ATOMIC_ACQUIRE);
-               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT)
                {
+               if (bkt->sig_current[i] == sig &&
+                               bkt->key_idx[i] != EMPTY_SLOT) {
                        k = (struct rte_hash_key *) ((char *)keys +
-                                       key_idx * h->key_entry_size);
-                       pdata = __atomic_load_n(&k->pdata,
-                                       __ATOMIC_ACQUIRE);
-
+                                       bkt->key_idx[i] *
h->key_entry_size);
                        if (rte_hash_cmp_eq(key, k->key, h) == 0) {
                                if (data != NULL)
-                                       *data = pdata;
+                                       *data = k->pdata;
                                /*
                                 * Return index where key is stored,
                                 * subtracting the first dummy index
                                 */
-                               return key_idx - 1;
+                               return bkt->key_idx[i] - 1;
                        }
                }
        }


> 
> The use case like lwfwd in hash mode, where writer does not update
> stuff in fastpath(aka insert op) will be impact with this patch.
> 
> 7) Have you checked the l3fwd lookup failure use case in your environment?
> if so, please share your observation and if not, could you please check it?
> 
> 8) IMO, Such performance regression is not acceptable for l3fwd use case
> where hash insert op will be done in slowpath.
> 
> 9) Does anyone else facing this problem?
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-03 15:40     ` Jerin Jacob
@ 2018-11-06  6:07       ` Honnappa Nagarahalli
  2018-11-06  9:10         ` Jerin Jacob
  2018-11-07  2:15         ` Wang, Yipeng1
  0 siblings, 2 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-11-06  6:07 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: bruce.richardson, pablo.de.lara.guarch, dev, yipeng1.wang,
	Dharmik Thakkar, Gavin Hu (Arm Technology China),
	nd, thomas, ferruh.yigit, hemant.agrawal, chaozhu, nd

> > >
> > > Add lock-free read-write concurrency. This is achieved by the
> > > following changes.
> > >
> > > 1) Add memory ordering to avoid race conditions. The only race
> > > condition that can occur is -  using the key store element before
> > > the key write is completed. Hence, while inserting the element the
> > > release memory order is used. Any other race condition is caught by
> > > the key comparison. Memory orderings are added only where needed.
> > > For ex: reads in the writer's context do not need memory ordering as
> > > there is a single writer.
> > >
> > > key_idx in the bucket entry and pdata in the key store element are
> > > used for synchronisation. key_idx is used to release an inserted
> > > entry in the bucket to the reader. Use of pdata for synchronisation
> > > is required due to updation of an existing entry where-in only the
> > > pdata is updated without updating key_idx.
> > >
> > > 2) Reader-writer concurrency issue, caused by moving the keys to
> > > their alternative locations during key insert, is solved by
> > > introducing a global counter(tbl_chng_cnt) indicating a change in
> > > table.
> > >
> > > 3) Add the flag to enable reader-writer concurrency during run time.
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> >
> > Hi Honnappa,
> >
Jerin, thank you for running this test and all the analysis. I have not run this test. I was focused on simultaneous reads and writes. You can look at file test_hash_readwrite_lf.c to look for the kind of the use cases.

I am trying to reproduce this, I will get back with more details soon.

> > This patch is causing _~24%_ performance regression on mpps/core with
> > 64B packet with l3fwd in EM mode with octeontx.
> >
> > Example command to reproduce with 2 core+2 port l3fwd in hash mode(-E)
> >
Have you run with more cores (8, 16)?

> > # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"
> >
> > Observations:
> > 1) When hash lookup is _success_ then regression is only 3%. Which is
> > kind of make sense because additional new atomic instructions
> >
> > What I meant by lookup is _success_ is:
> > Configuring traffic gen like below to match lookup as defined
> > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> >
> > dest.ip      port0    201.0.0.0
> > src.ip       port0    200.20.0.1
> > dest.port    port0    102
> > src.port     port0    12
> >
> > dest.ip      port1    101.0.0.0
> > src.ip       port1    100.10.0.1
> > dest.port    port1    101
> > src.port     port1    11
> >
> > tx.type      IPv4+TCP
> >
> >
> >
> > 2) When hash lookup _fails_ the per core mpps regression comes around 24%
> with 64B packet size.
> >
> > What I meant by lookup is _failure_ is:
> > Configuring traffic gen not to hit the 5 tuples defined in
> > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> >
> >
> > 3) perf top _without_ this patch
> >   37.30%  l3fwd         [.] em_main_loop
> >   22.40%  l3fwd         [.] rte_hash_lookup
> >   13.05%  l3fwd         [.] nicvf_recv_pkts_cksum
> >    9.70%  l3fwd         [.] nicvf_xmit_pkts
> >    6.18%  l3fwd         [.] ipv4_hash_crc
> >    4.77%  l3fwd         [.] nicvf_fill_rbdr
> >    4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> >    1.16%  libc-2.28.so  [.] memcpy
> >    0.47%  l3fwd         [.] common_ring_mp_enqueue
> >    0.44%  l3fwd         [.] common_ring_mc_dequeue
> >    0.03%  l3fwd         [.] strerror_r@plt
> >
> > 4) perf top with this patch
> >
> >   47.41%  l3fwd         [.] rte_hash_lookup
> >   23.55%  l3fwd         [.] em_main_loop
> >    9.53%  l3fwd         [.] nicvf_recv_pkts_cksum
> >    6.95%  l3fwd         [.] nicvf_xmit_pkts
> >    4.63%  l3fwd         [.] ipv4_hash_crc
> >    3.30%  l3fwd         [.] nicvf_fill_rbdr
> >    3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> >    0.76%  libc-2.28.so  [.] memcpy
> >    0.30%  l3fwd         [.] common_ring_mp_enqueue
> >    0.25%  l3fwd         [.] common_ring_mc_dequeue
> >    0.04%  l3fwd         [.] strerror_r@plt
> >
> >
> > 5) Based on assembly, most of the cycles spends in rte_hash_lookup
> > around  key_idx = __atomic_load_n(&bkt->key_idx[i](whose LDAR) and "if
> > (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"
> >
> >
> > 6) Since this patch is big and does 3 things are mentioned above, it
> > is difficult to pin point what is causing the exact issue.
> >
> > But, my primary analysis shows the item (1)(adding the atomic barriers).
> > But I need to spend more cycles to find out the exact causes.
> 
> 
> + Adding POWERPC maintainer as mostly POWERPC also impacted on this
> patch.
> Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just mov instruction
> on x86, so x86 may not be much impacted.
> 
> I analyzed it further, it a plain LD vs __atomic_load_n(__ATOMIC_ACQUIRE)
> issue.
> 
> The outer __rte_hash_lookup_with_hash has only 2ish __atomic_load_n
> operation which causing only around 1% regression.
> 
> But since this patch has "two" __atomic_load_n in each
> search_one_bucket() and in the worst case it is looping around 16 time.s i.e
> "32 LDAR per packet" explains why 24% drop in lookup miss cases and ~3%
> drop in lookup success case.
Agree 'search_one_bucket' has 2 atomic loads. However, only 1 of them should be called during a lookup miss. Ideally, the second one should not be called as it is expected that the 'if' statement on the top is supposed to block it. Are you seeing the 2nd atomic load in your perf output?
Is your failure traffic having multiple flows or single flow?

> 
> So this patch's regression will be based on how many cycles an LDAR takes on
> given ARMv8 platform and on how many issue(s) it can issue LDAR
> instructions at given point of time.
Agree. There are multiple micro-architectures in Arm eco-system. We should establish few simple rules to make sure algorithms perform well on all the available platforms. I established few rules in VPP and they are working fine so far.

> 
> IMO, This scheme won't work. I think, we are introducing such performance
> critical feature, we need to put under function pointer scheme so that if an
> application does not need such feature it can use plain loads.
> 
IMO, we should do some more debugging before going into exploring other options.

> Already we have a lot of flags in the hash library to define the runtime
> behavior, I think, it makes sense to select function pointer based on such flags
> and have a performance effective solution based on application requirements.
> 
IMO, many flags can be removed by doing some cleanup exercise.

> Just to prove the above root cause analysis, the following patch can fix the
> performance issue. I know, it is NOT correct in the context of this patch. Just
> pasting in case, someone want to see the cost of LD vs
> __atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.
> 
> 
> On a different note, I think, it makes sense to use RCU based structure in these
> case to avoid performance issue. liburcu has a good hash library for such
> cases. (very less write and more read cases)
> 
I think we miss a point here. These changes are based on customer feedback and requests. DPDK is not only providing the implementation, it is providing the APIs as well. IMO, we should make sure the code we have in DPDK supports multiple use cases rather than addressing a narrow use case and pointing somewhere else for others.

> /Jerin
> 
> @@ -1135,27 +1134,21 @@ search_one_bucket(const struct rte_hash *h,
> const void *key, uint16_t sig,
>                         void **data, const struct rte_hash_bucket *bkt)  {
>         int i;
> -       uint32_t key_idx;
> -       void *pdata;
>         struct rte_hash_key *k, *keys = h->key_store;
> 
>         for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> -               key_idx = __atomic_load_n(&bkt->key_idx[i],
> -                                         __ATOMIC_ACQUIRE);
> -               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT)
>                 {
> +               if (bkt->sig_current[i] == sig &&
> +                               bkt->key_idx[i] != EMPTY_SLOT) {
>                         k = (struct rte_hash_key *) ((char *)keys +
> -                                       key_idx * h->key_entry_size);
> -                       pdata = __atomic_load_n(&k->pdata,
> -                                       __ATOMIC_ACQUIRE);
> -
> +                                       bkt->key_idx[i] *
> h->key_entry_size);
Does this make a difference for the lookup miss test case?

>                         if (rte_hash_cmp_eq(key, k->key, h) == 0) {
>                                 if (data != NULL)
> -                                       *data = pdata;
> +                                       *data = k->pdata;
>                                 /*
>                                  * Return index where key is stored,
>                                  * subtracting the first dummy index
>                                  */
> -                               return key_idx - 1;
> +                               return bkt->key_idx[i] - 1;
>                         }
>                 }
>         }
> 
> 
> >
> > The use case like lwfwd in hash mode, where writer does not update
> > stuff in fastpath(aka insert op) will be impact with this patch.
> >
I do not think 'l3fwd in hash mode' application is practical. If the aim of this application is to showcase performance, it should represent real-life use case. It should have hash inserts from control plane along with lookups. We also have to note that, the algorithm without lock-free patch was not usable for certain use cases on all platforms. It was not usable for even more use cases on Arm platforms. So, I am not sure if we should be comparing the numbers from previous algorithm.
However, there seem to be some scope for improvement with the data structure layout. I am looking into this further.

> > 7) Have you checked the l3fwd lookup failure use case in your environment?
> > if so, please share your observation and if not, could you please check it?
> >
> > 8) IMO, Such performance regression is not acceptable for l3fwd use
> > case where hash insert op will be done in slowpath.
What is missing in this application is continuous hash adds from slow path.

> >
> > 9) Does anyone else facing this problem?
Any data on x86?

> >
> >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-06  6:07       ` Honnappa Nagarahalli
@ 2018-11-06  9:10         ` Jerin Jacob
  2018-11-06  9:13           ` Thomas Monjalon
  2018-11-09  1:34           ` Honnappa Nagarahalli
  2018-11-07  2:15         ` Wang, Yipeng1
  1 sibling, 2 replies; 19+ messages in thread
From: Jerin Jacob @ 2018-11-06  9:10 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: bruce.richardson, pablo.de.lara.guarch, dev, yipeng1.wang,
	Dharmik Thakkar, Gavin Hu (Arm Technology China),
	nd, thomas, ferruh.yigit, hemant.agrawal, chaozhu, Kapoor,
	Prasun

-----Original Message-----
> Date: Tue, 6 Nov 2018 06:07:43 +0000
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "bruce.richardson@intel.com" <bruce.richardson@intel.com>,
>  "pablo.de.lara.guarch@intel.com" <pablo.de.lara.guarch@intel.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "yipeng1.wang@intel.com"
>  <yipeng1.wang@intel.com>, Dharmik Thakkar <Dharmik.Thakkar@arm.com>,
>  "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>, nd <nd@arm.com>,
>  "thomas@monjalon.net" <thomas@monjalon.net>, "ferruh.yigit@intel.com"
>  <ferruh.yigit@intel.com>, "hemant.agrawal@nxp.com"
>  <hemant.agrawal@nxp.com>, "chaozhu@linux.vnet.ibm.com"
>  <chaozhu@linux.vnet.ibm.com>, nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write
>  concurrency
> 
> 
> > > >
> > > > Add lock-free read-write concurrency. This is achieved by the
> > > > following changes.
> > > >
> > > > 1) Add memory ordering to avoid race conditions. The only race
> > > > condition that can occur is -  using the key store element before
> > > > the key write is completed. Hence, while inserting the element the
> > > > release memory order is used. Any other race condition is caught by
> > > > the key comparison. Memory orderings are added only where needed.
> > > > For ex: reads in the writer's context do not need memory ordering as
> > > > there is a single writer.
> > > >
> > > > key_idx in the bucket entry and pdata in the key store element are
> > > > used for synchronisation. key_idx is used to release an inserted
> > > > entry in the bucket to the reader. Use of pdata for synchronisation
> > > > is required due to updation of an existing entry where-in only the
> > > > pdata is updated without updating key_idx.
> > > >
> > > > 2) Reader-writer concurrency issue, caused by moving the keys to
> > > > their alternative locations during key insert, is solved by
> > > > introducing a global counter(tbl_chng_cnt) indicating a change in
> > > > table.
> > > >
> > > > 3) Add the flag to enable reader-writer concurrency during run time.
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > >
> > > Hi Honnappa,
> > >
> Jerin, thank you for running this test and all the analysis. I have not run this test. I was focused on simultaneous reads and writes. You can look at file test_hash_readwrite_lf.c to look for the kind of the use cases.
> 
> I am trying to reproduce this, I will get back with more details soon.

OK. Since RC2 approaching, bit worried about next steps if we cannot
avoid the regression with this new feature for this release.

24% drop of performance regression not seen in past for a specific
usecase.So not sure what we did past for similar situations.

Thomas,
Any thought of this?


> 
> > > This patch is causing _~24%_ performance regression on mpps/core with
> > > 64B packet with l3fwd in EM mode with octeontx.
> > >
> > > Example command to reproduce with 2 core+2 port l3fwd in hash mode(-E)
> > >
> Have you run with more cores (8, 16)?

Tried with 4, 8, 16 cores, performance drop is linear.


> 
> > > # l3fwd -v -c 0xf00000 -n 4 -- -P -E -p 0x3 --config="(0, 0, 23),(1, 0, 22)"
> > >
> > > Observations:
> > > 1) When hash lookup is _success_ then regression is only 3%. Which is
> > > kind of make sense because additional new atomic instructions
> > >
> > > What I meant by lookup is _success_ is:
> > > Configuring traffic gen like below to match lookup as defined
> > > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> > >
> > > dest.ip      port0    201.0.0.0
> > > src.ip       port0    200.20.0.1
> > > dest.port    port0    102
> > > src.port     port0    12
> > >
> > > dest.ip      port1    101.0.0.0
> > > src.ip       port1    100.10.0.1
> > > dest.port    port1    101
> > > src.port     port1    11
> > >
> > > tx.type      IPv4+TCP
> > >
> > >
> > >
> > > 2) When hash lookup _fails_ the per core mpps regression comes around 24%
> > with 64B packet size.
> > >
> > > What I meant by lookup is _failure_ is:
> > > Configuring traffic gen not to hit the 5 tuples defined in
> > > ipv4_l3fwd_em_route_array() in examples/l3fwd/l3fwd_em.c
> > >
> > >
> > > 3) perf top _without_ this patch
> > >   37.30%  l3fwd         [.] em_main_loop
> > >   22.40%  l3fwd         [.] rte_hash_lookup
> > >   13.05%  l3fwd         [.] nicvf_recv_pkts_cksum
> > >    9.70%  l3fwd         [.] nicvf_xmit_pkts
> > >    6.18%  l3fwd         [.] ipv4_hash_crc
> > >    4.77%  l3fwd         [.] nicvf_fill_rbdr
> > >    4.50%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> > >    1.16%  libc-2.28.so  [.] memcpy
> > >    0.47%  l3fwd         [.] common_ring_mp_enqueue
> > >    0.44%  l3fwd         [.] common_ring_mc_dequeue
> > >    0.03%  l3fwd         [.] strerror_r@plt
> > >
> > > 4) perf top with this patch
> > >
> > >   47.41%  l3fwd         [.] rte_hash_lookup
> > >   23.55%  l3fwd         [.] em_main_loop
> > >    9.53%  l3fwd         [.] nicvf_recv_pkts_cksum
> > >    6.95%  l3fwd         [.] nicvf_xmit_pkts
> > >    4.63%  l3fwd         [.] ipv4_hash_crc
> > >    3.30%  l3fwd         [.] nicvf_fill_rbdr
> > >    3.29%  l3fwd         [.] nicvf_single_pool_free_xmited_buffers
> > >    0.76%  libc-2.28.so  [.] memcpy
> > >    0.30%  l3fwd         [.] common_ring_mp_enqueue
> > >    0.25%  l3fwd         [.] common_ring_mc_dequeue
> > >    0.04%  l3fwd         [.] strerror_r@plt
> > >
> > >
> > > 5) Based on assembly, most of the cycles spends in rte_hash_lookup
> > > around  key_idx = __atomic_load_n(&bkt->key_idx[i](whose LDAR) and "if
> > > (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {"
> > >
> > >
> > > 6) Since this patch is big and does 3 things are mentioned above, it
> > > is difficult to pin point what is causing the exact issue.
> > >
> > > But, my primary analysis shows the item (1)(adding the atomic barriers).
> > > But I need to spend more cycles to find out the exact causes.
> >
> >
> > + Adding POWERPC maintainer as mostly POWERPC also impacted on this
> > patch.
> > Looks like __atomic_load_n(__ATOMIC_ACQUIRE) will be just mov instruction
> > on x86, so x86 may not be much impacted.
> >
> > I analyzed it further, it a plain LD vs __atomic_load_n(__ATOMIC_ACQUIRE)
> > issue.
> >
> > The outer __rte_hash_lookup_with_hash has only 2ish __atomic_load_n
> > operation which causing only around 1% regression.
> >
> > But since this patch has "two" __atomic_load_n in each
> > search_one_bucket() and in the worst case it is looping around 16 time.s i.e
> > "32 LDAR per packet" explains why 24% drop in lookup miss cases and ~3%
> > drop in lookup success case.
> Agree 'search_one_bucket' has 2 atomic loads. However, only 1 of them should be called during a lookup miss. Ideally, the second one should not be called as it is expected that the 'if' statement on the top is supposed to block it. Are you seeing the 2nd atomic load in your perf output?
> Is your failure traffic having multiple flows or single flow?

l3fwd has only 4 entries. So yes, it is single flow.

> 
> >
> > So this patch's regression will be based on how many cycles an LDAR takes on
> > given ARMv8 platform and on how many issue(s) it can issue LDAR
> > instructions at given point of time.
> Agree. There are multiple micro-architectures in Arm eco-system. We should establish few simple rules to make sure algorithms perform well on all the available platforms. I established few rules in VPP and they are working fine so far.

Can you share that rules for everyone's benefit?


> 
> >
> > IMO, This scheme won't work. I think, we are introducing such performance
> > critical feature, we need to put under function pointer scheme so that if an
> > application does not need such feature it can use plain loads.
> >
> IMO, we should do some more debugging before going into exploring other options.

Yes. But, how do we align with v18.11 release?

> 
> > Already we have a lot of flags in the hash library to define the runtime
> > behavior, I think, it makes sense to select function pointer based on such flags
> > and have a performance effective solution based on application requirements.
> >
> IMO, many flags can be removed by doing some cleanup exercise.
> 
> > Just to prove the above root cause analysis, the following patch can fix the
> > performance issue. I know, it is NOT correct in the context of this patch. Just
> > pasting in case, someone want to see the cost of LD vs
> > __atomic_load_n(__ATOMIC_ACQUIRE) on a given platform.
> >
> >
> > On a different note, I think, it makes sense to use RCU based structure in these
> > case to avoid performance issue. liburcu has a good hash library for such
> > cases. (very less write and more read cases)
> >
> I think we miss a point here. These changes are based on customer feedback and requests. DPDK is not only providing the implementation, it is providing the APIs as well. IMO, we should make sure the code we have in DPDK supports multiple use cases rather than addressing a narrow use case and pointing somewhere else for others.

Sure. But Introducing a use case should NOT harm other exiting use case
used current example applications.


> 
> >
> > >
> > > The use case like lwfwd in hash mode, where writer does not update
> > > stuff in fastpath(aka insert op) will be impact with this patch.
> > >
> I do not think 'l3fwd in hash mode' application is practical. If the aim of this application is to showcase performance, it should represent real-life use case. It should have hash inserts from control plane along with lookups. We also have to note that, the algorithm without lock-free patch was not usable for certain use cases on all platforms. It was not usable for even more use cases on Arm platforms. So, I am not sure if we should be comparing the numbers from previous algorithm.

In worst case, we need to make compile time or runtime with function
pointer or whatever it takes to fix.


> However, there seem to be some scope for improvement with the data structure layout. I am looking into this further.

OK

> 
> > > 7) Have you checked the l3fwd lookup failure use case in your environment?
> > > if so, please share your observation and if not, could you please check it?
> > >
> > > 8) IMO, Such performance regression is not acceptable for l3fwd use
> > > case where hash insert op will be done in slowpath.
> What is missing in this application is continuous hash adds from slow path.

Yes. That's one use case for hash library. Some application may not need such
use case so it should be driven from application to define the requirements,
especially a feature impacting another usecase's performance.


> 
> > >
> > > 9) Does anyone else facing this problem?
> Any data on x86?
> 
> > >
> > >

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-06  9:10         ` Jerin Jacob
@ 2018-11-06  9:13           ` Thomas Monjalon
  2018-11-06  9:47             ` Jerin Jacob
  2018-11-09  1:34           ` Honnappa Nagarahalli
  1 sibling, 1 reply; 19+ messages in thread
From: Thomas Monjalon @ 2018-11-06  9:13 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: Honnappa Nagarahalli, bruce.richardson, pablo.de.lara.guarch,
	dev, yipeng1.wang, Dharmik Thakkar,
	Gavin Hu (Arm Technology China),
	nd, ferruh.yigit, hemant.agrawal, chaozhu, Kapoor, Prasun

06/11/2018 10:10, Jerin Jacob:
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > Jerin, thank you for running this test and all the analysis. I have not run this test. I was focused on simultaneous reads and writes. You can look at file test_hash_readwrite_lf.c to look for the kind of the use cases.
> > 
> > I am trying to reproduce this, I will get back with more details soon.
> 
> OK. Since RC2 approaching, bit worried about next steps if we cannot
> avoid the regression with this new feature for this release.

18.11-rc2 is already out.

> 24% drop of performance regression not seen in past for a specific
> usecase.So not sure what we did past for similar situations.
> 
> Thomas,
> Any thought of this?

Everything is possible, you just need to agree on a patch.
Please let's fix (or disable or revert) it in -rc3 this week.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-06  9:13           ` Thomas Monjalon
@ 2018-11-06  9:47             ` Jerin Jacob
  0 siblings, 0 replies; 19+ messages in thread
From: Jerin Jacob @ 2018-11-06  9:47 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: Honnappa Nagarahalli, bruce.richardson, pablo.de.lara.guarch,
	dev, yipeng1.wang, Dharmik Thakkar,
	Gavin Hu (Arm Technology China),
	nd, ferruh.yigit, hemant.agrawal, chaozhu, Kapoor, Prasun

-----Original Message-----
> Date: Tue, 06 Nov 2018 10:13:53 +0100
> From: Thomas Monjalon <thomas@monjalon.net>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>,
>  "bruce.richardson@intel.com" <bruce.richardson@intel.com>,
>  "pablo.de.lara.guarch@intel.com" <pablo.de.lara.guarch@intel.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "yipeng1.wang@intel.com"
>  <yipeng1.wang@intel.com>, Dharmik Thakkar <Dharmik.Thakkar@arm.com>,
>  "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>, nd <nd@arm.com>,
>  "ferruh.yigit@intel.com" <ferruh.yigit@intel.com>,
>  "hemant.agrawal@nxp.com" <hemant.agrawal@nxp.com>,
>  "chaozhu@linux.vnet.ibm.com" <chaozhu@linux.vnet.ibm.com>, "Kapoor,
>  Prasun" <Prasun.Kapoor@cavium.com>
> Subject: Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write
>  concurrency
> 
> 06/11/2018 10:10, Jerin Jacob:
> > From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> > > Jerin, thank you for running this test and all the analysis. I have not run this test. I was focused on simultaneous reads and writes. You can look at file test_hash_readwrite_lf.c to look for the kind of the use cases.
> > >
> > > I am trying to reproduce this, I will get back with more details soon.
> >
> > OK. Since RC2 approaching, bit worried about next steps if we cannot
> > avoid the regression with this new feature for this release.
> 
> 18.11-rc2 is already out.
> 
> > 24% drop of performance regression not seen in past for a specific
> > usecase.So not sure what we did past for similar situations.
> >
> > Thomas,
> > Any thought of this?
> 
> Everything is possible, you just need to agree on a patch.
> Please let's fix (or disable or revert) it in -rc3 this week.

I let Honnappa to decide fix, disable or revert action for rc3 release.
But it should either one of them for rc3 not this patch as is.

> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-06  6:07       ` Honnappa Nagarahalli
  2018-11-06  9:10         ` Jerin Jacob
@ 2018-11-07  2:15         ` Wang, Yipeng1
  2018-11-09  0:47           ` Honnappa Nagarahalli
  1 sibling, 1 reply; 19+ messages in thread
From: Wang, Yipeng1 @ 2018-11-07  2:15 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Jerin Jacob
  Cc: Richardson, Bruce, De Lara Guarch, Pablo, dev, Dharmik Thakkar,
	Gavin Hu (Arm Technology China),
	nd, thomas, Yigit, Ferruh, hemant.agrawal, chaozhu, nd, Gobriel,
	Sameh

>-----Original Message-----
>From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
>Sent: Monday, November 5, 2018 10:08 PM
>To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
>Cc: Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo <pablo.de.lara.guarch@intel.com>; dev@dpdk.org; Wang,
>Yipeng1 <yipeng1.wang@intel.com>; Dharmik Thakkar <Dharmik.Thakkar@arm.com>; Gavin Hu (Arm Technology China)
><Gavin.Hu@arm.com>; nd <nd@arm.com>; thomas@monjalon.net; Yigit, Ferruh <ferruh.yigit@intel.com>;
>hemant.agrawal@nxp.com; chaozhu@linux.vnet.ibm.com; nd <nd@arm.com>
>Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
>> >
>> > 9) Does anyone else facing this problem?
>Any data on x86?
>
[Wang, Yipeng] 
I tried Jerin's tests on x86. So by default l3fwd on x86 will use lookup_bulk and SIMD instruction so there is no obvious throughput
drop on both hit and miss cases (for hit case, there is about 2.5% drop though).

I manually changed l3fwd  to do single packet lookup instead of bulk. For hit case there is no throughput drop.
For miss case, there is 10% throughput drop.

I dig into it, as expected, atomic load indeed translates to regular mov on x86. 
But since the reordering of the instruction, the compiler(gcc 5.4)
cannot unroll the for loop to a switch-case like assembly as before. 
So I believe the reason of performance drops on x86 is because compiler cannot optimize the code as well as previously.
I guess this is totally different reason from why
your performance drop on non-TSO machine. On non-TSO machine, probably the excessive number of atomic load
causes a lot of overhead.

A quick fix I found useful on x86 is to read all index together. I am no expert on the use of atomic intinsics, but I assume
By adding a fence should still maintain the correct ordering?
-       uint32_t key_idx;
+       uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
        void *pdata;
        struct rte_hash_key *k, *keys = h->key_store;

+       memcpy(key_idx, bkt->key_idx, 4 * RTE_HASH_BUCKET_ENTRIES);
+       __atomic_thread_fence(__ATOMIC_ACQUIRE);
+
        for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
-               key_idx = __atomic_load_n(&bkt->key_idx[i],
-                                         __ATOMIC_ACQUIRE);
-               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
+               if (bkt->sig_current[i] == sig && key_idx[i] != EMPTY_SLOT){

Yipeng

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-07  2:15         ` Wang, Yipeng1
@ 2018-11-09  0:47           ` Honnappa Nagarahalli
  0 siblings, 0 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-11-09  0:47 UTC (permalink / raw)
  To: Wang, Yipeng1, Jerin Jacob
  Cc: Richardson, Bruce, De Lara Guarch, Pablo, dev, Dharmik Thakkar,
	Gavin Hu (Arm Technology China),
	nd, thomas, Yigit, Ferruh, hemant.agrawal, chaozhu, nd, Gobriel,
	Sameh, nd

> >> >
> >> > 9) Does anyone else facing this problem?
> >Any data on x86?
> >
> [Wang, Yipeng]
> I tried Jerin's tests on x86. So by default l3fwd on x86 will use lookup_bulk
> and SIMD instruction so there is no obvious throughput drop on both hit
> and miss cases (for hit case, there is about 2.5% drop though).
Do you mean, if the test case has 'hit only' lookups, there is 2.5% drop?

> 
> I manually changed l3fwd  to do single packet lookup instead of bulk. For hit
> case there is no throughput drop.
> For miss case, there is 10% throughput drop.
> 
> I dig into it, as expected, atomic load indeed translates to regular mov on
> x86.
> But since the reordering of the instruction, the compiler(gcc 5.4) cannot
> unroll the for loop to a switch-case like assembly as before.
> So I believe the reason of performance drops on x86 is because compiler
> cannot optimize the code as well as previously.
Thank you. This makes sense.

> I guess this is totally different reason from why your performance drop on
> non-TSO machine. On non-TSO machine, probably the excessive number of
> atomic load causes a lot of overhead.
> 
> A quick fix I found useful on x86 is to read all index together. I am no expert
> on the use of atomic intinsics, but I assume By adding a fence should still
> maintain the correct ordering?
> -       uint32_t key_idx;
> +       uint32_t key_idx[RTE_HASH_BUCKET_ENTRIES];
>         void *pdata;
>         struct rte_hash_key *k, *keys = h->key_store;
> 
> +       memcpy(key_idx, bkt->key_idx, 4 * RTE_HASH_BUCKET_ENTRIES);
> +       __atomic_thread_fence(__ATOMIC_ACQUIRE);
> +
>         for (i = 0; i < RTE_HASH_BUCKET_ENTRIES; i++) {
> -               key_idx = __atomic_load_n(&bkt->key_idx[i],
> -                                         __ATOMIC_ACQUIRE);
> -               if (bkt->sig_current[i] == sig && key_idx != EMPTY_SLOT) {
> +               if (bkt->sig_current[i] == sig && key_idx[i] !=
> + EMPTY_SLOT){
Thank you for your suggestion. I tried it on Arm platforms, unfortunately it did not help. However, the idea of reducing the number of memory orderings addresses the problem. I worked on a hacked patch for the last couple of days. I have tested it with L3FWD data set, it provides good benefits. I have sent it to you and Jerin. Any feedback will be helpful.

> 
> Yipeng

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-06  9:10         ` Jerin Jacob
  2018-11-06  9:13           ` Thomas Monjalon
@ 2018-11-09  1:34           ` Honnappa Nagarahalli
  2018-11-09  2:20             ` Honnappa Nagarahalli
  2018-11-09  9:28             ` Jerin Jacob
  1 sibling, 2 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-11-09  1:34 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: bruce.richardson, pablo.de.lara.guarch, dev, yipeng1.wang,
	Dharmik Thakkar, Gavin Hu (Arm Technology China),
	nd, thomas, ferruh.yigit, hemant.agrawal, chaozhu, Kapoor,
	Prasun, nd


> > Agree. There are multiple micro-architectures in Arm eco-system. We
> should establish few simple rules to make sure algorithms perform well on all
> the available platforms. I established few rules in VPP and they are working
> fine so far.
> 
> Can you share that rules for everyone's benefit?
> 
These are just few simple rules anyone can think of, but avoid the surprises.
We identified a owner for each platform (we have this already in DPDK, even across platforms)
Each patch submitted for Arm platforms is marked with -2 (VPP uses Gerrit)
Every platform owner tests on her/his platform. -2 will be removed only if it does not cause regression on any platforms. Platform owner helps out with optimization where required as they understand their micro-architecture best. I guess this is what is supposed to happen through the review process in DPDK. But making sure everyone tests it before it gets merged avoids the surprises.

> > >
> > > IMO, This scheme won't work. I think, we are introducing such
> > > performance critical feature, we need to put under function pointer
> > > scheme so that if an application does not need such feature it can use
> plain loads.
> > >
> > IMO, we should do some more debugging before going into exploring other
> options.
> 
> Yes. But, how do we align with v18.11 release?
> 
I think I have spent enough time optimizing the code. Please provide the feedback and I will work on completing the fix.

However, if the new patch is not satisfactory enough, we need another plan.

You had mentioned about using function pointers. I suggest, we use the function pointer only for lookup function. Otherwise, it will be too much of code duplication.
When lock-free is not used, the function with no memory-orderings will be called. However, I am not sure about the function pointer overhead. But this will be a simple change.

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-09  1:34           ` Honnappa Nagarahalli
@ 2018-11-09  2:20             ` Honnappa Nagarahalli
  2018-11-09  9:28             ` Jerin Jacob
  1 sibling, 0 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-11-09  2:20 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: bruce.richardson, pablo.de.lara.guarch, dev, yipeng1.wang,
	Dharmik Thakkar, Gavin Hu (Arm Technology China),
	nd, thomas, ferruh.yigit, hemant.agrawal, chaozhu, Kapoor,
	Prasun, nd, nd

> 
> > > Agree. There are multiple micro-architectures in Arm eco-system. We
> > should establish few simple rules to make sure algorithms perform well
> > on all the available platforms. I established few rules in VPP and
> > they are working fine so far.
> >
> > Can you share that rules for everyone's benefit?
> >
> These are just few simple rules anyone can think of, but avoid the surprises.
> We identified a owner for each platform (we have this already in DPDK, even
> across platforms) Each patch submitted for Arm platforms is marked with -2
> (VPP uses Gerrit) Every platform owner tests on her/his platform. -2 will be
> removed only if it does not cause regression on any platforms. Platform
> owner helps out with optimization where required as they understand their
> micro-architecture best. I guess this is what is supposed to happen through
> the review process in DPDK. But making sure everyone tests it before it gets
> merged avoids the surprises.
> 
> > > >
> > > > IMO, This scheme won't work. I think, we are introducing such
> > > > performance critical feature, we need to put under function
> > > > pointer scheme so that if an application does not need such
> > > > feature it can use
> > plain loads.
> > > >
> > > IMO, we should do some more debugging before going into exploring
> > > other
> > options.
> >
> > Yes. But, how do we align with v18.11 release?
> >
> I think I have spent enough time optimizing the code. Please provide the
> feedback and I will work on completing the fix.
> 
> However, if the new patch is not satisfactory enough, we need another plan.
> 
> You had mentioned about using function pointers. I suggest, we use the
> function pointer only for lookup function. Otherwise, it will be too much of
> code duplication.
> When lock-free is not used, the function with no memory-orderings will be
> called. However, I am not sure about the function pointer overhead. But this
> will be a simple change.
Yipeng/Bruce, would this work for you?

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-09  1:34           ` Honnappa Nagarahalli
  2018-11-09  2:20             ` Honnappa Nagarahalli
@ 2018-11-09  9:28             ` Jerin Jacob
  2018-11-09 15:37               ` Honnappa Nagarahalli
  1 sibling, 1 reply; 19+ messages in thread
From: Jerin Jacob @ 2018-11-09  9:28 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: bruce.richardson, pablo.de.lara.guarch, dev, yipeng1.wang,
	Dharmik Thakkar, Gavin Hu (Arm Technology China),
	nd, thomas, ferruh.yigit, hemant.agrawal, chaozhu, Kapoor,
	Prasun

-----Original Message-----
> Date: Fri, 9 Nov 2018 01:34:25 +0000
> From: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>
> To: Jerin Jacob <jerin.jacob@caviumnetworks.com>
> CC: "bruce.richardson@intel.com" <bruce.richardson@intel.com>,
>  "pablo.de.lara.guarch@intel.com" <pablo.de.lara.guarch@intel.com>,
>  "dev@dpdk.org" <dev@dpdk.org>, "yipeng1.wang@intel.com"
>  <yipeng1.wang@intel.com>, Dharmik Thakkar <Dharmik.Thakkar@arm.com>,
>  "Gavin Hu (Arm Technology China)" <Gavin.Hu@arm.com>, nd <nd@arm.com>,
>  "thomas@monjalon.net" <thomas@monjalon.net>, "ferruh.yigit@intel.com"
>  <ferruh.yigit@intel.com>, "hemant.agrawal@nxp.com"
>  <hemant.agrawal@nxp.com>, "chaozhu@linux.vnet.ibm.com"
>  <chaozhu@linux.vnet.ibm.com>, "Kapoor, Prasun" <Prasun.Kapoor@cavium.com>,
>  nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write
>  concurrency
> 
> > > Agree. There are multiple micro-architectures in Arm eco-system. We
> > should establish few simple rules to make sure algorithms perform well on all
> > the available platforms. I established few rules in VPP and they are working
> > fine so far.
> >
> > Can you share that rules for everyone's benefit?
> >
> These are just few simple rules anyone can think of, but avoid the surprises.
> We identified a owner for each platform (we have this already in DPDK, even across platforms)
> Each patch submitted for Arm platforms is marked with -2 (VPP uses Gerrit)
> Every platform owner tests on her/his platform. -2 will be removed only if it does not cause regression on any platforms. Platform owner helps out with optimization where required as they understand their micro-architecture best. I guess this is what is supposed to happen through the review process in DPDK. But making sure everyone tests it before it gets merged avoids the surprises.

I think, The very same philosophy can be implemented with exiting mailing list
method, if

1) Author Cc all the architecture maintainers and platform owners for
any fast path change where it can introduce performance regression.
2) Author can CC the same list to request for performance check
along with test command if area of performance regression known before. 

I agree with last minute surprises are bad for both Author and platform
owner. I think, it can fixed by above scheme.

> 
> > > >
> > > > IMO, This scheme won't work. I think, we are introducing such
> > > > performance critical feature, we need to put under function pointer
> > > > scheme so that if an application does not need such feature it can use
> > plain loads.
> > > >
> > > IMO, we should do some more debugging before going into exploring other
> > options.
> >
> > Yes. But, how do we align with v18.11 release?
> >
> I think I have spent enough time optimizing the code. Please provide the feedback and I will work on completing the fix.
> 
> However, if the new patch is not satisfactory enough, we need another plan.

Based on the public release meeting held yesterday, RC3 date is on next Monday.

I would suggest:
- Send your exiting tested patch in mailing list for review. In my
  setup, The regression reduced to 5.7% from 24%
- Extend the patch for hash bulk case as well and check NO_HASH_MULTI_LOOKUP
  as zero with  EM_HASH_LOOKUP_COUNT value as 16 or 8 for arm64.

> 
> You had mentioned about using function pointers. I suggest, we use the function pointer only for lookup function. Otherwise, it will be too much of code duplication.
> When lock-free is not used, the function with no memory-orderings will be called. However, I am not sure about the function pointer overhead. But this will be a simple change.

It may not be very simple change as we need to take care secondary
process case as well, see struct rte_mempool::ops_index scheme.

Since rte_hash_lookup() already NOT a inline function, so making
it as inline and calling a function pointer inside may not attract much
overhead. But we can tell only after testing(Which may be not possible for
RC3)

I think, in future it make sense to have function pointer scheme to
avoid new APIs for different hash library and we can plugin other
proven hash library like urcu based one etc to DPDK.

https://lwn.net/Articles/573431/

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency
  2018-11-09  9:28             ` Jerin Jacob
@ 2018-11-09 15:37               ` Honnappa Nagarahalli
  0 siblings, 0 replies; 19+ messages in thread
From: Honnappa Nagarahalli @ 2018-11-09 15:37 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: bruce.richardson, pablo.de.lara.guarch, dev, yipeng1.wang,
	Dharmik Thakkar, Gavin Hu (Arm Technology China),
	nd, thomas, ferruh.yigit, hemant.agrawal, chaozhu, Kapoor,
	Prasun, nd

> > > > Agree. There are multiple micro-architectures in Arm eco-system.
> > > > We
> > > should establish few simple rules to make sure algorithms perform
> > > well on all the available platforms. I established few rules in VPP
> > > and they are working fine so far.
> > >
> > > Can you share that rules for everyone's benefit?
> > >
> > These are just few simple rules anyone can think of, but avoid the surprises.
> > We identified a owner for each platform (we have this already in DPDK,
> > even across platforms) Each patch submitted for Arm platforms is
> > marked with -2 (VPP uses Gerrit) Every platform owner tests on her/his
> platform. -2 will be removed only if it does not cause regression on any
> platforms. Platform owner helps out with optimization where required as they
> understand their micro-architecture best. I guess this is what is supposed to
> happen through the review process in DPDK. But making sure everyone tests it
> before it gets merged avoids the surprises.
> 
> I think, The very same philosophy can be implemented with exiting mailing list
> method, if
> 
> 1) Author Cc all the architecture maintainers and platform owners for any fast
> path change where it can introduce performance regression.
> 2) Author can CC the same list to request for performance check along with
> test command if area of performance regression known before.
> 
> I agree with last minute surprises are bad for both Author and platform owner.
> I think, it can fixed by above scheme.
> 
Makes sense to me.

> >
> > > > >
> > > > > IMO, This scheme won't work. I think, we are introducing such
> > > > > performance critical feature, we need to put under function
> > > > > pointer scheme so that if an application does not need such
> > > > > feature it can use
> > > plain loads.
> > > > >
> > > > IMO, we should do some more debugging before going into exploring
> > > > other
> > > options.
> > >
> > > Yes. But, how do we align with v18.11 release?
> > >
> > I think I have spent enough time optimizing the code. Please provide the
> feedback and I will work on completing the fix.
> >
> > However, if the new patch is not satisfactory enough, we need another plan.
> 
> Based on the public release meeting held yesterday, RC3 date is on next
> Monday.
> 
> I would suggest:
> - Send your exiting tested patch in mailing list for review. In my
>   setup, The regression reduced to 5.7% from 24%
> - Extend the patch for hash bulk case as well and check
> NO_HASH_MULTI_LOOKUP
>   as zero with  EM_HASH_LOOKUP_COUNT value as 16 or 8 for arm64.
> 
Thank you Jerin for testing. Unfortunately, when I looked at making the corresponding changes for the hash_add API, they look more involved. They may not be acceptable for RC3. I also need more time to spend on bulk case. For RC3, we will split the lookup path for RW-lock and Lock-free. I will add the rest in a follow up patch.

> >
> > You had mentioned about using function pointers. I suggest, we use the
> function pointer only for lookup function. Otherwise, it will be too much of
> code duplication.
> > When lock-free is not used, the function with no memory-orderings will be
> called. However, I am not sure about the function pointer overhead. But this
> will be a simple change.
> 
> It may not be very simple change as we need to take care secondary process
> case as well, see struct rte_mempool::ops_index scheme.
> 
Yes, I realized it while making changes. I have used a if statement with an existing lock-free flag. Will send out the patch soon.

> Since rte_hash_lookup() already NOT a inline function, so making it as inline
> and calling a function pointer inside may not attract much overhead. But we
> can tell only after testing(Which may be not possible for
> RC3)
> 
> I think, in future it make sense to have function pointer scheme to avoid new
> APIs for different hash library and we can plugin other proven hash library like
> urcu based one etc to DPDK.
> 
> https://lwn.net/Articles/573431/

^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2018-11-09 15:37 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-10-26  5:37 [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Honnappa Nagarahalli
2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 1/5] hash: separate multi-writer from rw-concurrency Honnappa Nagarahalli
2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 2/5] hash: support do not free on delete Honnappa Nagarahalli
2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 3/5] hash: fix key store element alignment Honnappa Nagarahalli
2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 4/5] hash: add lock-free read-write concurrency Honnappa Nagarahalli
2018-11-03 11:52   ` Jerin Jacob
2018-11-03 15:40     ` Jerin Jacob
2018-11-06  6:07       ` Honnappa Nagarahalli
2018-11-06  9:10         ` Jerin Jacob
2018-11-06  9:13           ` Thomas Monjalon
2018-11-06  9:47             ` Jerin Jacob
2018-11-09  1:34           ` Honnappa Nagarahalli
2018-11-09  2:20             ` Honnappa Nagarahalli
2018-11-09  9:28             ` Jerin Jacob
2018-11-09 15:37               ` Honnappa Nagarahalli
2018-11-07  2:15         ` Wang, Yipeng1
2018-11-09  0:47           ` Honnappa Nagarahalli
2018-10-26  5:37 ` [dpdk-dev] [PATCH v7 5/5] test/hash: read-write lock-free concurrency test Honnappa Nagarahalli
2018-10-26 10:52 ` [dpdk-dev] [PATCH v7 0/5] Address reader-writer concurrency in rte_hash Thomas Monjalon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).