When testing with rte_hash library and TSX, a higher level of transaction aborts was observed in some cases, especially with newer compilers. These patches reduce aborts by ensuring that the compiler does not accidentally insert instructions that cause aborts, and by adding in delays on retry to avoid repeated collisions. Bruce Richardson (2): eal/x86: reduce contention when retrying TSX test/test: improve output for hash read-write test Yipeng Wang (2): hash: fix extra TSX collisions with newer gcc hash: add local cache for TSX region .../common/include/arch/x86/rte_spinlock.h | 19 ++++++++++++++++--- lib/librte_eal/linuxapp/eal/eal_alarm.c | 2 ++ lib/librte_hash/rte_cmp_x86.h | 2 ++ lib/librte_hash/rte_cuckoo_hash.c | 7 ++++--- test/test/test_hash_readwrite.c | 20 +++++++++++--------- 5 files changed, 35 insertions(+), 15 deletions(-) -- 1.8.5.6
From: Yipeng Wang <yipeng1.wang@intel.com> gcc 7 and 8 with O3 will generate vzeroupper from rte_memcpy into TSX region which may abort the TSX transaction. This fix changes rte_memcpy to memcpy which will not insert extra vzeroupper into the library. Fixes: f2e3001b53ec ("hash: support read/write concurrency") Cc: stable@dpdk.org Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> --- lib/librte_hash/rte_cmp_x86.h | 2 ++ lib/librte_hash/rte_cuckoo_hash.c | 3 +-- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/lib/librte_hash/rte_cmp_x86.h b/lib/librte_hash/rte_cmp_x86.h index e82b4c0..13a5836 100644 --- a/lib/librte_hash/rte_cmp_x86.h +++ b/lib/librte_hash/rte_cmp_x86.h @@ -2,6 +2,8 @@ * Copyright(c) 2015 Intel Corporation */ +#include <rte_vect.h> + /* Functions to compare multiple of 16 byte keys (up to 128 bytes) */ static int rte_hash_k16_cmp_eq(const void *key1, const void *key2, size_t key_len __rte_unused) diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c index 5ddcccd..76f5dc8 100644 --- a/lib/librte_hash/rte_cuckoo_hash.c +++ b/lib/librte_hash/rte_cuckoo_hash.c @@ -13,7 +13,6 @@ #include <rte_common.h> #include <rte_memory.h> /* for definition of RTE_CACHE_LINE_SIZE */ #include <rte_log.h> -#include <rte_memcpy.h> #include <rte_prefetch.h> #include <rte_branch_prediction.h> #include <rte_malloc.h> @@ -982,7 +981,7 @@ struct rte_hash * new_k = RTE_PTR_ADD(keys, (uintptr_t)slot_id * h->key_entry_size); new_idx = (uint32_t)((uintptr_t) slot_id); /* Copy key */ - rte_memcpy(new_k->key, key, h->key_len); + memcpy(new_k->key, key, h->key_len); /* Key can be of arbitrary length, so it is not possible to store * it atomically. Hence the new key element's memory stores * (key as well as data) should be complete before it is referenced. -- 1.8.5.6
From: Yipeng Wang <yipeng1.wang@intel.com> This patch adds back the local cache when TSX support is turned on. When TSX is turned on, free key-data slot ring would be contended by various TSX regions. The purpose of this commit is to reduce possible memory collisions during key insertion. Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com> --- lib/librte_hash/rte_cuckoo_hash.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c index 76f5dc8..5d5f9f1 100644 --- a/lib/librte_hash/rte_cuckoo_hash.c +++ b/lib/librte_hash/rte_cuckoo_hash.c @@ -180,8 +180,10 @@ struct rte_hash * } /* Check extra flags field to check extra options. */ - if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) + if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) { + use_local_cache = 1; hw_trans_mem_support = 1; + } if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD) { use_local_cache = 1; -- 1.8.5.6
When TSX transactions abort, it is generally worth retrying a number of times before falling back to the traditional locking path, as the parallelism benefits from TSX can be worth it when a transaction does succeed. For cases with multiple threads and high contention rates, it can be useful to have increasing delays between retry attempts, so as to avoid having the same threads repeatedly collide. Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> --- lib/librte_eal/common/include/arch/x86/rte_spinlock.h | 19 ++++++++++++++++--- lib/librte_eal/linuxapp/eal/eal_alarm.c | 2 ++ 2 files changed, 18 insertions(+), 3 deletions(-) diff --git a/lib/librte_eal/common/include/arch/x86/rte_spinlock.h b/lib/librte_eal/common/include/arch/x86/rte_spinlock.h index 60321da..961a3c0 100644 --- a/lib/librte_eal/common/include/arch/x86/rte_spinlock.h +++ b/lib/librte_eal/common/include/arch/x86/rte_spinlock.h @@ -15,8 +15,9 @@ #include "rte_branch_prediction.h" #include "rte_common.h" #include "rte_pause.h" +#include "rte_cycles.h" -#define RTE_RTM_MAX_RETRIES (10) +#define RTE_RTM_MAX_RETRIES (20) #define RTE_XABORT_LOCK_BUSY (0xff) #ifndef RTE_FORCE_INTRINSICS @@ -87,7 +88,7 @@ static inline int rte_tm_supported(void) unsigned int status = rte_xbegin(); - if (likely(RTE_XBEGIN_STARTED == status)) { + if (RTE_XBEGIN_STARTED == status) { if (unlikely(*lock)) rte_xabort(RTE_XABORT_LOCK_BUSY); else @@ -97,8 +98,20 @@ static inline int rte_tm_supported(void) rte_pause(); if ((status & RTE_XABORT_EXPLICIT) && - (RTE_XABORT_CODE(status) == RTE_XABORT_LOCK_BUSY)) + (RTE_XABORT_CODE(status) == RTE_XABORT_LOCK_BUSY)) { + /* add a small delay before retrying, basing the + * delay on the number of times we've already tried, + * to give a back-off type of behaviour. We + * randomize trycount by taking bits from the tsc count + */ + int try_count = RTE_RTM_MAX_RETRIES - retries; + int pause_count = (rte_rdtsc() & 0x7) | 1; + pause_count <<= try_count; + int i; + for (i = 0; i < pause_count; i++) + rte_pause(); continue; + } if ((status & RTE_XABORT_RETRY) == 0) /* do not retry */ break; diff --git a/lib/librte_eal/linuxapp/eal/eal_alarm.c b/lib/librte_eal/linuxapp/eal/eal_alarm.c index 391d2a6..840ede7 100644 --- a/lib/librte_eal/linuxapp/eal/eal_alarm.c +++ b/lib/librte_eal/linuxapp/eal/eal_alarm.c @@ -30,7 +30,9 @@ #define NS_PER_US 1000 #define US_PER_MS 1000 #define MS_PER_S 1000 +#ifndef US_PER_S #define US_PER_S (US_PER_MS * MS_PER_S) +#endif #ifdef CLOCK_MONOTONIC_RAW /* Defined in glibc bits/time.h */ #define CLOCK_TYPE_ID CLOCK_MONOTONIC_RAW -- 1.8.5.6
The hash read-write autotest generates a lot of text, which is very dense on the screen. Even the summary at the end is hard to follow as everything is very compact. We can improve readability by highlighting the starts of the various sections, and by indenting the values within subsections. Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> --- test/test/test_hash_readwrite.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/test/test/test_hash_readwrite.c b/test/test/test_hash_readwrite.c index 01f986c..6b695ce 100644 --- a/test/test/test_hash_readwrite.c +++ b/test/test/test_hash_readwrite.c @@ -678,24 +678,26 @@ struct { reader_faster) < 0) return -1; + printf("================\n"); printf("Results summary:\n"); + printf("================\n"); printf("single read: %u\n", htm_results.single_read); printf("single write: %u\n", htm_results.single_write); for (i = 0; i < NUM_TEST; i++) { - printf("core_cnt: %u\n", core_cnt[i]); + printf("+++ core_cnt: %u +++\n", core_cnt[i]); printf("HTM:\n"); - printf("read only: %u\n", htm_results.read_only[i]); - printf("write only: %u\n", htm_results.write_only[i]); - printf("read-write read: %u\n", htm_results.read_write_r[i]); - printf("read-write write: %u\n", htm_results.read_write_w[i]); + printf(" read only: %u\n", htm_results.read_only[i]); + printf(" write only: %u\n", htm_results.write_only[i]); + printf(" read-write read: %u\n", htm_results.read_write_r[i]); + printf(" read-write write: %u\n", htm_results.read_write_w[i]); printf("non HTM:\n"); - printf("read only: %u\n", non_htm_results.read_only[i]); - printf("write only: %u\n", non_htm_results.write_only[i]); - printf("read-write read: %u\n", + printf(" read only: %u\n", non_htm_results.read_only[i]); + printf(" write only: %u\n", non_htm_results.write_only[i]); + printf(" read-write read: %u\n", non_htm_results.read_write_r[i]); - printf("read-write write: %u\n", + printf(" read-write write: %u\n", non_htm_results.read_write_w[i]); } -- 1.8.5.6
On Mon, Nov 05, 2018 at 05:39:11PM +0000, Bruce Richardson wrote:
> From: Yipeng Wang <yipeng1.wang@intel.com>
>
> This patch adds back the local cache when TSX support is turned on.
>
> When TSX is turned on, free key-data slot ring would be contended by
> various TSX regions. The purpose of this commit is to reduce possible
> memory collisions during key insertion.
>
> Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com>
>
> ---
Acked-by: Bruce Richardson <bruce.richardson@intel.com>
When testing with rte_hash library and TSX, a higher level of transaction aborts was observed in some cases, especially with newer compilers. These patches reduce aborts by ensuring that the compiler does not accidentally insert instructions that cause aborts, and by adding in delays on retry to avoid repeated collisions. Bruce Richardson (2): eal/x86: reduce contention when retrying TSX test/test: improve output for hash read-write test Yipeng Wang (2): hash: fix TSX aborts with newer gcc hash: add local cache for TSX region .../common/include/arch/x86/rte_spinlock.h | 21 +++++++++++++++++---- lib/librte_eal/linuxapp/eal/eal_alarm.c | 2 ++ lib/librte_hash/rte_cmp_x86.h | 2 ++ lib/librte_hash/rte_cuckoo_hash.c | 7 ++++--- test/test/test_hash_readwrite.c | 20 +++++++++++--------- 5 files changed, 36 insertions(+), 16 deletions(-) -- 1.8.5.6
From: Yipeng Wang <yipeng1.wang@intel.com> gcc 7 and 8 with O3 will generate vzeroupper from rte_memcpy into TSX region which may abort the TSX transaction. This fix changes rte_memcpy to memcpy which will not insert extra vzeroupper into the library. Fixes: f2e3001b53ec ("hash: support read/write concurrency") Cc: stable@dpdk.org Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com> Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> --- lib/librte_hash/rte_cmp_x86.h | 2 ++ lib/librte_hash/rte_cuckoo_hash.c | 3 +-- 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/lib/librte_hash/rte_cmp_x86.h b/lib/librte_hash/rte_cmp_x86.h index e82b4c0..13a5836 100644 --- a/lib/librte_hash/rte_cmp_x86.h +++ b/lib/librte_hash/rte_cmp_x86.h @@ -2,6 +2,8 @@ * Copyright(c) 2015 Intel Corporation */ +#include <rte_vect.h> + /* Functions to compare multiple of 16 byte keys (up to 128 bytes) */ static int rte_hash_k16_cmp_eq(const void *key1, const void *key2, size_t key_len __rte_unused) diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c index 5ddcccd..76f5dc8 100644 --- a/lib/librte_hash/rte_cuckoo_hash.c +++ b/lib/librte_hash/rte_cuckoo_hash.c @@ -13,7 +13,6 @@ #include <rte_common.h> #include <rte_memory.h> /* for definition of RTE_CACHE_LINE_SIZE */ #include <rte_log.h> -#include <rte_memcpy.h> #include <rte_prefetch.h> #include <rte_branch_prediction.h> #include <rte_malloc.h> @@ -982,7 +981,7 @@ struct rte_hash * new_k = RTE_PTR_ADD(keys, (uintptr_t)slot_id * h->key_entry_size); new_idx = (uint32_t)((uintptr_t) slot_id); /* Copy key */ - rte_memcpy(new_k->key, key, h->key_len); + memcpy(new_k->key, key, h->key_len); /* Key can be of arbitrary length, so it is not possible to store * it atomically. Hence the new key element's memory stores * (key as well as data) should be complete before it is referenced. -- 1.8.5.6
From: Yipeng Wang <yipeng1.wang@intel.com> This patch adds back the local cache when TSX support is turned on. When TSX is turned on, free key-data slot ring would be contended by various TSX regions. The purpose of this commit is to reduce possible memory collisions during key insertion. Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com> Acked-by: Bruce Richardson <bruce.richardson@intel.com> --- lib/librte_hash/rte_cuckoo_hash.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c index 76f5dc8..5d5f9f1 100644 --- a/lib/librte_hash/rte_cuckoo_hash.c +++ b/lib/librte_hash/rte_cuckoo_hash.c @@ -180,8 +180,10 @@ struct rte_hash * } /* Check extra flags field to check extra options. */ - if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) + if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) { + use_local_cache = 1; hw_trans_mem_support = 1; + } if (params->extra_flag & RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD) { use_local_cache = 1; -- 1.8.5.6
When TSX transactions abort, it is generally worth retrying a number of times before falling back to the traditional locking path, as the parallelism benefits from TSX can be worth it when a transaction does succeed. For cases with multiple threads and high contention rates, it can be useful to have increasing delays between retry attempts, so as to avoid having the same threads repeatedly collided. Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> --- V2: Have retry with backoff for all cases of memory conflicts, not just those where we explicitly abort due to the lock being held. --- .../common/include/arch/x86/rte_spinlock.h | 21 +++++++++++++++++---- lib/librte_eal/linuxapp/eal/eal_alarm.c | 2 ++ 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/lib/librte_eal/common/include/arch/x86/rte_spinlock.h b/lib/librte_eal/common/include/arch/x86/rte_spinlock.h index 60321da..e2e2b26 100644 --- a/lib/librte_eal/common/include/arch/x86/rte_spinlock.h +++ b/lib/librte_eal/common/include/arch/x86/rte_spinlock.h @@ -15,8 +15,9 @@ #include "rte_branch_prediction.h" #include "rte_common.h" #include "rte_pause.h" +#include "rte_cycles.h" -#define RTE_RTM_MAX_RETRIES (10) +#define RTE_RTM_MAX_RETRIES (20) #define RTE_XABORT_LOCK_BUSY (0xff) #ifndef RTE_FORCE_INTRINSICS @@ -76,7 +77,7 @@ static inline int rte_tm_supported(void) static inline int rte_try_tm(volatile int *lock) { - int retries; + int i, retries; if (!rte_rtm_supported) return 0; @@ -96,9 +97,21 @@ static inline int rte_tm_supported(void) while (*lock) rte_pause(); - if ((status & RTE_XABORT_EXPLICIT) && - (RTE_XABORT_CODE(status) == RTE_XABORT_LOCK_BUSY)) + if ((status & RTE_XABORT_CONFLICT) || + ((status & RTE_XABORT_EXPLICIT) && + (RTE_XABORT_CODE(status) == RTE_XABORT_LOCK_BUSY))) { + /* add a small delay before retrying, basing the + * delay on the number of times we've already tried, + * to give a back-off type of behaviour. We + * randomize trycount by taking bits from the tsc count + */ + int try_count = RTE_RTM_MAX_RETRIES - retries; + int pause_count = (rte_rdtsc() & 0x7) | 1; + pause_count <<= try_count; + for (i = 0; i < pause_count; i++) + rte_pause(); continue; + } if ((status & RTE_XABORT_RETRY) == 0) /* do not retry */ break; diff --git a/lib/librte_eal/linuxapp/eal/eal_alarm.c b/lib/librte_eal/linuxapp/eal/eal_alarm.c index 391d2a6..840ede7 100644 --- a/lib/librte_eal/linuxapp/eal/eal_alarm.c +++ b/lib/librte_eal/linuxapp/eal/eal_alarm.c @@ -30,7 +30,9 @@ #define NS_PER_US 1000 #define US_PER_MS 1000 #define MS_PER_S 1000 +#ifndef US_PER_S #define US_PER_S (US_PER_MS * MS_PER_S) +#endif #ifdef CLOCK_MONOTONIC_RAW /* Defined in glibc bits/time.h */ #define CLOCK_TYPE_ID CLOCK_MONOTONIC_RAW -- 1.8.5.6
The hash read-write autotest generates a lot of text, which is very dense on the screen. Even the summary at the end is hard to follow as everything is very compact. We can improve readability by highlighting the starts of the various sections, and by indenting the values within subsections. Signed-off-by: Bruce Richardson <bruce.richardson@intel.com> --- test/test/test_hash_readwrite.c | 20 +++++++++++--------- 1 file changed, 11 insertions(+), 9 deletions(-) diff --git a/test/test/test_hash_readwrite.c b/test/test/test_hash_readwrite.c index 01f986c..6b695ce 100644 --- a/test/test/test_hash_readwrite.c +++ b/test/test/test_hash_readwrite.c @@ -678,24 +678,26 @@ struct { reader_faster) < 0) return -1; + printf("================\n"); printf("Results summary:\n"); + printf("================\n"); printf("single read: %u\n", htm_results.single_read); printf("single write: %u\n", htm_results.single_write); for (i = 0; i < NUM_TEST; i++) { - printf("core_cnt: %u\n", core_cnt[i]); + printf("+++ core_cnt: %u +++\n", core_cnt[i]); printf("HTM:\n"); - printf("read only: %u\n", htm_results.read_only[i]); - printf("write only: %u\n", htm_results.write_only[i]); - printf("read-write read: %u\n", htm_results.read_write_r[i]); - printf("read-write write: %u\n", htm_results.read_write_w[i]); + printf(" read only: %u\n", htm_results.read_only[i]); + printf(" write only: %u\n", htm_results.write_only[i]); + printf(" read-write read: %u\n", htm_results.read_write_r[i]); + printf(" read-write write: %u\n", htm_results.read_write_w[i]); printf("non HTM:\n"); - printf("read only: %u\n", non_htm_results.read_only[i]); - printf("write only: %u\n", non_htm_results.write_only[i]); - printf("read-write read: %u\n", + printf(" read only: %u\n", non_htm_results.read_only[i]); + printf(" write only: %u\n", non_htm_results.write_only[i]); + printf(" read-write read: %u\n", non_htm_results.read_write_r[i]); - printf("read-write write: %u\n", + printf(" read-write write: %u\n", non_htm_results.read_write_w[i]); } -- 1.8.5.6
> > From: Yipeng Wang <yipeng1.wang@intel.com> > > This patch adds back the local cache when TSX support is turned on. > > When TSX is turned on, free key-data slot ring would be contended by various > TSX regions. The purpose of this commit is to reduce possible memory > collisions during key insertion. > > Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com> > Acked-by: Bruce Richardson <bruce.richardson@intel.com> > --- > lib/librte_hash/rte_cuckoo_hash.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/lib/librte_hash/rte_cuckoo_hash.c > b/lib/librte_hash/rte_cuckoo_hash.c > index 76f5dc8..5d5f9f1 100644 > --- a/lib/librte_hash/rte_cuckoo_hash.c > +++ b/lib/librte_hash/rte_cuckoo_hash.c > @@ -180,8 +180,10 @@ struct rte_hash * > } > > /* Check extra flags field to check extra options. */ > - if (params->extra_flag & > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) > + if (params->extra_flag & > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) { > + use_local_cache = 1; Do you see the issue even in the case of single writer? Enabling this flag creates local caches on all the data plane cores. This increases the memory usage for the single writer use case. Then there is 'writers on the control plane' use case, the requirement on hash_add rate is comparatively lower when compared to 'writers on the data plane'. The writers also are not pinned to any core as well. In this use case, I am not sure how much having a local cache matters. Enabling this flag effectively changes the free slot allocation from a ring to a stack data structure. Does it indicate that for single writer use case with TSX, the free slot (global) data structure should be a stack (rather than a ring)? > hw_trans_mem_support = 1; > + } > > if (params->extra_flag & > RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD) { > use_local_cache = 1; > -- > 1.8.5.6
12/11/2018 19:34, Honnappa Nagarahalli:
> >
> > From: Yipeng Wang <yipeng1.wang@intel.com>
> >
> > This patch adds back the local cache when TSX support is turned on.
> >
> > When TSX is turned on, free key-data slot ring would be contended by various
> > TSX regions. The purpose of this commit is to reduce possible memory
> > collisions during key insertion.
> >
> > Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com>
> > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > ---
> > lib/librte_hash/rte_cuckoo_hash.c | 4 +++-
> > 1 file changed, 3 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/librte_hash/rte_cuckoo_hash.c
> > b/lib/librte_hash/rte_cuckoo_hash.c
> > index 76f5dc8..5d5f9f1 100644
> > --- a/lib/librte_hash/rte_cuckoo_hash.c
> > +++ b/lib/librte_hash/rte_cuckoo_hash.c
> > @@ -180,8 +180,10 @@ struct rte_hash *
> > }
> >
> > /* Check extra flags field to check extra options. */
> > - if (params->extra_flag &
> > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT)
> > + if (params->extra_flag &
> > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) {
> > + use_local_cache = 1;
> Do you see the issue even in the case of single writer? Enabling this flag creates local caches on all the data plane cores. This increases the memory usage for the single writer use case. Then there is 'writers on the control plane' use case, the requirement on hash_add rate is comparatively lower when compared to 'writers on the data plane'. The writers also are not pinned to any core as well. In this use case, I am not sure how much having a local cache matters.
>
> Enabling this flag effectively changes the free slot allocation from a ring to a stack data structure. Does it indicate that for single writer use case with TSX, the free slot (global) data structure should be a stack (rather than a ring)?
Is it blocking this patchset from entering in 18.11?
If I understand well, there are some fixes for 18.11.
> > >
> > > From: Yipeng Wang <yipeng1.wang@intel.com>
> > >
> > > This patch adds back the local cache when TSX support is turned on.
> > >
> > > When TSX is turned on, free key-data slot ring would be contended by
> > > various TSX regions. The purpose of this commit is to reduce
> > > possible memory collisions during key insertion.
> > >
> > > Signed-off-by: Yipeng Wang <yipeng1.wang@intel.com>
> > > Acked-by: Bruce Richardson <bruce.richardson@intel.com>
> > > ---
> > > lib/librte_hash/rte_cuckoo_hash.c | 4 +++-
> > > 1 file changed, 3 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/lib/librte_hash/rte_cuckoo_hash.c
> > > b/lib/librte_hash/rte_cuckoo_hash.c
> > > index 76f5dc8..5d5f9f1 100644
> > > --- a/lib/librte_hash/rte_cuckoo_hash.c
> > > +++ b/lib/librte_hash/rte_cuckoo_hash.c
> > > @@ -180,8 +180,10 @@ struct rte_hash *
> > > }
> > >
> > > /* Check extra flags field to check extra options. */
> > > - if (params->extra_flag &
> > > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT)
> > > + if (params->extra_flag &
> > > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) {
> > > + use_local_cache = 1;
> > Do you see the issue even in the case of single writer? Enabling this flag
> creates local caches on all the data plane cores. This increases the memory
> usage for the single writer use case. Then there is 'writers on the control
> plane' use case, the requirement on hash_add rate is comparatively lower
> when compared to 'writers on the data plane'. The writers also are not pinned
> to any core as well. In this use case, I am not sure how much having a local
> cache matters.
> >
> > Enabling this flag effectively changes the free slot allocation from a ring to a
> stack data structure. Does it indicate that for single writer use case with TSX,
> the free slot (global) data structure should be a stack (rather than a ring)?
>
> Is it blocking this patchset from entering in 18.11?
> If I understand well, there are some fixes for 18.11.
>
I am fine with the other fixes in this patchset
>-----Original Message----- >From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com] >Sent: Tuesday, November 13, 2018 9:17 AM >To: Thomas Monjalon <thomas@monjalon.net>; Richardson, Bruce <bruce.richardson@intel.com>; Wang, Yipeng1 ><yipeng1.wang@intel.com> >Cc: stable@dpdk.org; dev@dpdk.org; nd <nd@arm.com>; nd <nd@arm.com> >Subject: RE: [dpdk-stable] [dpdk-dev] [PATCH v2 2/4] hash: add local cache for TSX region >> > > /* Check extra flags field to check extra options. */ >> > > - if (params->extra_flag & >> > > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) >> > > + if (params->extra_flag & >> > > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) { >> > > + use_local_cache = 1; >> > Do you see the issue even in the case of single writer? Enabling this flag >> creates local caches on all the data plane cores. This increases the memory >> usage for the single writer use case. Then there is 'writers on the control >> plane' use case, the requirement on hash_add rate is comparatively lower >> when compared to 'writers on the data plane'. The writers also are not pinned >> to any core as well. In this use case, I am not sure how much having a local >> cache matters. >> > >> > Enabling this flag effectively changes the free slot allocation from a ring to a >> stack data structure. Does it indicate that for single writer use case with TSX, >> the free slot (global) data structure should be a stack (rather than a ring)? >> >> Is it blocking this patchset from entering in 18.11? >> If I understand well, there are some fixes for 18.11. [Wang, Yipeng] Hi Thomas, please go ahead merge the other commits without this one since Honnapa's concern. I will talk with Honnappa separately on a better way to do this. Thanks! >> >I am fine with the other fixes in this patchset
On Tue, Nov 13, 2018 at 05:24:55PM +0000, Wang, Yipeng1 wrote:
> >-----Original Message-----
> >From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> >Sent: Tuesday, November 13, 2018 9:17 AM
> >To: Thomas Monjalon <thomas@monjalon.net>; Richardson, Bruce <bruce.richardson@intel.com>; Wang, Yipeng1
> ><yipeng1.wang@intel.com>
> >Cc: stable@dpdk.org; dev@dpdk.org; nd <nd@arm.com>; nd <nd@arm.com>
> >Subject: RE: [dpdk-stable] [dpdk-dev] [PATCH v2 2/4] hash: add local cache for TSX region
> >> > > /* Check extra flags field to check extra options. */
> >> > > - if (params->extra_flag &
> >> > > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT)
> >> > > + if (params->extra_flag &
> >> > > RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT) {
> >> > > + use_local_cache = 1;
> >> > Do you see the issue even in the case of single writer? Enabling this flag
> >> creates local caches on all the data plane cores. This increases the memory
> >> usage for the single writer use case. Then there is 'writers on the control
> >> plane' use case, the requirement on hash_add rate is comparatively lower
> >> when compared to 'writers on the data plane'. The writers also are not pinned
> >> to any core as well. In this use case, I am not sure how much having a local
> >> cache matters.
> >> >
> >> > Enabling this flag effectively changes the free slot allocation from a ring to a
> >> stack data structure. Does it indicate that for single writer use case with TSX,
> >> the free slot (global) data structure should be a stack (rather than a ring)?
> >>
> >> Is it blocking this patchset from entering in 18.11?
> >> If I understand well, there are some fixes for 18.11.
>
> [Wang, Yipeng] Hi Thomas, please go ahead merge the other commits without this one since Honnapa's concern.
> I will talk with Honnappa separately on a better way to do this.
>
> Thanks!
No objections to that plan here.
/Bruce
12/11/2018 11:47, Bruce Richardson:
> When testing with rte_hash library and TSX, a higher level of transaction
> aborts was observed in some cases, especially with newer compilers. These
> patches reduce aborts by ensuring that the compiler does not accidentally
> insert instructions that cause aborts, and by adding in delays on retry to
> avoid repeated collisions.
>
> Bruce Richardson (2):
> eal/x86: reduce contention when retrying TSX
> test/test: improve output for hash read-write test
>
> Yipeng Wang (2):
> hash: fix TSX aborts with newer gcc
> hash: add local cache for TSX region
Series applied without patch 2 "hash: add local cache for TSX region".