DPDK patches and discussions
* [dpdk-dev] [PATCH 0/3] Add non-blocking stack mempool handler
@ 2019-01-10 20:55 Gage Eads
  2019-01-10 20:55 ` [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only) Gage Eads
                   ` (3 more replies)
  0 siblings, 4 replies; 43+ messages in thread
From: Gage Eads @ 2019-01-10 20:55 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, an application may use a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking stack mempool handler. Note that the
non-blocking algorithm relies on a 128-bit compare-and-swap, so it is limited
to x86_64 machines.

In mempool_perf_autotest the lock-based stack outperforms the non-blocking
handler*; however:
- For applications with preemptible pthreads, a lock-based stack's
  worst-case performance (i.e. one thread being preempted while
  holding the spinlock) is much worse than the non-blocking stack's.
- Using per-thread mempool caches will largely mitigate the performance
  difference.

*Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. The lock-based stack's
rate_persec was 1x-3.5x the non-blocking stack's.

Gage Eads (3):
  eal: add 128-bit cmpset (x86-64 only)
  mempool/nb_stack: add non-blocking stack mempool
  doc: add NB stack comment to EAL "known issues"

 MAINTAINERS                                        |   4 +
 config/common_base                                 |   1 +
 doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
 drivers/mempool/Makefile                           |   1 +
 drivers/mempool/nb_stack/Makefile                  |  30 +++++
 drivers/mempool/nb_stack/meson.build               |   4 +
 drivers/mempool/nb_stack/nb_lifo.h                 | 132 +++++++++++++++++++++
 drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 +++++++++++++++++++
 .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
 .../common/include/arch/x86/rte_atomic_64.h        |  22 ++++
 mk/rte.app.mk                                      |   1 +
 11 files changed, 329 insertions(+)
 create mode 100644 drivers/mempool/nb_stack/Makefile
 create mode 100644 drivers/mempool/nb_stack/meson.build
 create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map

-- 
2.13.6


* [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only)
  2019-01-10 20:55 [dpdk-dev] [PATCH 0/3] Add non-blocking stack mempool handler Gage Eads
@ 2019-01-10 20:55 ` Gage Eads
  2019-01-13 12:18   ` Andrew Rybchenko
  2019-01-10 20:55 ` [dpdk-dev] [PATCH 2/3] mempool/nb_stack: add non-blocking stack mempool Gage Eads
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 43+ messages in thread
From: Gage Eads @ 2019-01-10 20:55 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

This operation can be used for non-blocking algorithms, such as a
non-blocking stack or ring.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 .../common/include/arch/x86/rte_atomic_64.h        | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
index fd2ec9c53..34c2addf8 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
@@ -34,6 +34,7 @@
 /*
  * Inspired from FreeBSD src/sys/amd64/include/atomic.h
  * Copyright (c) 1998 Doug Rabson
+ * Copyright (c) 2019 Intel Corporation
  * All rights reserved.
  */
 
@@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
 }
 #endif
 
+static inline int
+rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
+{
+	uint8_t res;
+
+	asm volatile (
+		      MPLOCKED
+		      "cmpxchg16b %[dst];"
+		      " sete %[res]"
+		      : [dst] "=m" (*dst),
+			[res] "=r" (res)
+		      : "c" (src[1]),
+			"b" (src[0]),
+			"m" (*dst),
+			"d" (exp[1]),
+			"a" (exp[0])
+		      : "memory");
+
+	return res;
+}
+
 #endif /* _RTE_ATOMIC_X86_64_H_ */
-- 
2.13.6


* [dpdk-dev] [PATCH 2/3] mempool/nb_stack: add non-blocking stack mempool
  2019-01-10 20:55 [dpdk-dev] [PATCH 0/3] Add non-blocking stack mempool handler Gage Eads
  2019-01-10 20:55 ` [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only) Gage Eads
@ 2019-01-10 20:55 ` Gage Eads
  2019-01-13 13:31   ` Andrew Rybchenko
  2019-01-10 20:55 ` [dpdk-dev] [PATCH 3/3] doc: add NB stack comment to EAL "known issues" Gage Eads
  2019-01-15 22:32 ` [dpdk-dev] [PATCH v2 0/2] Add non-blocking stack mempool handler Gage Eads
  3 siblings, 1 reply; 43+ messages in thread
From: Gage Eads @ 2019-01-10 20:55 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

This commit adds support for a non-blocking (linked-list-based) stack mempool
handler. The stack uses a 128-bit compare-and-swap instruction, and thus is
limited to x86_64. The 128-bit CAS atomically updates the stack top pointer
and a modification counter, which protects against the ABA problem.

In mempool_perf_autotest the lock-based stack outperforms the non-blocking
handler*; however:
- For applications with preemptible pthreads, a lock-based stack's
  worst-case performance (i.e. one thread being preempted while
  holding the spinlock) is much worse than the non-blocking stack's.
- Using per-thread mempool caches will largely mitigate the performance
  difference.

*Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. The lock-based stack's
rate_persec was 1x-3.5x the non-blocking stack's.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 MAINTAINERS                                        |   4 +
 config/common_base                                 |   1 +
 drivers/mempool/Makefile                           |   1 +
 drivers/mempool/nb_stack/Makefile                  |  30 +++++
 drivers/mempool/nb_stack/meson.build               |   4 +
 drivers/mempool/nb_stack/nb_lifo.h                 | 132 +++++++++++++++++++++
 drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 +++++++++++++++++++
 .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
 mk/rte.app.mk                                      |   1 +
 9 files changed, 302 insertions(+)
 create mode 100644 drivers/mempool/nb_stack/Makefile
 create mode 100644 drivers/mempool/nb_stack/meson.build
 create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 470f36b9c..5519d3323 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
 M: Andrew Rybchenko <arybchenko@solarflare.com>
 F: drivers/mempool/bucket/
 
+Non-blocking stack memory pool
+M: Gage Eads <gage.eads@intel.com>
+F: drivers/mempool/nb_stack/
+
 
 Bus Drivers
 -----------
diff --git a/config/common_base b/config/common_base
index 964a6956e..40ce47312 100644
--- a/config/common_base
+++ b/config/common_base
@@ -728,6 +728,7 @@ CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
 CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
 CONFIG_RTE_DRIVER_MEMPOOL_RING=y
 CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
+CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
 
 #
 # Compile PMD for octeontx fpa mempool device
diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
index 28c2e8360..aeae3ac12 100644
--- a/drivers/mempool/Makefile
+++ b/drivers/mempool/Makefile
@@ -13,5 +13,6 @@ endif
 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
 DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
+DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
 
 include $(RTE_SDK)/mk/rte.subdir.mk
diff --git a/drivers/mempool/nb_stack/Makefile b/drivers/mempool/nb_stack/Makefile
new file mode 100644
index 000000000..38b45f4f5
--- /dev/null
+++ b/drivers/mempool/nb_stack/Makefile
@@ -0,0 +1,30 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+# The non-blocking stack uses a 128-bit compare-and-swap instruction, and thus
+# is limited to x86_64.
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+
+#
+# library name
+#
+LIB = librte_mempool_nb_stack.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+# Headers
+CFLAGS += -I$(RTE_SDK)/lib/librte_mempool
+LDLIBS += -lrte_eal -lrte_mempool
+
+EXPORT_MAP := rte_mempool_nb_stack_version.map
+
+LIBABIVER := 1
+
+SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
+
+endif
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
new file mode 100644
index 000000000..66d64a9ba
--- /dev/null
+++ b/drivers/mempool/nb_stack/meson.build
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+sources = files('rte_mempool_nb_stack.c')
diff --git a/drivers/mempool/nb_stack/nb_lifo.h b/drivers/mempool/nb_stack/nb_lifo.h
new file mode 100644
index 000000000..701d75e37
--- /dev/null
+++ b/drivers/mempool/nb_stack/nb_lifo.h
@@ -0,0 +1,132 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _NB_LIFO_H_
+#define _NB_LIFO_H_
+
+struct nb_lifo_elem {
+	void *data;
+	struct nb_lifo_elem *next;
+};
+
+struct nb_lifo_head {
+	struct nb_lifo_elem *top; /**< Stack top */
+	uint64_t cnt; /**< Modification counter */
+};
+
+struct nb_lifo {
+	volatile struct nb_lifo_head head __rte_aligned(16);
+	rte_atomic64_t len;
+} __rte_cache_aligned;
+
+static __rte_always_inline void
+nb_lifo_init(struct nb_lifo *lifo)
+{
+	memset(lifo, 0, sizeof(*lifo));
+	rte_atomic64_set(&lifo->len, 0);
+}
+
+static __rte_always_inline unsigned int
+nb_lifo_len(struct nb_lifo *lifo)
+{
+	return (unsigned int) rte_atomic64_read(&lifo->len);
+}
+
+static __rte_always_inline void
+nb_lifo_push(struct nb_lifo *lifo,
+	     struct nb_lifo_elem *first,
+	     struct nb_lifo_elem *last,
+	     unsigned int num)
+{
+	while (1) {
+		struct nb_lifo_head old_head, new_head;
+
+		old_head = lifo->head;
+
+		/* Swing the top pointer to the first element in the list and
+		 * make the last element point to the old top.
+		 */
+		new_head.top = first;
+		new_head.cnt = old_head.cnt + 1;
+
+		last->next = old_head.top;
+
+		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
+					 (uint64_t *)&old_head,
+					 (uint64_t *)&new_head))
+			break;
+	}
+
+	rte_atomic64_add(&lifo->len, num);
+}
+
+static __rte_always_inline void
+nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
+{
+	nb_lifo_push(lifo, elem, elem, 1);
+}
+
+static __rte_always_inline struct nb_lifo_elem *
+nb_lifo_pop(struct nb_lifo *lifo,
+	    unsigned int num,
+	    void **obj_table,
+	    struct nb_lifo_elem **last)
+{
+	struct nb_lifo_head old_head;
+
+	/* Reserve num elements, if available */
+	while (1) {
+		uint64_t len = rte_atomic64_read(&lifo->len);
+
+		/* Does the list contain enough elements? */
+		if (len < num)
+			return NULL;
+
+		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
+					len, len - num))
+			break;
+	}
+
+	/* Pop num elements */
+	while (1) {
+		struct nb_lifo_head new_head;
+		struct nb_lifo_elem *tmp;
+		unsigned int i;
+
+		old_head = lifo->head;
+
+		tmp = old_head.top;
+
+		/* Traverse the list to find the new head. A next pointer will
+		 * either point to another element or NULL; if a thread
+		 * encounters a pointer that has already been popped, the CAS
+		 * will fail.
+		 */
+		for (i = 0; i < num && tmp != NULL; i++) {
+			if (obj_table)
+				obj_table[i] = tmp->data;
+			if (last)
+				*last = tmp;
+			tmp = tmp->next;
+		}
+
+		/* If NULL was encountered, the list was modified while
+		 * traversing it. Retry.
+		 */
+		if (i != num)
+			continue;
+
+		new_head.top = tmp;
+		new_head.cnt = old_head.cnt + 1;
+
+		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
+					 (uint64_t *)&old_head,
+					 (uint64_t *)&new_head))
+			break;
+	}
+
+	return old_head.top;
+}
+
+#endif /* _NB_LIFO_H_ */
diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
new file mode 100644
index 000000000..1b30775f7
--- /dev/null
+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <stdio.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+
+#include "nb_lifo.h"
+
+struct rte_mempool_nb_stack {
+	uint64_t size;
+	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers  */
+	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
+};
+
+static int
+nb_stack_alloc(struct rte_mempool *mp)
+{
+	struct rte_mempool_nb_stack *s;
+	struct nb_lifo_elem *elems;
+	unsigned int n = mp->size;
+	unsigned int size, i;
+
+	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
+
+	/* Allocate our local memory structure */
+	s = rte_zmalloc_socket("mempool-nb_stack",
+			       size,
+			       RTE_CACHE_LINE_SIZE,
+			       mp->socket_id);
+	if (s == NULL) {
+		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
+		return -ENOMEM;
+	}
+
+	s->size = n;
+
+	nb_lifo_init(&s->used_lifo);
+	nb_lifo_init(&s->free_lifo);
+
+	elems = (struct nb_lifo_elem *) &s[1];
+	for (i = 0; i < n; i++)
+		nb_lifo_push_single(&s->free_lifo, &elems[i]);
+
+	mp->pool_data = s;
+
+	return 0;
+}
+
+static int
+nb_stack_enqueue(struct rte_mempool *mp, void * const *obj_table,
+		 unsigned int n)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+	struct nb_lifo_elem *first, *last, *tmp;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Pop n free elements */
+	first = nb_lifo_pop(&s->free_lifo, n, NULL, NULL);
+	if (unlikely(!first))
+		return -ENOBUFS;
+
+	/* Prepare the list elements */
+	tmp = first;
+	for (i = 0; i < n; i++) {
+		tmp->data = obj_table[i];
+		last = tmp;
+		tmp = tmp->next;
+	}
+
+	/* Enqueue them to the used list */
+	nb_lifo_push(&s->used_lifo, first, last, n);
+
+	return 0;
+}
+
+static int
+nb_stack_dequeue(struct rte_mempool *mp, void **obj_table,
+		 unsigned int n)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+	struct nb_lifo_elem *first, *last;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Pop n used elements */
+	first = nb_lifo_pop(&s->used_lifo, n, obj_table, &last);
+	if (unlikely(!first))
+		return -ENOENT;
+
+	/* Enqueue the list elements to the free list */
+	nb_lifo_push(&s->free_lifo, first, last, n);
+
+	return 0;
+}
+
+static unsigned
+nb_stack_get_count(const struct rte_mempool *mp)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+
+	return nb_lifo_len(&s->used_lifo);
+}
+
+static void
+nb_stack_free(struct rte_mempool *mp)
+{
+	rte_free((void *)(mp->pool_data));
+}
+
+static struct rte_mempool_ops ops_nb_stack = {
+	.name = "nb_stack",
+	.alloc = nb_stack_alloc,
+	.free = nb_stack_free,
+	.enqueue = nb_stack_enqueue,
+	.dequeue = nb_stack_dequeue,
+	.get_count = nb_stack_get_count
+};
+
+MEMPOOL_REGISTER_OPS(ops_nb_stack);
diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
new file mode 100644
index 000000000..fc8c95e91
--- /dev/null
+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
@@ -0,0 +1,4 @@
+DPDK_19.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 02e8b6f05..0b11d9417 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -133,6 +133,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 
 _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET) += -lrte_mempool_bucket
 _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK)  += -lrte_mempool_nb_stack
 ifeq ($(CONFIG_RTE_LIBRTE_DPAA_BUS),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA_MEMPOOL)   += -lrte_mempool_dpaa
 endif
-- 
2.13.6


* [dpdk-dev] [PATCH 3/3] doc: add NB stack comment to EAL "known issues"
  2019-01-10 20:55 [dpdk-dev] [PATCH 0/3] Add non-blocking stack mempool handler Gage Eads
  2019-01-10 20:55 ` [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only) Gage Eads
  2019-01-10 20:55 ` [dpdk-dev] [PATCH 2/3] mempool/nb_stack: add non-blocking stack mempool Gage Eads
@ 2019-01-10 20:55 ` Gage Eads
  2019-01-15 22:32 ` [dpdk-dev] [PATCH v2 0/2] Add non-blocking stack mempool handler Gage Eads
  3 siblings, 0 replies; 43+ messages in thread
From: Gage Eads @ 2019-01-10 20:55 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

This comment makes users aware of the non-blocking stack option and its
caveats.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 doc/guides/prog_guide/env_abstraction_layer.rst | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 929d76dba..9497b879c 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,6 +541,11 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
+  Alternatively, x86_64 applications can use the non-blocking stack mempool handler. When considering this handler, note that:
+
+  - it is limited to the x86_64 platform, because it uses an instruction (16-byte compare-and-swap) that is not available on other platforms.
+  - it has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of handler operations.
+
 + rte_timer
 
   Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
-- 
2.13.6


* Re: [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only)
  2019-01-10 20:55 ` [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only) Gage Eads
@ 2019-01-13 12:18   ` Andrew Rybchenko
  2019-01-14  4:29     ` Varghese, Vipin
  2019-01-14 15:43     ` Eads, Gage
  0 siblings, 2 replies; 43+ messages in thread
From: Andrew Rybchenko @ 2019-01-13 12:18 UTC (permalink / raw)
  To: Gage Eads, dev; +Cc: olivier.matz, bruce.richardson, konstantin.ananyev

On 1/10/19 11:55 PM, Gage Eads wrote:
> This operation can be used for non-blocking algorithms, such as a
> non-blocking stack or ring.
>
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>   .../common/include/arch/x86/rte_atomic_64.h        | 22 ++++++++++++++++++++++
>   1 file changed, 22 insertions(+)
>
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> index fd2ec9c53..34c2addf8 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> @@ -34,6 +34,7 @@
>   /*
>    * Inspired from FreeBSD src/sys/amd64/include/atomic.h
>    * Copyright (c) 1998 Doug Rabson
> + * Copyright (c) 2019 Intel Corporation
>    * All rights reserved.
>    */
>   
> @@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
>   }
>   #endif
>   
> +static inline int
> +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
> +{
> +	uint8_t res;
> +
> +	asm volatile (
> +		      MPLOCKED
> +		      "cmpxchg16b %[dst];"
> +		      " sete %[res]"
> +		      : [dst] "=m" (*dst),
> +			[res] "=r" (res)
> +		      : "c" (src[1]),
> +			"b" (src[0]),
> +			"m" (*dst),
> +			"d" (exp[1]),
> +			"a" (exp[0])
> +		      : "memory");
> +
> +	return res;
> +}
> +
>   #endif /* _RTE_ATOMIC_X86_64_H_ */

Is it OK to add this to the rte_atomic_64.h header, which is for 64-bit
integer ops?

Andrew.


* Re: [dpdk-dev] [PATCH 2/3] mempool/nb_stack: add non-blocking stack mempool
  2019-01-10 20:55 ` [dpdk-dev] [PATCH 2/3] mempool/nb_stack: add non-blocking stack mempool Gage Eads
@ 2019-01-13 13:31   ` Andrew Rybchenko
  2019-01-14 16:22     ` Eads, Gage
  0 siblings, 1 reply; 43+ messages in thread
From: Andrew Rybchenko @ 2019-01-13 13:31 UTC (permalink / raw)
  To: Gage Eads, dev; +Cc: olivier.matz, bruce.richardson, konstantin.ananyev

Hi Gage,

In general it looks very good.

Have you considered making nb_lifo.h a library so it can be reused outside
of the mempool driver?

There are few notes below.

Thanks,
Andrew.

On 1/10/19 11:55 PM, Gage Eads wrote:
> This commit adds support for a non-blocking (linked-list-based) stack mempool
> handler. The stack uses a 128-bit compare-and-swap instruction, and thus is
> limited to x86_64. The 128-bit CAS atomically updates the stack top pointer
> and a modification counter, which protects against the ABA problem.
>
> In mempool_perf_autotest the lock-based stack outperforms the non-blocking
> handler*, however:
> - For applications with preemptible pthreads, a lock-based stack's
>    worst-case performance (i.e. one thread being preempted while
>    holding the spinlock) is much worse than the non-blocking stack's.
> - Using per-thread mempool caches will largely mitigate the performance
>    difference.
>
> *Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
> running on isolcpus cores with a tickless scheduler. The lock-based stack's
> rate_persec was 1x-3.5x the non-blocking stack's.
>
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>   MAINTAINERS                                        |   4 +
>   config/common_base                                 |   1 +
>   drivers/mempool/Makefile                           |   1 +
>   drivers/mempool/nb_stack/Makefile                  |  30 +++++
>   drivers/mempool/nb_stack/meson.build               |   4 +
>   drivers/mempool/nb_stack/nb_lifo.h                 | 132 +++++++++++++++++++++
>   drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 +++++++++++++++++++
>   .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
>   mk/rte.app.mk                                      |   1 +
>   9 files changed, 302 insertions(+)
>   create mode 100644 drivers/mempool/nb_stack/Makefile
>   create mode 100644 drivers/mempool/nb_stack/meson.build
>   create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
>   create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
>   create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 470f36b9c..5519d3323 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
>   M: Andrew Rybchenko <arybchenko@solarflare.com>
>   F: drivers/mempool/bucket/
>   
> +Non-blocking stack memory pool
> +M: Gage Eads <gage.eads@intel.com>
> +F: drivers/mempool/nb_stack/
> +
>   
>   Bus Drivers
>   -----------
> diff --git a/config/common_base b/config/common_base
> index 964a6956e..40ce47312 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -728,6 +728,7 @@ CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
>   CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
>   CONFIG_RTE_DRIVER_MEMPOOL_RING=y
>   CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
> +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y

Typically it is alphabetically sorted.

>   #
>   # Compile PMD for octeontx fpa mempool device
> diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
> index 28c2e8360..aeae3ac12 100644
> --- a/drivers/mempool/Makefile
> +++ b/drivers/mempool/Makefile
> @@ -13,5 +13,6 @@ endif
>   DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
>   DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
>   DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
> +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack

Typically it is alphabetically sorted. Yes, the order is already broken, but
please put it before ring.

>   
>   include $(RTE_SDK)/mk/rte.subdir.mk
> diff --git a/drivers/mempool/nb_stack/Makefile b/drivers/mempool/nb_stack/Makefile
> new file mode 100644
> index 000000000..38b45f4f5
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/Makefile
> @@ -0,0 +1,30 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(c) 2019 Intel Corporation
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +# The non-blocking stack uses a 128-bit compare-and-swap instruction, and thus
> +# is limited to x86_64.
> +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> +
> +#
> +# library name
> +#
> +LIB = librte_mempool_nb_stack.a
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +
> +# Headers
> +CFLAGS += -I$(RTE_SDK)/lib/librte_mempool

I guess it is derived from stack. Is it really required? There is no such
line in ring, bucket, octeontx, or dpaa2.

> +LDLIBS += -lrte_eal -lrte_mempool
> +
> +EXPORT_MAP := rte_mempool_nb_stack_version.map
> +
> +LIBABIVER := 1
> +
> +SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
> +
> +endif
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
> new file mode 100644
> index 000000000..66d64a9ba
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/meson.build
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(c) 2019 Intel Corporation
> +
> +sources = files('rte_mempool_nb_stack.c')

Have you tested the meson build for a non-x86_64 target?
I guess it should be fixed to skip this driver on non-x86_64 builds.

> diff --git a/drivers/mempool/nb_stack/nb_lifo.h b/drivers/mempool/nb_stack/nb_lifo.h
> new file mode 100644
> index 000000000..701d75e37
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/nb_lifo.h
> @@ -0,0 +1,132 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#ifndef _NB_LIFO_H_
> +#define _NB_LIFO_H_
> +
> +struct nb_lifo_elem {
> +	void *data;
> +	struct nb_lifo_elem *next;
> +};
> +
> +struct nb_lifo_head {
> +	struct nb_lifo_elem *top; /**< Stack top */
> +	uint64_t cnt; /**< Modification counter */
> +};
> +
> +struct nb_lifo {
> +	volatile struct nb_lifo_head head __rte_aligned(16);
> +	rte_atomic64_t len;
> +} __rte_cache_aligned;
> +
> +static __rte_always_inline void
> +nb_lifo_init(struct nb_lifo *lifo)
> +{
> +	memset(lifo, 0, sizeof(*lifo));
> +	rte_atomic64_set(&lifo->len, 0);
> +}
> +
> +static __rte_always_inline unsigned int
> +nb_lifo_len(struct nb_lifo *lifo)
> +{
> +	return (unsigned int) rte_atomic64_read(&lifo->len);
> +}
> +
> +static __rte_always_inline void
> +nb_lifo_push(struct nb_lifo *lifo,
> +	     struct nb_lifo_elem *first,
> +	     struct nb_lifo_elem *last,
> +	     unsigned int num)
> +{
> +	while (1) {
> +		struct nb_lifo_head old_head, new_head;
> +
> +		old_head = lifo->head;
> +
> +		/* Swing the top pointer to the first element in the list and
> +		 * make the last element point to the old top.
> +		 */
> +		new_head.top = first;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		last->next = old_head.top;
> +
> +		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
> +					 (uint64_t *)&old_head,
> +					 (uint64_t *)&new_head))
> +			break;
> +	}
> +
> +	rte_atomic64_add(&lifo->len, num);

I'd like to understand why it is not a problem that changing the list and
updating its length are not atomic; the stack length can be wrong in between.
It would be good to explain this in a comment.

> +}
> +
> +static __rte_always_inline void
> +nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
> +{
> +	nb_lifo_push(lifo, elem, elem, 1);
> +}
> +
> +static __rte_always_inline struct nb_lifo_elem *
> +nb_lifo_pop(struct nb_lifo *lifo,
> +	    unsigned int num,
> +	    void **obj_table,
> +	    struct nb_lifo_elem **last)
> +{
> +	struct nb_lifo_head old_head;
> +
> +	/* Reserve num elements, if available */
> +	while (1) {
> +		uint64_t len = rte_atomic64_read(&lifo->len);
> +
> +		/* Does the list contain enough elements? */
> +		if (len < num)
> +			return NULL;
> +
> +		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
> +					len, len - num))
> +			break;
> +	}
> +
> +	/* Pop num elements */
> +	while (1) {
> +		struct nb_lifo_head new_head;
> +		struct nb_lifo_elem *tmp;
> +		unsigned int i;
> +
> +		old_head = lifo->head;
> +
> +		tmp = old_head.top;
> +
> +		/* Traverse the list to find the new head. A next pointer will
> +		 * either point to another element or NULL; if a thread
> +		 * encounters a pointer that has already been popped, the CAS
> +		 * will fail.
> +		 */
> +		for (i = 0; i < num && tmp != NULL; i++) {
> +			if (obj_table)
> +				obj_table[i] = tmp->data;
> +			if (last)
> +				*last = tmp;

Isn't it better to do the obj_table and last assignments later, once the
elements have been successfully reserved? If there are no retries, the current
solution is optimal, but I guess a solution with a second traversal to fill in
obj_table would show more stable performance under high load, when many
retries are done.

> +			tmp = tmp->next;
> +		}
> +
> +		/* If NULL was encountered, the list was modified while
> +		 * traversing it. Retry.
> +		 */
> +		if (i != num)
> +			continue;
> +
> +		new_head.top = tmp;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
> +					 (uint64_t *)&old_head,
> +					 (uint64_t *)&new_head))
> +			break;
> +	}
> +
> +	return old_head.top;
> +}
> +
> +#endif /* _NB_LIFO_H_ */
> diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> new file mode 100644
> index 000000000..1b30775f7
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#include <stdio.h>
> +#include <rte_mempool.h>
> +#include <rte_malloc.h>
> +
> +#include "nb_lifo.h"
> +
> +struct rte_mempool_nb_stack {
> +	uint64_t size;
> +	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers  */
> +	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
> +};
> +
> +static int
> +nb_stack_alloc(struct rte_mempool *mp)
> +{
> +	struct rte_mempool_nb_stack *s;
> +	struct nb_lifo_elem *elems;
> +	unsigned int n = mp->size;
> +	unsigned int size, i;
> +
> +	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
> +
> +	/* Allocate our local memory structure */
> +	s = rte_zmalloc_socket("mempool-nb_stack",
> +			       size,
> +			       RTE_CACHE_LINE_SIZE,
> +			       mp->socket_id);
> +	if (s == NULL) {
> +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
> +		return -ENOMEM;
> +	}
> +
> +	s->size = n;
> +
> +	nb_lifo_init(&s->used_lifo);
> +	nb_lifo_init(&s->free_lifo);
> +
> +	elems = (struct nb_lifo_elem *) &s[1];

Does checkpatch.sh generate a warning here because of the space after the
type cast? There are a few similar cases in the patch.

> +	for (i = 0; i < n; i++)
> +		nb_lifo_push_single(&s->free_lifo, &elems[i]);
> +
> +	mp->pool_data = s;
> +
> +	return 0;
> +}
> +
> +static int
> +nb_stack_enqueue(struct rte_mempool *mp, void * const *obj_table,
> +		 unsigned int n)
> +{
> +	struct rte_mempool_nb_stack *s = mp->pool_data;
> +	struct nb_lifo_elem *first, *last, *tmp;
> +	unsigned int i;
> +
> +	if (unlikely(n == 0))
> +		return 0;
> +
> +	/* Pop n free elements */
> +	first = nb_lifo_pop(&s->free_lifo, n, NULL, NULL);
> +	if (unlikely(!first))

Just a nit, but as far as I know an explicit comparison with NULL is typically used
in DPDK.
(There are a few such cases in the patch.)

> +		return -ENOBUFS;
> +
> +	/* Prepare the list elements */
> +	tmp = first;
> +	for (i = 0; i < n; i++) {
> +		tmp->data = obj_table[i];
> +		last = tmp;
> +		tmp = tmp->next;
> +	}
> +
> +	/* Enqueue them to the used list */
> +	nb_lifo_push(&s->used_lifo, first, last, n);
> +
> +	return 0;
> +}
> +
> +static int
> +nb_stack_dequeue(struct rte_mempool *mp, void **obj_table,
> +		 unsigned int n)
> +{
> +	struct rte_mempool_nb_stack *s = mp->pool_data;
> +	struct nb_lifo_elem *first, *last;
> +
> +	if (unlikely(n == 0))
> +		return 0;
> +
> +	/* Pop n used elements */
> +	first = nb_lifo_pop(&s->used_lifo, n, obj_table, &last);
> +	if (unlikely(!first))
> +		return -ENOENT;
> +
> +	/* Enqueue the list elements to the free list */
> +	nb_lifo_push(&s->free_lifo, first, last, n);
> +
> +	return 0;
> +}
> +
> +static unsigned
> +nb_stack_get_count(const struct rte_mempool *mp)
> +{
> +	struct rte_mempool_nb_stack *s = mp->pool_data;
> +
> +	return nb_lifo_len(&s->used_lifo);
> +}
> +
> +static void
> +nb_stack_free(struct rte_mempool *mp)
> +{
> +	rte_free((void *)(mp->pool_data));

I think the type cast is not required.

> +}
> +
> +static struct rte_mempool_ops ops_nb_stack = {
> +	.name = "nb_stack",
> +	.alloc = nb_stack_alloc,
> +	.free = nb_stack_free,
> +	.enqueue = nb_stack_enqueue,
> +	.dequeue = nb_stack_dequeue,
> +	.get_count = nb_stack_get_count
> +};
> +
> +MEMPOOL_REGISTER_OPS(ops_nb_stack);
> diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> new file mode 100644
> index 000000000..fc8c95e91
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> @@ -0,0 +1,4 @@
> +DPDK_19.05 {
> +
> +	local: *;
> +};
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 02e8b6f05..0b11d9417 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -133,6 +133,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
>   
>   _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET) += -lrte_mempool_bucket
>   _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
> +_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK)  += -lrte_mempool_nb_stack

It is better to sort it alphabetically.

>   ifeq ($(CONFIG_RTE_LIBRTE_DPAA_BUS),y)
>   _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA_MEMPOOL)   += -lrte_mempool_dpaa
>   endif

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only)
  2019-01-13 12:18   ` Andrew Rybchenko
@ 2019-01-14  4:29     ` Varghese, Vipin
  2019-01-14 15:46       ` Eads, Gage
  2019-01-14 15:43     ` Eads, Gage
  1 sibling, 1 reply; 43+ messages in thread
From: Varghese, Vipin @ 2019-01-14  4:29 UTC (permalink / raw)
  To: Andrew Rybchenko, Eads, Gage, dev
  Cc: olivier.matz, Richardson, Bruce, Ananyev, Konstantin

Hi Gage,

snipped
> > @@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
> >   }
> >   #endif
> >
> > +static inline int
> > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
> > +{
> > +	uint8_t res;
> > +
> > +	asm volatile (
> > +		      MPLOCKED
> > +		      "cmpxchg16b %[dst];"
> > +		      " sete %[res]"
> > +		      : [dst] "=m" (*dst),
> > +			[res] "=r" (res)
> > +		      : "c" (src[1]),
> > +			"b" (src[0]),
> > +			"m" (*dst),
> > +			"d" (exp[1]),
> > +			"a" (exp[0])
> > +		      : "memory");
Since the update depends upon the 'set|unset' value of ZF, should we first set ZF to 0?

Apologies in advance if it is internally taken care by 'sete'.

> > +
> > +	return res;
> > +}
> > +
> >   #endif /* _RTE_ATOMIC_X86_64_H_ */
> 
> Is it OK to add it to rte_atomic_64.h header which is for 64-bit integer ops?
> 
> Andrew.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only)
  2019-01-13 12:18   ` Andrew Rybchenko
  2019-01-14  4:29     ` Varghese, Vipin
@ 2019-01-14 15:43     ` Eads, Gage
  1 sibling, 0 replies; 43+ messages in thread
From: Eads, Gage @ 2019-01-14 15:43 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: olivier.matz, Richardson, Bruce, Ananyev, Konstantin



From: Andrew Rybchenko [mailto:arybchenko@solarflare.com]
Sent: Sunday, January 13, 2019 6:19 AM
To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
Cc: olivier.matz@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin <konstantin.ananyev@intel.com>
Subject: Re: [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only)

On 1/10/19 11:55 PM, Gage Eads wrote:

This operation can be used for non-blocking algorithms, such as a
non-blocking stack or ring.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 .../common/include/arch/x86/rte_atomic_64.h        | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
index fd2ec9c53..34c2addf8 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
@@ -34,6 +34,7 @@
 /*
  * Inspired from FreeBSD src/sys/amd64/include/atomic.h
  * Copyright (c) 1998 Doug Rabson
+ * Copyright (c) 2019 Intel Corporation
  * All rights reserved.
  */

@@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
 }
 #endif

+static inline int
+rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
+{
+	uint8_t res;
+
+	asm volatile (
+		      MPLOCKED
+		      "cmpxchg16b %[dst];"
+		      " sete %[res]"
+		      : [dst] "=m" (*dst),
+			[res] "=r" (res)
+		      : "c" (src[1]),
+			"b" (src[0]),
+			"m" (*dst),
+			"d" (exp[1]),
+			"a" (exp[0])
+		      : "memory");
+
+	return res;
+}
+
 #endif /* _RTE_ATOMIC_X86_64_H_ */

Is it OK to add it to rte_atomic_64.h header which is for 64-bit integer ops?

Andrew.

I believe this file is for atomic operations specific to x86_64 builds, but not necessarily limited to 64-bit operations (note that rte_atomic_32.h contains 64-bit operations specific to 32-bit builds). At least, that’s how I interpreted it.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only)
  2019-01-14  4:29     ` Varghese, Vipin
@ 2019-01-14 15:46       ` Eads, Gage
  2019-01-16  4:34         ` Varghese, Vipin
  0 siblings, 1 reply; 43+ messages in thread
From: Eads, Gage @ 2019-01-14 15:46 UTC (permalink / raw)
  To: Varghese, Vipin, Andrew Rybchenko, dev
  Cc: olivier.matz, Richardson, Bruce, Ananyev, Konstantin



> -----Original Message-----
> From: Varghese, Vipin
> Sent: Sunday, January 13, 2019 10:29 PM
> To: Andrew Rybchenko <arybchenko@solarflare.com>; Eads, Gage
> <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only)
> 
> Hi Gage,
> 
> snipped
> > > @@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
> > >   }
> > >   #endif
> > >
> > > +static inline int
> > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
> > > +{
> > > +	uint8_t res;
> > > +
> > > +	asm volatile (
> > > +		      MPLOCKED
> > > +		      "cmpxchg16b %[dst];"
> > > +		      " sete %[res]"
> > > +		      : [dst] "=m" (*dst),
> > > +			[res] "=r" (res)
> > > +		      : "c" (src[1]),
> > > +			"b" (src[0]),
> > > +			"m" (*dst),
> > > +			"d" (exp[1]),
> > > +			"a" (exp[0])
> > > +		      : "memory");
> Since the update depends upon the 'set|unset' value of ZF, should we first set ZF
> to 0?
> 
> Apologies in advance if it is internally taken care by 'sete'.

cmpxchg16b will set the ZF if the compared values are equal, else it will clear the ZF, so there's no need to initialize the ZF.

Source: https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b
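
To make the flag behaviour concrete, here is a plain-C (non-atomic) model of the compare/ZF semantics that the manual describes for cmpxchg16b; it is only an illustration of the instruction's definition under my reading of that reference, not DPDK code, and the function name is made up:

```c
#include <stdint.h>

/* Illustrative model: cmpxchg16b compares RDX:RAX against the 16-byte
 * destination. On a match it stores RCX:RBX into the destination and sets
 * ZF; on a mismatch it loads the current destination into RDX:RAX and
 * clears ZF. sete then materializes ZF as 1 or 0, so no prior
 * initialization of ZF is needed.
 */
static int
cmpxchg16b_model(uint64_t dst[2], uint64_t exp[2], const uint64_t src[2])
{
	if (dst[0] == exp[0] && dst[1] == exp[1]) {
		dst[0] = src[0];
		dst[1] = src[1];
		return 1; /* ZF set -> sete yields 1 */
	}
	exp[0] = dst[0];
	exp[1] = dst[1];
	return 0; /* ZF clear -> sete yields 0 */
}
```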

Thanks,
Gage

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH 2/3] mempool/nb_stack: add non-blocking stack mempool
  2019-01-13 13:31   ` Andrew Rybchenko
@ 2019-01-14 16:22     ` Eads, Gage
  0 siblings, 0 replies; 43+ messages in thread
From: Eads, Gage @ 2019-01-14 16:22 UTC (permalink / raw)
  To: Andrew Rybchenko, dev
  Cc: olivier.matz, Richardson, Bruce, Ananyev, Konstantin



> -----Original Message-----
> From: Andrew Rybchenko [mailto:arybchenko@solarflare.com]
> Sent: Sunday, January 13, 2019 7:31 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Subject: Re: [PATCH 2/3] mempool/nb_stack: add non-blocking stack mempool
> 
> Hi Gage,
> 
> In general looks very good.
> 
> Have you considered to make nb_lifo.h a library to be reusable outside of the
> mempool driver?

I'm certainly open to it, if the community can benefit. Due to the difficulty of adding a new lib/ directory (I believe this requires Tech Board approval), I'll defer that work to a separate patchset.

> 
> There are few notes below.
> 
> Thanks,
> Andrew.
> 
> On 1/10/19 11:55 PM, Gage Eads wrote:
> 
> 
> 	This commit adds support for non-blocking (linked list based) stack mempool
> 	handler. The stack uses a 128-bit compare-and-swap instruction, and thus is
> 	limited to x86_64. The 128-bit CAS atomically updates the stack top pointer
> 	and a modification counter, which protects against the ABA problem.
> 
> 	In mempool_perf_autotest the lock-based stack outperforms the non-blocking
> 	handler*, however:
> 	- For applications with preemptible pthreads, a lock-based stack's
> 	  worst-case performance (i.e. one thread being preempted while
> 	  holding the spinlock) is much worse than the non-blocking stack's.
> 	- Using per-thread mempool caches will largely mitigate the performance
> 	  difference.
> 
> 	*Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
> 	running on isolcpus cores with a tickless scheduler. The lock-based stack's
> 	rate_persec was 1x-3.5x the non-blocking stack's.
> 
> 	Signed-off-by: Gage Eads <gage.eads@intel.com>
> 	---
> 	 MAINTAINERS                                        |   4 +
> 	 config/common_base                                 |   1 +
> 	 drivers/mempool/Makefile                           |   1 +
> 	 drivers/mempool/nb_stack/Makefile                  |  30 +++++
> 	 drivers/mempool/nb_stack/meson.build               |   4 +
> 	 drivers/mempool/nb_stack/nb_lifo.h                 | 132 +++++++++++++++++++++
> 	 drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 +++++++++++++++++++
> 	 .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
> 	 mk/rte.app.mk                                      |   1 +
> 	 9 files changed, 302 insertions(+)
> 	 create mode 100644 drivers/mempool/nb_stack/Makefile
> 	 create mode 100644 drivers/mempool/nb_stack/meson.build
> 	 create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> 	 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> 	 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> 
> 	diff --git a/MAINTAINERS b/MAINTAINERS
> 	index 470f36b9c..5519d3323 100644
> 	--- a/MAINTAINERS
> 	+++ b/MAINTAINERS
> 	@@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
> 	 M: Andrew Rybchenko <arybchenko@solarflare.com>
> 	 F: drivers/mempool/bucket/
> 
> 	+Non-blocking stack memory pool
> 	+M: Gage Eads <gage.eads@intel.com>
> 	+F: drivers/mempool/nb_stack/
> 	+
> 
> 	 Bus Drivers
> 	 -----------
> 	diff --git a/config/common_base b/config/common_base
> 	index 964a6956e..40ce47312 100644
> 	--- a/config/common_base
> 	+++ b/config/common_base
> 	@@ -728,6 +728,7 @@ CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> 	 CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> 	 CONFIG_RTE_DRIVER_MEMPOOL_RING=y
> 	 CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
> 	+CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> 
> 
> Typically it is alphabetically sorted.

Will fix.

> 
> 
> 
> 	 #
> 	 # Compile PMD for octeontx fpa mempool device
> 	diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
> 	index 28c2e8360..aeae3ac12 100644
> 	--- a/drivers/mempool/Makefile
> 	+++ b/drivers/mempool/Makefile
> 	@@ -13,5 +13,6 @@ endif
> 	 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
> 	 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
> 	 DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
> 	+DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
> 
> 
> Typically it is alphabetically sorted. Yes, already broken, but, please, put it
> before ring.
> 

Sure, will do.

> 
> 
> 
> 	 include $(RTE_SDK)/mk/rte.subdir.mk
> 	diff --git a/drivers/mempool/nb_stack/Makefile b/drivers/mempool/nb_stack/Makefile
> 	new file mode 100644
> 	index 000000000..38b45f4f5
> 	--- /dev/null
> 	+++ b/drivers/mempool/nb_stack/Makefile
> 	@@ -0,0 +1,30 @@
> 	+# SPDX-License-Identifier: BSD-3-Clause
> 	+# Copyright(c) 2019 Intel Corporation
> 	+
> 	+include $(RTE_SDK)/mk/rte.vars.mk
> 	+
> 	+# The non-blocking stack uses a 128-bit compare-and-swap instruction, and thus
> 	+# is limited to x86_64.
> 	+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> 	+
> 	+#
> 	+# library name
> 	+#
> 	+LIB = librte_mempool_nb_stack.a
> 	+
> 	+CFLAGS += -O3
> 	+CFLAGS += $(WERROR_FLAGS)
> 	+
> 	+# Headers
> 	+CFLAGS += -I$(RTE_SDK)/lib/librte_mempool
> 
> 
> I guess it is derived from stack. Is it really required? There is no such line in ring,
> bucket, octeontx and dpaa2.
> 
> 

Good guess :). No, I don't believe so -- I'll remove this.

> 
> 	+LDLIBS += -lrte_eal -lrte_mempool
> 	+
> 	+EXPORT_MAP := rte_mempool_nb_stack_version.map
> 	+
> 	+LIBABIVER := 1
> 	+
> 	+SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
> 	+
> 	+endif
> 	+
> 	+include $(RTE_SDK)/mk/rte.lib.mk
> 	diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
> 	new file mode 100644
> 	index 000000000..66d64a9ba
> 	--- /dev/null
> 	+++ b/drivers/mempool/nb_stack/meson.build
> 	@@ -0,0 +1,4 @@
> 	+# SPDX-License-Identifier: BSD-3-Clause
> 	+# Copyright(c) 2019 Intel Corporation
> 	+
> 	+sources = files('rte_mempool_nb_stack.c')
> 
> 
> Have you tested the meson build for a non-x86_64 target?
> I guess it should be fixed to skip it on non-x86_64 builds.
> 

I did not -- my mistake. I'll correct this.

> 
> 
> 
> 	diff --git a/drivers/mempool/nb_stack/nb_lifo.h b/drivers/mempool/nb_stack/nb_lifo.h
> 	new file mode 100644
> 	index 000000000..701d75e37
> 	--- /dev/null
> 	+++ b/drivers/mempool/nb_stack/nb_lifo.h
> 	@@ -0,0 +1,132 @@
> 	+/* SPDX-License-Identifier: BSD-3-Clause
> 	+ * Copyright(c) 2019 Intel Corporation
> 	+ */
> 	+
> 	+#ifndef _NB_LIFO_H_
> 	+#define _NB_LIFO_H_
> 	+
> 	+struct nb_lifo_elem {
> 	+	void *data;
> 	+	struct nb_lifo_elem *next;
> 	+};
> 	+
> 	+struct nb_lifo_head {
> 	+	struct nb_lifo_elem *top; /**< Stack top */
> 	+	uint64_t cnt; /**< Modification counter */
> 	+};
> 	+
> 	+struct nb_lifo {
> 	+	volatile struct nb_lifo_head head __rte_aligned(16);
> 	+	rte_atomic64_t len;
> 	+} __rte_cache_aligned;
> 	+
> 	+static __rte_always_inline void
> 	+nb_lifo_init(struct nb_lifo *lifo)
> 	+{
> 	+	memset(lifo, 0, sizeof(*lifo));
> 	+	rte_atomic64_set(&lifo->len, 0);
> 	+}
> 	+
> 	+static __rte_always_inline unsigned int
> 	+nb_lifo_len(struct nb_lifo *lifo)
> 	+{
> 	+	return (unsigned int) rte_atomic64_read(&lifo->len);
> 	+}
> 	+
> 	+static __rte_always_inline void
> 	+nb_lifo_push(struct nb_lifo *lifo,
> 	+	     struct nb_lifo_elem *first,
> 	+	     struct nb_lifo_elem *last,
> 	+	     unsigned int num)
> 	+{
> 	+	while (1) {
> 	+		struct nb_lifo_head old_head, new_head;
> 	+
> 	+		old_head = lifo->head;
> 	+
> 	+		/* Swing the top pointer to the first element in the list and
> 	+		 * make the last element point to the old top.
> 	+		 */
> 	+		new_head.top = first;
> 	+		new_head.cnt = old_head.cnt + 1;
> 	+
> 	+		last->next = old_head.top;
> 	+
> 	+		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
> 	+					 (uint64_t *)&old_head,
> 	+					 (uint64_t *)&new_head))
> 	+			break;
> 	+	}
> 	+
> 	+	rte_atomic64_add(&lifo->len, num);
> 
> 
> I'd like to understand why it is not a problem that change of the list and increase
> its length are not atomic. So, we can get wrong length of the stack in the
> middle. It would be good to explain it in the comment.
> 

Indeed, there is a window in which the list appears shorter than it is. I don't believe this is a problem because the get_count callback is inherently racy/approximate (if it is called while the list is being accessed). That is, even if the list and its size were updated atomically, the size could change between when get_count reads the size and when that value is returned to the calling thread.

I placed the lifo->len updates such that the list may appear to have fewer elements than it does, but will never appear to have more elements. If the mempool is near-empty to the point that this is a concern, I think the bigger problem is the mempool size.

If that seems reasonable, I'll add that as a comment in the code.
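
The invariant described above (the length is increased after the list is pushed, and reserved before the list is popped, so a concurrent reader of the length can only under-count, never over-count) can be sketched single-threaded as follows; the names and the simplification to non-atomic operations are illustrative, not the patch's exact code:

```c
#include <stddef.h>
#include <stdint.h>

struct elem { void *data; struct elem *next; };
struct lifo { struct elem *top; uint64_t len; };

/* Push publishes the element before len is increased, so a concurrent
 * reader of len can only under-count, never over-count. */
static void
push(struct lifo *l, struct elem *e)
{
	e->next = l->top;
	l->top = e;	/* list updated first... */
	l->len += 1;	/* ...then the length */
}

/* Pop reserves from len before touching the list, so a successful
 * reservation implies the elements are guaranteed to exist. */
static struct elem *
pop(struct lifo *l)
{
	if (l->len < 1)
		return NULL;
	l->len -= 1;	/* reserve first... */
	struct elem *e = l->top;
	l->top = e->next;	/* ...then unlink */
	return e;
}
```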

> 
> 
> 	+}
> 	+
> 	+static __rte_always_inline void
> 	+nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
> 	+{
> 	+	nb_lifo_push(lifo, elem, elem, 1);
> 	+}
> 	+
> 	+static __rte_always_inline struct nb_lifo_elem *
> 	+nb_lifo_pop(struct nb_lifo *lifo,
> 	+	    unsigned int num,
> 	+	    void **obj_table,
> 	+	    struct nb_lifo_elem **last)
> 	+{
> 	+	struct nb_lifo_head old_head;
> 	+
> 	+	/* Reserve num elements, if available */
> 	+	while (1) {
> 	+		uint64_t len = rte_atomic64_read(&lifo->len);
> 	+
> 	+		/* Does the list contain enough elements? */
> 	+		if (len < num)
> 	+			return NULL;
> 	+
> 	+		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
> 	+					len, len - num))
> 	+			break;
> 	+	}
> 	+
> 	+	/* Pop num elements */
> 	+	while (1) {
> 	+		struct nb_lifo_head new_head;
> 	+		struct nb_lifo_elem *tmp;
> 	+		unsigned int i;
> 	+
> 	+		old_head = lifo->head;
> 	+
> 	+		tmp = old_head.top;
> 	+
> 	+		/* Traverse the list to find the new head. A next pointer will
> 	+		 * either point to another element or NULL; if a thread
> 	+		 * encounters a pointer that has already been popped, the CAS
> 	+		 * will fail.
> 	+		 */
> 	+		for (i = 0; i < num && tmp != NULL; i++) {
> 	+			if (obj_table)
> 	+				obj_table[i] = tmp->data;
> 	+			if (last)
> 	+				*last = tmp;
> 
> 
> Isn't it better to do obj_table and last assignment later when we successfully
> reserved elements? If there is not retries, current solution is optimal, but I guess
> solution with the second traversal to fill in obj_table will show more stable
> performance results under high load when many retries are done.
> 

I suspect that the latency of the writes to obj_table and last would largely be hidden by the pointer chasing, since obj_table (aided by the next-line prefetcher) and last should be cached but chances are tmp->next won't be.

Admittedly this is just a theory; I haven't experimentally confirmed anything. If you prefer, I'll investigate this further.

> 
> 
> 	+			tmp = tmp->next;
> 	+		}
> 	+
> 	+		/* If NULL was encountered, the list was modified while
> 	+		 * traversing it. Retry.
> 	+		 */
> 	+		if (i != num)
> 	+			continue;
> 	+
> 	+		new_head.top = tmp;
> 	+		new_head.cnt = old_head.cnt + 1;
> 	+
> 	+		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
> 	+					 (uint64_t *)&old_head,
> 	+					 (uint64_t *)&new_head))
> 	+			break;
> 	+	}
> 	+
> 	+	return old_head.top;
> 	+}
> 	+
> 	+#endif /* _NB_LIFO_H_ */
> 	diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> 	new file mode 100644
> 	index 000000000..1b30775f7
> 	--- /dev/null
> 	+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> 	@@ -0,0 +1,125 @@
> 	+/* SPDX-License-Identifier: BSD-3-Clause
> 	+ * Copyright(c) 2019 Intel Corporation
> 	+ */
> 	+
> 	+#include <stdio.h>
> 	+#include <rte_mempool.h>
> 	+#include <rte_malloc.h>
> 	+
> 	+#include "nb_lifo.h"
> 	+
> 	+struct rte_mempool_nb_stack {
> 	+	uint64_t size;
> 	+	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers */
> 	+	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
> 	+};
> 	+
> 	+static int
> 	+nb_stack_alloc(struct rte_mempool *mp)
> 	+{
> 	+	struct rte_mempool_nb_stack *s;
> 	+	struct nb_lifo_elem *elems;
> 	+	unsigned int n = mp->size;
> 	+	unsigned int size, i;
> 	+
> 	+	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
> 	+
> 	+	/* Allocate our local memory structure */
> 	+	s = rte_zmalloc_socket("mempool-nb_stack",
> 	+			       size,
> 	+			       RTE_CACHE_LINE_SIZE,
> 	+			       mp->socket_id);
> 	+	if (s == NULL) {
> 	+		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
> 	+		return -ENOMEM;
> 	+	}
> 	+
> 	+	s->size = n;
> 	+
> 	+	nb_lifo_init(&s->used_lifo);
> 	+	nb_lifo_init(&s->free_lifo);
> 	+
> 	+	elems = (struct nb_lifo_elem *) &s[1];
> 
> 
> Does checkpatch.sh generate a warning here because of the space after the type cast?
> There are a few similar cases in the patch.
> 

No, because that's a --strict option. Regardless, I'm starting to prefer that style -- I'll fix these instances in v2.

> 
> 
> 	+	for (i = 0; i < n; i++)
> 	+		nb_lifo_push_single(&s->free_lifo, &elems[i]);
> 	+
> 	+	mp->pool_data = s;
> 	+
> 	+	return 0;
> 	+}
> 	+
> 	+static int
> 	+nb_stack_enqueue(struct rte_mempool *mp, void * const *obj_table,
> 	+		 unsigned int n)
> 	+{
> 	+	struct rte_mempool_nb_stack *s = mp->pool_data;
> 	+	struct nb_lifo_elem *first, *last, *tmp;
> 	+	unsigned int i;
> 	+
> 	+	if (unlikely(n == 0))
> 	+		return 0;
> 	+
> 	+	/* Pop n free elements */
> 	+	first = nb_lifo_pop(&s->free_lifo, n, NULL, NULL);
> 	+	if (unlikely(!first))
> 
> 
> Just a nit, but as far as I know an explicit comparison with NULL is typically used in DPDK.
> (There are a few such cases in the patch.)
> 

Looks like this is explicitly called out in the style guide: https://doc.dpdk.org/guides/contributing/coding_style.html#null-pointers

Will fix in v2.

> 
> 
> 	+		return -ENOBUFS;
> 	+
> 	+	/* Prepare the list elements */
> 	+	tmp = first;
> 	+	for (i = 0; i < n; i++) {
> 	+		tmp->data = obj_table[i];
> 	+		last = tmp;
> 	+		tmp = tmp->next;
> 	+	}
> 	+
> 	+	/* Enqueue them to the used list */
> 	+	nb_lifo_push(&s->used_lifo, first, last, n);
> 	+
> 	+	return 0;
> 	+}
> 	+
> 	+static int
> 	+nb_stack_dequeue(struct rte_mempool *mp, void **obj_table,
> 	+		 unsigned int n)
> 	+{
> 	+	struct rte_mempool_nb_stack *s = mp->pool_data;
> 	+	struct nb_lifo_elem *first, *last;
> 	+
> 	+	if (unlikely(n == 0))
> 	+		return 0;
> 	+
> 	+	/* Pop n used elements */
> 	+	first = nb_lifo_pop(&s->used_lifo, n, obj_table, &last);
> 	+	if (unlikely(!first))
> 	+		return -ENOENT;
> 	+
> 	+	/* Enqueue the list elements to the free list */
> 	+	nb_lifo_push(&s->free_lifo, first, last, n);
> 	+
> 	+	return 0;
> 	+}
> 	+
> 	+static unsigned
> 	+nb_stack_get_count(const struct rte_mempool *mp)
> 	+{
> 	+	struct rte_mempool_nb_stack *s = mp->pool_data;
> 	+
> 	+	return nb_lifo_len(&s->used_lifo);
> 	+}
> 	+
> 	+static void
> 	+nb_stack_free(struct rte_mempool *mp)
> 	+{
> 	+	rte_free((void *)(mp->pool_data));
> 
> 
> I think type cast is not required.
> 
> 

Will fix.

> 
> 
> 	+}
> 	+
> 	+static struct rte_mempool_ops ops_nb_stack = {
> 	+	.name = "nb_stack",
> 	+	.alloc = nb_stack_alloc,
> 	+	.free = nb_stack_free,
> 	+	.enqueue = nb_stack_enqueue,
> 	+	.dequeue = nb_stack_dequeue,
> 	+	.get_count = nb_stack_get_count
> 	+};
> 	+
> 	+MEMPOOL_REGISTER_OPS(ops_nb_stack);
> 	diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> 	new file mode 100644
> 	index 000000000..fc8c95e91
> 	--- /dev/null
> 	+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> 	@@ -0,0 +1,4 @@
> 	+DPDK_19.05 {
> 	+
> 	+	local: *;
> 	+};
> 	diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> 	index 02e8b6f05..0b11d9417 100644
> 	--- a/mk/rte.app.mk
> 	+++ b/mk/rte.app.mk
> 	@@ -133,6 +133,7 @@ ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
> 
> 	 _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET) += -lrte_mempool_bucket
> 	 _LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
> 	+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK)  += -lrte_mempool_nb_stack
> 
> 
> It is better to sort it alphabetically.
> 

Will fix.

Appreciate the detailed review!

Thanks,
Gage

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [dpdk-dev] [PATCH v2 0/2] Add non-blocking stack mempool handler
  2019-01-10 20:55 [dpdk-dev] [PATCH 0/3] Add non-blocking stack mempool handler Gage Eads
                   ` (2 preceding siblings ...)
  2019-01-10 20:55 ` [dpdk-dev] [PATCH 3/3] doc: add NB stack comment to EAL "known issues" Gage Eads
@ 2019-01-15 22:32 ` Gage Eads
  2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
                     ` (2 more replies)
  3 siblings, 3 replies; 43+ messages in thread
From: Gage Eads @ 2019-01-15 22:32 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking stack mempool handler. Note that the
non-blocking algorithm relies on a 128-bit compare-and-swap, so it is limited
to x86_64 machines.

In mempool_perf_autotest the lock-based stack outperforms the non-blocking
handler*, however:
- For applications with preemptible pthreads, a lock-based stack's
  worst-case performance (i.e. one thread being preempted while
  holding the spinlock) is much worse than the non-blocking stack's.
- Using per-thread mempool caches will largely mitigate the performance
  difference.

*Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. The lock-based stack's
rate_persec was 1x-3.5x the non-blocking stack's.
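
As background on the ABA protection mentioned above: the handler pairs the stack-top pointer with a modification counter and compares/swaps both as one 128-bit unit. A minimal single-threaded sketch of that check (illustrative names and a non-atomic model, not the patch's code):

```c
#include <stdint.h>

struct head { void *top; uint64_t cnt; }; /* 16 bytes, swapped as one unit */

/* Model of one CAS attempt on the stack head: it succeeds only if both
 * the pointer AND the counter match, so a head that was popped and
 * re-pushed (same pointer, bumped counter) is correctly rejected. */
static int
try_swing(struct head *h, struct head expected, void *new_top)
{
	if (h->top != expected.top || h->cnt != expected.cnt)
		return 0; /* lost a race (or hit ABA): reread and retry */
	h->top = new_top;
	h->cnt = expected.cnt + 1;
	return 1;
}
```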

v2:
 - Merge separate docs commit into patch #2
 - Fixed two space-after-typecast issues
 - Fix alphabetical sorting for build files
 - Remove unnecessary include path from nb_stack/Makefile
 - Add a comment to nb_lifo_len() justifying its approximate behavior
 - Fix comparison with NULL
 - Remove unnecessary void * cast
 - Fix meson builds and limit them to x86_64
 - Fix missing library error for non-x86_64 builds

Gage Eads (2):
  eal: add 128-bit cmpset (x86-64 only)
  mempool/nb_stack: add non-blocking stack mempool

 MAINTAINERS                                        |   4 +
 config/common_base                                 |   1 +
 doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
 drivers/mempool/Makefile                           |   3 +
 drivers/mempool/meson.build                        |   5 +
 drivers/mempool/nb_stack/Makefile                  |  23 ++++
 drivers/mempool/nb_stack/meson.build               |   4 +
 drivers/mempool/nb_stack/nb_lifo.h                 | 147 +++++++++++++++++++++
 drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 ++++++++++++++++++
 .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
 .../common/include/arch/x86/rte_atomic_64.h        |  22 +++
 mk/rte.app.mk                                      |   7 +-
 12 files changed, 348 insertions(+), 2 deletions(-)
 create mode 100644 drivers/mempool/nb_stack/Makefile
 create mode 100644 drivers/mempool/nb_stack/meson.build
 create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map

-- 
2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-15 22:32 ` [dpdk-dev] [PATCH v2 0/2] Add non-blocking stack mempool handler Gage Eads
@ 2019-01-15 22:32   ` Gage Eads
  2019-01-17  8:49     ` Gavin Hu (Arm Technology China)
  2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
  2019-01-16 15:18   ` [dpdk-dev] [PATCH v3 0/2] Add non-blocking stack mempool handler Gage Eads
  2 siblings, 1 reply; 43+ messages in thread
From: Gage Eads @ 2019-01-15 22:32 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

This operation can be used for non-blocking algorithms, such as a
non-blocking stack or ring.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 .../common/include/arch/x86/rte_atomic_64.h        | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
index fd2ec9c53..34c2addf8 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
@@ -34,6 +34,7 @@
 /*
  * Inspired from FreeBSD src/sys/amd64/include/atomic.h
  * Copyright (c) 1998 Doug Rabson
+ * Copyright (c) 2019 Intel Corporation
  * All rights reserved.
  */
 
@@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
 }
 #endif
 
+static inline int
+rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
+{
+	uint8_t res;
+
+	asm volatile (
+		      MPLOCKED
+		      "cmpxchg16b %[dst];"
+		      " sete %[res]"
+		      : [dst] "=m" (*dst),
+			[res] "=r" (res)
+		      : "c" (src[1]),
+			"b" (src[0]),
+			"m" (*dst),
+			"d" (exp[1]),
+			"a" (exp[0])
+		      : "memory");
+
+	return res;
+}
+
 #endif /* _RTE_ATOMIC_X86_64_H_ */
-- 
2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-15 22:32 ` [dpdk-dev] [PATCH v2 0/2] Add non-blocking stack mempool handler Gage Eads
  2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
@ 2019-01-15 22:32   ` Gage Eads
  2019-01-16  7:13     ` Andrew Rybchenko
  2019-01-17  8:06     ` Gavin Hu (Arm Technology China)
  2019-01-16 15:18   ` [dpdk-dev] [PATCH v3 0/2] Add non-blocking stack mempool handler Gage Eads
  2 siblings, 2 replies; 43+ messages in thread
From: Gage Eads @ 2019-01-15 22:32 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

This commit adds support for non-blocking (linked list based) stack mempool
handler. The stack uses a 128-bit compare-and-swap instruction, and thus is
limited to x86_64. The 128-bit CAS atomically updates the stack top pointer
and a modification counter, which protects against the ABA problem.

In mempool_perf_autotest the lock-based stack outperforms the non-blocking
handler*, however:
- For applications with preemptible pthreads, a lock-based stack's
  worst-case performance (i.e. one thread being preempted while
  holding the spinlock) is much worse than the non-blocking stack's.
- Using per-thread mempool caches will largely mitigate the performance
  difference.

*Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. The lock-based stack's
rate_persec was 1x-3.5x the non-blocking stack's.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 MAINTAINERS                                        |   4 +
 config/common_base                                 |   1 +
 doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
 drivers/mempool/Makefile                           |   3 +
 drivers/mempool/meson.build                        |   5 +
 drivers/mempool/nb_stack/Makefile                  |  23 ++++
 drivers/mempool/nb_stack/meson.build               |   4 +
 drivers/mempool/nb_stack/nb_lifo.h                 | 147 +++++++++++++++++++++
 drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 ++++++++++++++++++
 .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
 mk/rte.app.mk                                      |   7 +-
 11 files changed, 326 insertions(+), 2 deletions(-)
 create mode 100644 drivers/mempool/nb_stack/Makefile
 create mode 100644 drivers/mempool/nb_stack/meson.build
 create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 470f36b9c..5519d3323 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
 M: Andrew Rybchenko <arybchenko@solarflare.com>
 F: drivers/mempool/bucket/
 
+Non-blocking stack memory pool
+M: Gage Eads <gage.eads@intel.com>
+F: drivers/mempool/nb_stack/
+
 
 Bus Drivers
 -----------
diff --git a/config/common_base b/config/common_base
index 964a6956e..8a51f36b1 100644
--- a/config/common_base
+++ b/config/common_base
@@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
 #
 CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
 CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
+CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
 CONFIG_RTE_DRIVER_MEMPOOL_RING=y
 CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 929d76dba..9497b879c 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,6 +541,11 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
+  Alternatively, x86_64 applications can use the non-blocking stack mempool handler. When considering this handler, note that:
+
+  - it is limited to the x86_64 platform, because it uses an instruction (16-byte compare-and-swap) that is not available on other platforms.
+  - it has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of handler operations.
+
 + rte_timer
 
   Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
index 28c2e8360..895cf8a34 100644
--- a/drivers/mempool/Makefile
+++ b/drivers/mempool/Makefile
@@ -10,6 +10,9 @@ endif
 ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
 DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2
 endif
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
+endif
 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
 DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
diff --git a/drivers/mempool/meson.build b/drivers/mempool/meson.build
index 4527d9806..01ee30fee 100644
--- a/drivers/mempool/meson.build
+++ b/drivers/mempool/meson.build
@@ -2,6 +2,11 @@
 # Copyright(c) 2017 Intel Corporation
 
 drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
+
+if dpdk_conf.has('RTE_ARCH_X86_64')
+	drivers += 'nb_stack'
+endif
+
 std_deps = ['mempool']
 config_flag_fmt = 'RTE_LIBRTE_@0@_MEMPOOL'
 driver_name_fmt = 'rte_mempool_@0@'
diff --git a/drivers/mempool/nb_stack/Makefile b/drivers/mempool/nb_stack/Makefile
new file mode 100644
index 000000000..318b18283
--- /dev/null
+++ b/drivers/mempool/nb_stack/Makefile
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_mempool_nb_stack.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+# Headers
+LDLIBS += -lrte_eal -lrte_mempool
+
+EXPORT_MAP := rte_mempool_nb_stack_version.map
+
+LIBABIVER := 1
+
+SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
new file mode 100644
index 000000000..66d64a9ba
--- /dev/null
+++ b/drivers/mempool/nb_stack/meson.build
@@ -0,0 +1,4 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+sources = files('rte_mempool_nb_stack.c')
diff --git a/drivers/mempool/nb_stack/nb_lifo.h b/drivers/mempool/nb_stack/nb_lifo.h
new file mode 100644
index 000000000..2edae1c0f
--- /dev/null
+++ b/drivers/mempool/nb_stack/nb_lifo.h
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _NB_LIFO_H_
+#define _NB_LIFO_H_
+
+struct nb_lifo_elem {
+	void *data;
+	struct nb_lifo_elem *next;
+};
+
+struct nb_lifo_head {
+	struct nb_lifo_elem *top; /**< Stack top */
+	uint64_t cnt; /**< Modification counter */
+};
+
+struct nb_lifo {
+	volatile struct nb_lifo_head head __rte_aligned(16);
+	rte_atomic64_t len;
+} __rte_cache_aligned;
+
+static __rte_always_inline void
+nb_lifo_init(struct nb_lifo *lifo)
+{
+	memset(lifo, 0, sizeof(*lifo));
+	rte_atomic64_set(&lifo->len, 0);
+}
+
+static __rte_always_inline unsigned int
+nb_lifo_len(struct nb_lifo *lifo)
+{
+	/* nb_lifo_push() and nb_lifo_pop() do not update the list's contents
+	 * and lifo->len atomically, which can cause the list to appear shorter
+	 * than it actually is if this function is called while other threads
+	 * are modifying the list.
+	 *
+	 * However, given the inherently approximate nature of the get_count
+	 * callback -- even if the list and its size were updated atomically,
+	 * the size could change between when get_count executes and when the
+	 * value is returned to the caller -- this is acceptable.
+	 *
+	 * The lifo->len updates are placed such that the list may appear to
+	 * have fewer elements than it does, but will never appear to have more
+	 * elements. If the mempool is near-empty to the point that this is a
+	 * concern, the user should consider increasing the mempool size.
+	 */
+	return (unsigned int)rte_atomic64_read(&lifo->len);
+}
+
+static __rte_always_inline void
+nb_lifo_push(struct nb_lifo *lifo,
+	     struct nb_lifo_elem *first,
+	     struct nb_lifo_elem *last,
+	     unsigned int num)
+{
+	while (1) {
+		struct nb_lifo_head old_head, new_head;
+
+		old_head = lifo->head;
+
+		/* Swing the top pointer to the first element in the list and
+		 * make the last element point to the old top.
+		 */
+		new_head.top = first;
+		new_head.cnt = old_head.cnt + 1;
+
+		last->next = old_head.top;
+
+		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
+					 (uint64_t *)&old_head,
+					 (uint64_t *)&new_head))
+			break;
+	}
+
+	rte_atomic64_add(&lifo->len, num);
+}
+
+static __rte_always_inline void
+nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
+{
+	nb_lifo_push(lifo, elem, elem, 1);
+}
+
+static __rte_always_inline struct nb_lifo_elem *
+nb_lifo_pop(struct nb_lifo *lifo,
+	    unsigned int num,
+	    void **obj_table,
+	    struct nb_lifo_elem **last)
+{
+	struct nb_lifo_head old_head;
+
+	/* Reserve num elements, if available */
+	while (1) {
+		uint64_t len = rte_atomic64_read(&lifo->len);
+
+		/* Does the list contain enough elements? */
+		if (len < num)
+			return NULL;
+
+		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
+					len, len - num))
+			break;
+	}
+
+	/* Pop num elements */
+	while (1) {
+		struct nb_lifo_head new_head;
+		struct nb_lifo_elem *tmp;
+		unsigned int i;
+
+		old_head = lifo->head;
+
+		tmp = old_head.top;
+
+		/* Traverse the list to find the new head. A next pointer will
+		 * either point to another element or NULL; if a thread
+		 * encounters a pointer that has already been popped, the CAS
+		 * will fail.
+		 */
+		for (i = 0; i < num && tmp != NULL; i++) {
+			if (obj_table)
+				obj_table[i] = tmp->data;
+			if (last)
+				*last = tmp;
+			tmp = tmp->next;
+		}
+
+		/* If NULL was encountered, the list was modified while
+		 * traversing it. Retry.
+		 */
+		if (i != num)
+			continue;
+
+		new_head.top = tmp;
+		new_head.cnt = old_head.cnt + 1;
+
+		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
+					 (uint64_t *)&old_head,
+					 (uint64_t *)&new_head))
+			break;
+	}
+
+	return old_head.top;
+}
+
+#endif /* _NB_LIFO_H_ */
diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
new file mode 100644
index 000000000..1818a2cfa
--- /dev/null
+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <stdio.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+
+#include "nb_lifo.h"
+
+struct rte_mempool_nb_stack {
+	uint64_t size;
+	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers  */
+	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
+};
+
+static int
+nb_stack_alloc(struct rte_mempool *mp)
+{
+	struct rte_mempool_nb_stack *s;
+	struct nb_lifo_elem *elems;
+	unsigned int n = mp->size;
+	unsigned int size, i;
+
+	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
+
+	/* Allocate our local memory structure */
+	s = rte_zmalloc_socket("mempool-nb_stack",
+			       size,
+			       RTE_CACHE_LINE_SIZE,
+			       mp->socket_id);
+	if (s == NULL) {
+		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
+		return -ENOMEM;
+	}
+
+	s->size = n;
+
+	nb_lifo_init(&s->used_lifo);
+	nb_lifo_init(&s->free_lifo);
+
+	elems = (struct nb_lifo_elem *)&s[1];
+	for (i = 0; i < n; i++)
+		nb_lifo_push_single(&s->free_lifo, &elems[i]);
+
+	mp->pool_data = s;
+
+	return 0;
+}
+
+static int
+nb_stack_enqueue(struct rte_mempool *mp, void * const *obj_table,
+		 unsigned int n)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+	struct nb_lifo_elem *first, *last, *tmp;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Pop n free elements */
+	first = nb_lifo_pop(&s->free_lifo, n, NULL, NULL);
+	if (unlikely(first == NULL))
+		return -ENOBUFS;
+
+	/* Prepare the list elements */
+	tmp = first;
+	for (i = 0; i < n; i++) {
+		tmp->data = obj_table[i];
+		last = tmp;
+		tmp = tmp->next;
+	}
+
+	/* Enqueue them to the used list */
+	nb_lifo_push(&s->used_lifo, first, last, n);
+
+	return 0;
+}
+
+static int
+nb_stack_dequeue(struct rte_mempool *mp, void **obj_table,
+		 unsigned int n)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+	struct nb_lifo_elem *first, *last;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Pop n used elements */
+	first = nb_lifo_pop(&s->used_lifo, n, obj_table, &last);
+	if (unlikely(first == NULL))
+		return -ENOENT;
+
+	/* Enqueue the list elements to the free list */
+	nb_lifo_push(&s->free_lifo, first, last, n);
+
+	return 0;
+}
+
+static unsigned
+nb_stack_get_count(const struct rte_mempool *mp)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+
+	return nb_lifo_len(&s->used_lifo);
+}
+
+static void
+nb_stack_free(struct rte_mempool *mp)
+{
+	rte_free(mp->pool_data);
+}
+
+static struct rte_mempool_ops ops_nb_stack = {
+	.name = "nb_stack",
+	.alloc = nb_stack_alloc,
+	.free = nb_stack_free,
+	.enqueue = nb_stack_enqueue,
+	.dequeue = nb_stack_dequeue,
+	.get_count = nb_stack_get_count
+};
+
+MEMPOOL_REGISTER_OPS(ops_nb_stack);
diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
new file mode 100644
index 000000000..fc8c95e91
--- /dev/null
+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
@@ -0,0 +1,4 @@
+DPDK_19.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 02e8b6f05..d4b4aaaf6 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -131,8 +131,11 @@ endif
 ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 # plugins (link only if static libraries)
 
-_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET) += -lrte_mempool_bucket
-_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET)   += -lrte_mempool_bucket
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += -lrte_mempool_nb_stack
+endif
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)    += -lrte_mempool_stack
 ifeq ($(CONFIG_RTE_LIBRTE_DPAA_BUS),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA_MEMPOOL)   += -lrte_mempool_dpaa
 endif
-- 
2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only)
  2019-01-14 15:46       ` Eads, Gage
@ 2019-01-16  4:34         ` Varghese, Vipin
  0 siblings, 0 replies; 43+ messages in thread
From: Varghese, Vipin @ 2019-01-16  4:34 UTC (permalink / raw)
  To: Eads, Gage, Andrew Rybchenko, dev
  Cc: olivier.matz, Richardson, Bruce, Ananyev, Konstantin

Thanks, Gage, for clarifying and correcting. Appreciate the same.

> -----Original Message-----
> From: Eads, Gage
> Sent: Monday, January 14, 2019 9:17 PM
> To: Varghese, Vipin <vipin.varghese@intel.com>; Andrew Rybchenko
> <arybchenko@solarflare.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; Richardson, Bruce <bruce.richardson@intel.com>;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only)
> 
> 
> 
> > -----Original Message-----
> > From: Varghese, Vipin
> > Sent: Sunday, January 13, 2019 10:29 PM
> > To: Andrew Rybchenko <arybchenko@solarflare.com>; Eads, Gage
> > <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>
> > Subject: RE: [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64
> > only)
> >
> > Hi Gage,
> >
> > snipped
> > > > @@ -208,4 +209,25 @@ static inline void
> > > rte_atomic64_clear(rte_atomic64_t *v)
> > > >   }
> > > >   #endif
> > > >
> > > > +static inline int
> > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > +uint64_t
> > > > +*src) {
> > > > +	uint8_t res;
> > > > +
> > > > +	asm volatile (
> > > > +		      MPLOCKED
> > > > +		      "cmpxchg16b %[dst];"
> > > > +		      " sete %[res]"
> > > > +		      : [dst] "=m" (*dst),
> > > > +			[res] "=r" (res)
> > > > +		      : "c" (src[1]),
> > > > +			"b" (src[0]),
> > > > +			"m" (*dst),
> > > > +			"d" (exp[1]),
> > > > +			"a" (exp[0])
> > > > +		      : "memory");
> > Since the update depends on the 'set|unset' value of ZF, should we
> > first set ZF to 0?
> >
> > Apologies in advance if it is internally taken care by 'sete'.
> 
> cmpxchg16b will set the ZF if the compared values are equal, else it will clear the
> ZF, so there's no need to initialize the ZF.
> 
> Source: https://www.felixcloutier.com/x86/cmpxchg8b:cmpxchg16b
> 
> Thanks,
> Gage

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
@ 2019-01-16  7:13     ` Andrew Rybchenko
  2019-01-17  8:06     ` Gavin Hu (Arm Technology China)
  1 sibling, 0 replies; 43+ messages in thread
From: Andrew Rybchenko @ 2019-01-16  7:13 UTC (permalink / raw)
  To: Gage Eads, dev; +Cc: olivier.matz, bruce.richardson, konstantin.ananyev

On 1/16/19 1:32 AM, Gage Eads wrote:
> This commit adds support for non-blocking (linked list based) stack mempool
> handler. The stack uses a 128-bit compare-and-swap instruction, and thus is
> limited to x86_64. The 128-bit CAS atomically updates the stack top pointer
> and a modification counter, which protects against the ABA problem.
>
> In mempool_perf_autotest the lock-based stack outperforms the non-blocking
> handler*, however:
> - For applications with preemptible pthreads, a lock-based stack's
>    worst-case performance (i.e. one thread being preempted while
>    holding the spinlock) is much worse than the non-blocking stack's.
> - Using per-thread mempool caches will largely mitigate the performance
>    difference.
>
> *Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
> running on isolcpus cores with a tickless scheduler. The lock-based stack's
> rate_persec was 1x-3.5x the non-blocking stack's.
>
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---

Few minor nits below. Other than that
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>

Don't forget about release notes when 19.05 release cycle starts.

[snip]

> diff --git a/drivers/mempool/meson.build b/drivers/mempool/meson.build
> index 4527d9806..01ee30fee 100644
> --- a/drivers/mempool/meson.build
> +++ b/drivers/mempool/meson.build
> @@ -2,6 +2,11 @@
>   # Copyright(c) 2017 Intel Corporation
>   
>   drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
> +
> +if dpdk_conf.has('RTE_ARCH_X86_64')
> +	drivers += 'nb_stack'
> +endif
> +

I think it would be better to concentrate the logic inside 
nb_stack/meson.build.
There is a 'build' variable which may be set to false to disable the build.
You can find an example in drivers/net/sfc/meson.build.
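As a sketch of that suggestion (mirroring the drivers/net/sfc approach the reviewer points to; the v3 revision later in this thread adopts the same check):

```meson
# drivers/mempool/nb_stack/meson.build: disable the driver itself on
# unsupported targets instead of filtering in the parent directory
if arch_subdir != 'x86' or cc.sizeof('void *') == 4
	build = false
endif

sources = files('rte_mempool_nb_stack.c')
```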

[snip]

> +static __rte_always_inline void
> +nb_lifo_push(struct nb_lifo *lifo,
> +	     struct nb_lifo_elem *first,
> +	     struct nb_lifo_elem *last,
> +	     unsigned int num)
> +{
> +	while (1) {
> +		struct nb_lifo_head old_head, new_head;
> +
> +		old_head = lifo->head;
> +
> +		/* Swing the top pointer to the first element in the list and
> +		 * make the last element point to the old top.
> +		 */
> +		new_head.top = first;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		last->next = old_head.top;
> +
> +		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,

Unnecessary space after type cast above.

[snip]

> +		new_head.top = tmp;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,

Unnecessary space after type cast above.

[snip]

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [dpdk-dev] [PATCH v3 0/2] Add non-blocking stack mempool handler
  2019-01-15 22:32 ` [dpdk-dev] [PATCH v2 0/2] Add non-blocking stack mempool handler Gage Eads
  2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
  2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
@ 2019-01-16 15:18   ` Gage Eads
  2019-01-16 15:18     ` [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
                       ` (2 more replies)
  2 siblings, 3 replies; 43+ messages in thread
From: Gage Eads @ 2019-01-16 15:18 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking stack mempool handler. Note that the
non-blocking algorithm relies on a 128-bit compare-and-swap, so it is limited
to x86_64 machines.

In mempool_perf_autotest the lock-based stack outperforms the non-blocking
handler*, however:
- For applications with preemptible pthreads, a lock-based stack's
  worst-case performance (i.e. one thread being preempted while
  holding the spinlock) is much worse than the non-blocking stack's.
- Using per-thread mempool caches will largely mitigate the performance
  difference.

*Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. The lock-based stack's
rate_persec was 1x-3.5x the non-blocking stack's.

v3:
 - Fix two more space-after-typecast issues
 - Rework nb_stack's meson.build x86_64 check, borrowing from net/sfc/

v2:
 - Merge separate docs commit into patch #2
 - Fix two space-after-typecast issues
 - Fix alphabetical sorting for build files
 - Remove unnecessary include path from nb_stack/Makefile
 - Add a comment to nb_lifo_len() justifying its approximate behavior
 - Fix comparison with NULL
 - Remove unnecessary void * cast
 - Fix meson builds and limit them to x86_64
 - Fix missing library error for non-x86_64 builds

Gage Eads (2):
  eal: add 128-bit cmpset (x86-64 only)
  mempool/nb_stack: add non-blocking stack mempool

 MAINTAINERS                                        |   4 +
 config/common_base                                 |   1 +
 doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
 drivers/mempool/Makefile                           |   3 +
 drivers/mempool/meson.build                        |   3 +-
 drivers/mempool/nb_stack/Makefile                  |  23 ++++
 drivers/mempool/nb_stack/meson.build               |   8 ++
 drivers/mempool/nb_stack/nb_lifo.h                 | 147 +++++++++++++++++++++
 drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 ++++++++++++++++++
 .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
 .../common/include/arch/x86/rte_atomic_64.h        |  22 +++
 mk/rte.app.mk                                      |   7 +-
 12 files changed, 349 insertions(+), 3 deletions(-)
 create mode 100644 drivers/mempool/nb_stack/Makefile
 create mode 100644 drivers/mempool/nb_stack/meson.build
 create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map

-- 
2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-16 15:18   ` [dpdk-dev] [PATCH v3 0/2] Add non-blocking stack mempool handler Gage Eads
@ 2019-01-16 15:18     ` Gage Eads
  2019-01-17 15:45       ` Honnappa Nagarahalli
  2019-01-16 15:18     ` [dpdk-dev] [PATCH v3 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
  2019-01-17 15:36     ` [dpdk-dev] [PATCH v4 0/2] Add non-blocking stack mempool handler Gage Eads
  2 siblings, 1 reply; 43+ messages in thread
From: Gage Eads @ 2019-01-16 15:18 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

This operation can be used for non-blocking algorithms, such as a
non-blocking stack or ring.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 .../common/include/arch/x86/rte_atomic_64.h        | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
index fd2ec9c53..34c2addf8 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
@@ -34,6 +34,7 @@
 /*
  * Inspired from FreeBSD src/sys/amd64/include/atomic.h
  * Copyright (c) 1998 Doug Rabson
+ * Copyright (c) 2019 Intel Corporation
  * All rights reserved.
  */
 
@@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
 }
 #endif
 
+static inline int
+rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
+{
+	uint8_t res;
+
+	asm volatile (
+		      MPLOCKED
+		      "cmpxchg16b %[dst];"
+		      " sete %[res]"
+		      : [dst] "=m" (*dst),
+			[res] "=r" (res)
+		      : "c" (src[1]),
+			"b" (src[0]),
+			"m" (*dst),
+			"d" (exp[1]),
+			"a" (exp[0])
+		      : "memory");
+
+	return res;
+}
+
 #endif /* _RTE_ATOMIC_X86_64_H_ */
-- 
2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [dpdk-dev] [PATCH v3 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-16 15:18   ` [dpdk-dev] [PATCH v3 0/2] Add non-blocking stack mempool handler Gage Eads
  2019-01-16 15:18     ` [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
@ 2019-01-16 15:18     ` Gage Eads
  2019-01-17 15:36     ` [dpdk-dev] [PATCH v4 0/2] Add non-blocking stack mempool handler Gage Eads
  2 siblings, 0 replies; 43+ messages in thread
From: Gage Eads @ 2019-01-16 15:18 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

This commit adds support for a non-blocking (linked-list-based) stack mempool
handler. The stack uses a 128-bit compare-and-swap instruction, and is thus
limited to x86_64. The 128-bit CAS atomically updates the stack top pointer
and a modification counter, which protects against the ABA problem.

In mempool_perf_autotest the lock-based stack outperforms the non-blocking
handler*, however:
- For applications with preemptible pthreads, a lock-based stack's
  worst-case performance (i.e. one thread being preempted while
  holding the spinlock) is much worse than the non-blocking stack's.
- Using per-thread mempool caches will largely mitigate the performance
  difference.

*Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. The lock-based stack's
rate_persec was 1x-3.5x the non-blocking stack's.

Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
---
 MAINTAINERS                                        |   4 +
 config/common_base                                 |   1 +
 doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
 drivers/mempool/Makefile                           |   3 +
 drivers/mempool/meson.build                        |   3 +-
 drivers/mempool/nb_stack/Makefile                  |  23 ++++
 drivers/mempool/nb_stack/meson.build               |   8 ++
 drivers/mempool/nb_stack/nb_lifo.h                 | 147 +++++++++++++++++++++
 drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 ++++++++++++++++++
 .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
 mk/rte.app.mk                                      |   7 +-
 11 files changed, 327 insertions(+), 3 deletions(-)
 create mode 100644 drivers/mempool/nb_stack/Makefile
 create mode 100644 drivers/mempool/nb_stack/meson.build
 create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 470f36b9c..5519d3323 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
 M: Andrew Rybchenko <arybchenko@solarflare.com>
 F: drivers/mempool/bucket/
 
+Non-blocking stack memory pool
+M: Gage Eads <gage.eads@intel.com>
+F: drivers/mempool/nb_stack/
+
 
 Bus Drivers
 -----------
diff --git a/config/common_base b/config/common_base
index 964a6956e..8a51f36b1 100644
--- a/config/common_base
+++ b/config/common_base
@@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
 #
 CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
 CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
+CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
 CONFIG_RTE_DRIVER_MEMPOOL_RING=y
 CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 929d76dba..9497b879c 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,6 +541,11 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
+  Alternatively, x86_64 applications can use the non-blocking stack mempool handler. When considering this handler, note that:
+
+  - it is limited to the x86_64 platform, because it uses an instruction (16-byte compare-and-swap) that is not available on other platforms.
+  - it has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of handler operations.
+
 + rte_timer
 
   Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
index 28c2e8360..895cf8a34 100644
--- a/drivers/mempool/Makefile
+++ b/drivers/mempool/Makefile
@@ -10,6 +10,9 @@ endif
 ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
 DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2
 endif
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
+endif
 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
 DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
diff --git a/drivers/mempool/meson.build b/drivers/mempool/meson.build
index 4527d9806..220cfaf63 100644
--- a/drivers/mempool/meson.build
+++ b/drivers/mempool/meson.build
@@ -1,7 +1,8 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
-drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
+drivers = ['bucket', 'dpaa', 'dpaa2', 'nb_stack', 'octeontx', 'ring', 'stack']
+
 std_deps = ['mempool']
 config_flag_fmt = 'RTE_LIBRTE_@0@_MEMPOOL'
 driver_name_fmt = 'rte_mempool_@0@'
diff --git a/drivers/mempool/nb_stack/Makefile b/drivers/mempool/nb_stack/Makefile
new file mode 100644
index 000000000..318b18283
--- /dev/null
+++ b/drivers/mempool/nb_stack/Makefile
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_mempool_nb_stack.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+# Headers
+LDLIBS += -lrte_eal -lrte_mempool
+
+EXPORT_MAP := rte_mempool_nb_stack_version.map
+
+LIBABIVER := 1
+
+SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
new file mode 100644
index 000000000..4a699511d
--- /dev/null
+++ b/drivers/mempool/nb_stack/meson.build
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+if arch_subdir != 'x86' or cc.sizeof('void *') == 4
+	build = false
+endif
+
+sources = files('rte_mempool_nb_stack.c')
diff --git a/drivers/mempool/nb_stack/nb_lifo.h b/drivers/mempool/nb_stack/nb_lifo.h
new file mode 100644
index 000000000..ad4a3401f
--- /dev/null
+++ b/drivers/mempool/nb_stack/nb_lifo.h
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _NB_LIFO_H_
+#define _NB_LIFO_H_
+
+struct nb_lifo_elem {
+	void *data;
+	struct nb_lifo_elem *next;
+};
+
+struct nb_lifo_head {
+	struct nb_lifo_elem *top; /**< Stack top */
+	uint64_t cnt; /**< Modification counter */
+};
+
+struct nb_lifo {
+	volatile struct nb_lifo_head head __rte_aligned(16);
+	rte_atomic64_t len;
+} __rte_cache_aligned;
+
+static __rte_always_inline void
+nb_lifo_init(struct nb_lifo *lifo)
+{
+	memset(lifo, 0, sizeof(*lifo));
+	rte_atomic64_set(&lifo->len, 0);
+}
+
+static __rte_always_inline unsigned int
+nb_lifo_len(struct nb_lifo *lifo)
+{
+	/* nb_lifo_push() and nb_lifo_pop() do not update the list's contents
+	 * and lifo->len atomically, which can cause the list to appear shorter
+	 * than it actually is if this function is called while other threads
+	 * are modifying the list.
+	 *
+	 * However, given the inherently approximate nature of the get_count
+	 * callback -- even if the list and its size were updated atomically,
+	 * the size could change between when get_count executes and when the
+	 * value is returned to the caller -- this is acceptable.
+	 *
+	 * The lifo->len updates are placed such that the list may appear to
+	 * have fewer elements than it does, but will never appear to have more
+	 * elements. If the mempool is near-empty to the point that this is a
+	 * concern, the user should consider increasing the mempool size.
+	 */
+	return (unsigned int)rte_atomic64_read(&lifo->len);
+}
+
+static __rte_always_inline void
+nb_lifo_push(struct nb_lifo *lifo,
+	     struct nb_lifo_elem *first,
+	     struct nb_lifo_elem *last,
+	     unsigned int num)
+{
+	while (1) {
+		struct nb_lifo_head old_head, new_head;
+
+		old_head = lifo->head;
+
+		/* Swing the top pointer to the first element in the list and
+		 * make the last element point to the old top.
+		 */
+		new_head.top = first;
+		new_head.cnt = old_head.cnt + 1;
+
+		last->next = old_head.top;
+
+		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
+					 (uint64_t *)&old_head,
+					 (uint64_t *)&new_head))
+			break;
+	}
+
+	rte_atomic64_add(&lifo->len, num);
+}
+
+static __rte_always_inline void
+nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
+{
+	nb_lifo_push(lifo, elem, elem, 1);
+}
+
+static __rte_always_inline struct nb_lifo_elem *
+nb_lifo_pop(struct nb_lifo *lifo,
+	    unsigned int num,
+	    void **obj_table,
+	    struct nb_lifo_elem **last)
+{
+	struct nb_lifo_head old_head;
+
+	/* Reserve num elements, if available */
+	while (1) {
+		uint64_t len = rte_atomic64_read(&lifo->len);
+
+		/* Does the list contain enough elements? */
+		if (len < num)
+			return NULL;
+
+		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
+					len, len - num))
+			break;
+	}
+
+	/* Pop num elements */
+	while (1) {
+		struct nb_lifo_head new_head;
+		struct nb_lifo_elem *tmp;
+		unsigned int i;
+
+		old_head = lifo->head;
+
+		tmp = old_head.top;
+
+		/* Traverse the list to find the new head. A next pointer will
+		 * either point to another element or NULL; if a thread
+		 * encounters a pointer that has already been popped, the CAS
+		 * will fail.
+		 */
+		for (i = 0; i < num && tmp != NULL; i++) {
+			if (obj_table)
+				obj_table[i] = tmp->data;
+			if (last)
+				*last = tmp;
+			tmp = tmp->next;
+		}
+
+		/* If NULL was encountered, the list was modified while
+		 * traversing it. Retry.
+		 */
+		if (i != num)
+			continue;
+
+		new_head.top = tmp;
+		new_head.cnt = old_head.cnt + 1;
+
+		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
+					 (uint64_t *)&old_head,
+					 (uint64_t *)&new_head))
+			break;
+	}
+
+	return old_head.top;
+}
+
+#endif /* _NB_LIFO_H_ */
diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
new file mode 100644
index 000000000..1818a2cfa
--- /dev/null
+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <stdio.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+
+#include "nb_lifo.h"
+
+struct rte_mempool_nb_stack {
+	uint64_t size;
+	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers  */
+	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
+};
+
+static int
+nb_stack_alloc(struct rte_mempool *mp)
+{
+	struct rte_mempool_nb_stack *s;
+	struct nb_lifo_elem *elems;
+	unsigned int n = mp->size;
+	unsigned int size, i;
+
+	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
+
+	/* Allocate our local memory structure */
+	s = rte_zmalloc_socket("mempool-nb_stack",
+			       size,
+			       RTE_CACHE_LINE_SIZE,
+			       mp->socket_id);
+	if (s == NULL) {
+		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
+		return -ENOMEM;
+	}
+
+	s->size = n;
+
+	nb_lifo_init(&s->used_lifo);
+	nb_lifo_init(&s->free_lifo);
+
+	elems = (struct nb_lifo_elem *)&s[1];
+	for (i = 0; i < n; i++)
+		nb_lifo_push_single(&s->free_lifo, &elems[i]);
+
+	mp->pool_data = s;
+
+	return 0;
+}
+
+static int
+nb_stack_enqueue(struct rte_mempool *mp, void * const *obj_table,
+		 unsigned int n)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+	struct nb_lifo_elem *first, *last, *tmp;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Pop n free elements */
+	first = nb_lifo_pop(&s->free_lifo, n, NULL, NULL);
+	if (unlikely(first == NULL))
+		return -ENOBUFS;
+
+	/* Prepare the list elements */
+	tmp = first;
+	for (i = 0; i < n; i++) {
+		tmp->data = obj_table[i];
+		last = tmp;
+		tmp = tmp->next;
+	}
+
+	/* Enqueue them to the used list */
+	nb_lifo_push(&s->used_lifo, first, last, n);
+
+	return 0;
+}
+
+static int
+nb_stack_dequeue(struct rte_mempool *mp, void **obj_table,
+		 unsigned int n)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+	struct nb_lifo_elem *first, *last;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Pop n used elements */
+	first = nb_lifo_pop(&s->used_lifo, n, obj_table, &last);
+	if (unlikely(first == NULL))
+		return -ENOENT;
+
+	/* Enqueue the list elements to the free list */
+	nb_lifo_push(&s->free_lifo, first, last, n);
+
+	return 0;
+}
+
+static unsigned
+nb_stack_get_count(const struct rte_mempool *mp)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+
+	return nb_lifo_len(&s->used_lifo);
+}
+
+static void
+nb_stack_free(struct rte_mempool *mp)
+{
+	rte_free(mp->pool_data);
+}
+
+static struct rte_mempool_ops ops_nb_stack = {
+	.name = "nb_stack",
+	.alloc = nb_stack_alloc,
+	.free = nb_stack_free,
+	.enqueue = nb_stack_enqueue,
+	.dequeue = nb_stack_dequeue,
+	.get_count = nb_stack_get_count
+};
+
+MEMPOOL_REGISTER_OPS(ops_nb_stack);
diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
new file mode 100644
index 000000000..fc8c95e91
--- /dev/null
+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
@@ -0,0 +1,4 @@
+DPDK_19.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 02e8b6f05..d4b4aaaf6 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -131,8 +131,11 @@ endif
 ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 # plugins (link only if static libraries)
 
-_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET) += -lrte_mempool_bucket
-_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET)   += -lrte_mempool_bucket
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += -lrte_mempool_nb_stack
+endif
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)    += -lrte_mempool_stack
 ifeq ($(CONFIG_RTE_LIBRTE_DPAA_BUS),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA_MEMPOOL)   += -lrte_mempool_dpaa
 endif
-- 
2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
  2019-01-16  7:13     ` Andrew Rybchenko
@ 2019-01-17  8:06     ` Gavin Hu (Arm Technology China)
  2019-01-17 14:11       ` Eads, Gage
  1 sibling, 1 reply; 43+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-01-17  8:06 UTC (permalink / raw)
  To: Gage Eads, dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China)


> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> Sent: Wednesday, January 16, 2019 6:33 AM
> To: dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com
> Subject: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking
> stack mempool
>
> This commit adds support for non-blocking (linked list based) stack mempool
> handler. The stack uses a 128-bit compare-and-swap instruction, and thus is
> limited to x86_64. The 128-bit CAS atomically updates the stack top pointer
> and a modification counter, which protects against the ABA problem.
>
> In mempool_perf_autotest the lock-based stack outperforms the non-blocking
> handler*, however:
> - For applications with preemptible pthreads, a lock-based stack's
>   worst-case performance (i.e. one thread being preempted while
>   holding the spinlock) is much worse than the non-blocking stack's.
> - Using per-thread mempool caches will largely mitigate the performance
>   difference.
>
> *Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
> running on isolcpus cores with a tickless scheduler. The lock-based stack's
> rate_persec was 1x-3.5x the non-blocking stack's.
>
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  MAINTAINERS                                        |   4 +
>  config/common_base                                 |   1 +
>  doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
>  drivers/mempool/Makefile                           |   3 +
>  drivers/mempool/meson.build                        |   5 +
>  drivers/mempool/nb_stack/Makefile                  |  23 ++++
>  drivers/mempool/nb_stack/meson.build               |   4 +
>  drivers/mempool/nb_stack/nb_lifo.h                 | 147 +++++++++++++++
>  drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 ++++++++++++++
>  .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
>  mk/rte.app.mk                                      |   7 +-
>  11 files changed, 326 insertions(+), 2 deletions(-)
>  create mode 100644 drivers/mempool/nb_stack/Makefile
>  create mode 100644 drivers/mempool/nb_stack/meson.build
>  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
>  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
>  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 470f36b9c..5519d3323 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
>  M: Andrew Rybchenko <arybchenko@solarflare.com>
>  F: drivers/mempool/bucket/
>
> +Non-blocking stack memory pool
> +M: Gage Eads <gage.eads@intel.com>
> +F: drivers/mempool/nb_stack/
> +
>
>  Bus Drivers
>  -----------
> diff --git a/config/common_base b/config/common_base
> index 964a6956e..8a51f36b1 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
>  #
>  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
>  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y

NAK, as this applies to x86_64 only, it will break arm/ppc and even 32-bit i386 configurations.

>  CONFIG_RTE_DRIVER_MEMPOOL_RING=y
>  CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
>
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
> index 929d76dba..9497b879c 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -541,6 +541,11 @@ Known Issues
>
>    5. It MUST not be used by multi-producer/consumer pthreads, whose
> scheduling policies are SCHED_FIFO or SCHED_RR.
>
> +  Alternatively, x86_64 applications can use the non-blocking stack mempool
> +  handler. When considering this handler, note that:
> +
> +  - it is limited to the x86_64 platform, because it uses an instruction
> +    (16-byte compare-and-swap) that is not available on other platforms.
> +  - it has worse average-case performance than the non-preemptive rte_ring,
> +    but software caching (e.g. the mempool cache) can mitigate this by
> +    reducing the number of handler operations.
> +
>  + rte_timer
>
>    Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed.
> However, resetting/stopping the timer from a non-EAL pthread is allowed.
> diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
> index 28c2e8360..895cf8a34 100644
> --- a/drivers/mempool/Makefile
> +++ b/drivers/mempool/Makefile
> @@ -10,6 +10,9 @@ endif
>  ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
>  DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2
>  endif
> +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
> +endif
>  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
>  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
>  DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
> diff --git a/drivers/mempool/meson.build b/drivers/mempool/meson.build
> index 4527d9806..01ee30fee 100644
> --- a/drivers/mempool/meson.build
> +++ b/drivers/mempool/meson.build
> @@ -2,6 +2,11 @@
>  # Copyright(c) 2017 Intel Corporation
>
>  drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
> +
> +if dpdk_conf.has('RTE_ARCH_X86_64')
> +	drivers += 'nb_stack'
> +endif
> +
>  std_deps = ['mempool']
>  config_flag_fmt = 'RTE_LIBRTE_@0@_MEMPOOL'
>  driver_name_fmt = 'rte_mempool_@0@'
> diff --git a/drivers/mempool/nb_stack/Makefile b/drivers/mempool/nb_stack/Makefile
> new file mode 100644
> index 000000000..318b18283
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/Makefile
> @@ -0,0 +1,23 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(c) 2019 Intel Corporation
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +#
> +# library name
> +#
> +LIB = librte_mempool_nb_stack.a
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +
> +# Headers
> +LDLIBS += -lrte_eal -lrte_mempool
> +
> +EXPORT_MAP := rte_mempool_nb_stack_version.map
> +
> +LIBABIVER := 1
> +
> +SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
> new file mode 100644
> index 000000000..66d64a9ba
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/meson.build
> @@ -0,0 +1,4 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(c) 2019 Intel Corporation
> +
> +sources = files('rte_mempool_nb_stack.c')
> diff --git a/drivers/mempool/nb_stack/nb_lifo.h b/drivers/mempool/nb_stack/nb_lifo.h
> new file mode 100644
> index 000000000..2edae1c0f
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/nb_lifo.h
> @@ -0,0 +1,147 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#ifndef _NB_LIFO_H_
> +#define _NB_LIFO_H_
> +
> +struct nb_lifo_elem {
> +	void *data;
> +	struct nb_lifo_elem *next;
> +};
> +
> +struct nb_lifo_head {
> +	struct nb_lifo_elem *top; /**< Stack top */
> +	uint64_t cnt; /**< Modification counter */
> +};
> +
> +struct nb_lifo {
> +	volatile struct nb_lifo_head head __rte_aligned(16);
> +	rte_atomic64_t len;
> +} __rte_cache_aligned;
> +
> +static __rte_always_inline void
> +nb_lifo_init(struct nb_lifo *lifo)
> +{
> +	memset(lifo, 0, sizeof(*lifo));
> +	rte_atomic64_set(&lifo->len, 0);
> +}
> +
> +static __rte_always_inline unsigned int
> +nb_lifo_len(struct nb_lifo *lifo)
> +{
> +	/* nb_lifo_push() and nb_lifo_pop() do not update the list's contents
> +	 * and lifo->len atomically, which can cause the list to appear shorter
> +	 * than it actually is if this function is called while other threads
> +	 * are modifying the list.
> +	 *
> +	 * However, given the inherently approximate nature of the get_count
> +	 * callback -- even if the list and its size were updated atomically,
> +	 * the size could change between when get_count executes and when the
> +	 * value is returned to the caller -- this is acceptable.
> +	 *
> +	 * The lifo->len updates are placed such that the list may appear to
> +	 * have fewer elements than it does, but will never appear to have more
> +	 * elements. If the mempool is near-empty to the point that this is a
> +	 * concern, the user should consider increasing the mempool size.
> +	 */
> +	return (unsigned int)rte_atomic64_read(&lifo->len);
> +}
> +
> +static __rte_always_inline void
> +nb_lifo_push(struct nb_lifo *lifo,
> +	     struct nb_lifo_elem *first,
> +	     struct nb_lifo_elem *last,
> +	     unsigned int num)
> +{
> +	while (1) {
> +		struct nb_lifo_head old_head, new_head;
> +
> +		old_head = lifo->head;
> +
> +		/* Swing the top pointer to the first element in the list and
> +		 * make the last element point to the old top.
> +		 */
> +		new_head.top = first;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		last->next = old_head.top;
> +
> +		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
> +					 (uint64_t *)&old_head,
> +					 (uint64_t *)&new_head))
> +			break;
> +	}
> +
> +	rte_atomic64_add(&lifo->len, num);
> +}
> +
> +static __rte_always_inline void
> +nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
> +{
> +	nb_lifo_push(lifo, elem, elem, 1);
> +}
> +
> +static __rte_always_inline struct nb_lifo_elem *
> +nb_lifo_pop(struct nb_lifo *lifo,
> +	    unsigned int num,
> +	    void **obj_table,
> +	    struct nb_lifo_elem **last)
> +{
> +	struct nb_lifo_head old_head;
> +
> +	/* Reserve num elements, if available */
> +	while (1) {
> +		uint64_t len = rte_atomic64_read(&lifo->len);
> +
> +		/* Does the list contain enough elements? */
> +		if (len < num)
> +			return NULL;
> +
> +		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
> +					len, len - num))
> +			break;
> +	}
> +
> +	/* Pop num elements */
> +	while (1) {
> +		struct nb_lifo_head new_head;
> +		struct nb_lifo_elem *tmp;
> +		unsigned int i;
> +
> +		old_head = lifo->head;
> +
> +		tmp = old_head.top;
> +
> +		/* Traverse the list to find the new head. A next pointer will
> +		 * either point to another element or NULL; if a thread
> +		 * encounters a pointer that has already been popped, the CAS
> +		 * will fail.
> +		 */
> +		for (i = 0; i < num && tmp != NULL; i++) {
> +			if (obj_table)
> +				obj_table[i] = tmp->data;
> +			if (last)
> +				*last = tmp;
> +			tmp = tmp->next;
> +		}
> +
> +		/* If NULL was encountered, the list was modified while
> +		 * traversing it. Retry.
> +		 */
> +		if (i != num)
> +			continue;
> +
> +		new_head.top = tmp;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		if (rte_atomic128_cmpset((volatile uint64_t *) &lifo->head,
> +					 (uint64_t *)&old_head,
> +					 (uint64_t *)&new_head))
> +			break;
> +	}
> +
> +	return old_head.top;
> +}
> +
> +#endif /* _NB_LIFO_H_ */
> diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> new file mode 100644
> index 000000000..1818a2cfa
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#include <stdio.h>
> +#include <rte_mempool.h>
> +#include <rte_malloc.h>
> +
> +#include "nb_lifo.h"
> +
> +struct rte_mempool_nb_stack {
> +	uint64_t size;
> +	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers  */
> +	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
> +};
> +
> +static int
> +nb_stack_alloc(struct rte_mempool *mp)
> +{
> +	struct rte_mempool_nb_stack *s;
> +	struct nb_lifo_elem *elems;
> +	unsigned int n = mp->size;
> +	unsigned int size, i;
> +
> +	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
> +
> +	/* Allocate our local memory structure */
> +	s = rte_zmalloc_socket("mempool-nb_stack",
> +			       size,
> +			       RTE_CACHE_LINE_SIZE,
> +			       mp->socket_id);
> +	if (s == NULL) {
> +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
> +		return -ENOMEM;
> +	}
> +
> +	s->size = n;
> +
> +	nb_lifo_init(&s->used_lifo);
> +	nb_lifo_init(&s->free_lifo);
> +
> +	elems = (struct nb_lifo_elem *)&s[1];
> +	for (i = 0; i < n; i++)
> +		nb_lifo_push_single(&s->free_lifo, &elems[i]);
> +
> +	mp->pool_data = s;
> +
> +	return 0;
> +}
> +
> +static int
> +nb_stack_enqueue(struct rte_mempool *mp, void * const *obj_table,
> +		 unsigned int n)
> +{
> +	struct rte_mempool_nb_stack *s = mp->pool_data;
> +	struct nb_lifo_elem *first, *last, *tmp;
> +	unsigned int i;
> +
> +	if (unlikely(n == 0))
> +		return 0;
> +
> +	/* Pop n free elements */
> +	first = nb_lifo_pop(&s->free_lifo, n, NULL, NULL);
> +	if (unlikely(first == NULL))
> +		return -ENOBUFS;
> +
> +	/* Prepare the list elements */
> +	tmp = first;
> +	for (i = 0; i < n; i++) {
> +		tmp->data = obj_table[i];
> +		last = tmp;
> +		tmp = tmp->next;
> +	}
> +
> +	/* Enqueue them to the used list */
> +	nb_lifo_push(&s->used_lifo, first, last, n);
> +
> +	return 0;
> +}
> +
> +static int
> +nb_stack_dequeue(struct rte_mempool *mp, void **obj_table,
> +		 unsigned int n)
> +{
> +	struct rte_mempool_nb_stack *s = mp->pool_data;
> +	struct nb_lifo_elem *first, *last;
> +
> +	if (unlikely(n == 0))
> +		return 0;
> +
> +	/* Pop n used elements */
> +	first = nb_lifo_pop(&s->used_lifo, n, obj_table, &last);
> +	if (unlikely(first == NULL))
> +		return -ENOENT;
> +
> +	/* Enqueue the list elements to the free list */
> +	nb_lifo_push(&s->free_lifo, first, last, n);
> +
> +	return 0;
> +}
> +
> +static unsigned
> +nb_stack_get_count(const struct rte_mempool *mp)
> +{
> +	struct rte_mempool_nb_stack *s = mp->pool_data;
> +
> +	return nb_lifo_len(&s->used_lifo);
> +}
> +
> +static void
> +nb_stack_free(struct rte_mempool *mp)
> +{
> +	rte_free(mp->pool_data);
> +}
> +
> +static struct rte_mempool_ops ops_nb_stack = {
> +	.name = "nb_stack",
> +	.alloc = nb_stack_alloc,
> +	.free = nb_stack_free,
> +	.enqueue = nb_stack_enqueue,
> +	.dequeue = nb_stack_dequeue,
> +	.get_count = nb_stack_get_count
> +};
> +
> +MEMPOOL_REGISTER_OPS(ops_nb_stack);
> diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> new file mode 100644
> index 000000000..fc8c95e91
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> @@ -0,0 +1,4 @@
> +DPDK_19.05 {
> +
> +	local: *;
> +};
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk
> index 02e8b6f05..d4b4aaaf6 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -131,8 +131,11 @@ endif
>  ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
>  # plugins (link only if static libraries)
>
> -_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET) += -lrte_mempool_bucket
> -_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
> +_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET)   += -lrte_mempool_bucket
> +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> +_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += -lrte_mempool_nb_stack
> +endif
> +_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)    += -lrte_mempool_stack
>  ifeq ($(CONFIG_RTE_LIBRTE_DPAA_BUS),y)
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA_MEMPOOL)   += -lrte_mempool_dpaa
>  endif
> --
> 2.13.6

IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
@ 2019-01-17  8:49     ` Gavin Hu (Arm Technology China)
  2019-01-17 15:14       ` Eads, Gage
  0 siblings, 1 reply; 43+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-01-17  8:49 UTC (permalink / raw)
  To: Gage Eads, dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	Honnappa Nagarahalli



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> Sent: Wednesday, January 16, 2019 6:33 AM
> To: dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com
> Subject: [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only)
>
> This operation can be used for non-blocking algorithms, such as a
> non-blocking stack or ring.
>
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  .../common/include/arch/x86/rte_atomic_64.h        | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
>
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> index fd2ec9c53..34c2addf8 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> @@ -34,6 +34,7 @@
>  /*
>   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
>   * Copyright (c) 1998 Doug Rabson
> + * Copyright (c) 2019 Intel Corporation
>   * All rights reserved.
>   */
>
> @@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
>  }
>  #endif
>
> +static inline int
> +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
> +{
> +	uint8_t res;
> +
> +	asm volatile (
> +		      MPLOCKED
> +		      "cmpxchg16b %[dst];"
> +		      " sete %[res]"
> +		      : [dst] "=m" (*dst),
> +			[res] "=r" (res)
> +		      : "c" (src[1]),
> +			"b" (src[0]),
> +			"m" (*dst),
> +			"d" (exp[1]),
> +			"a" (exp[0])
> +		      : "memory");
> +
> +	return res;
> +}
> +

CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y can't coexist with RTE_FORCE_INTRINSICS=y; this should be explicitly described somewhere in the configuration and documentation.

>  #endif /* _RTE_ATOMIC_X86_64_H_ */
> --
> 2.13.6


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-17  8:06     ` Gavin Hu (Arm Technology China)
@ 2019-01-17 14:11       ` Eads, Gage
  2019-01-17 14:20         ` Bruce Richardson
  0 siblings, 1 reply; 43+ messages in thread
From: Eads, Gage @ 2019-01-17 14:11 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China)



> -----Original Message-----
> From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> Sent: Thursday, January 17, 2019 2:06 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking
> stack mempool
> 
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > Sent: Wednesday, January 16, 2019 6:33 AM
> > To: dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > bruce.richardson@intel.com; konstantin.ananyev@intel.com
> > Subject: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking
> > stack mempool
> >
> > This commit adds support for non-blocking (linked list based) stack
> > mempool handler. The stack uses a 128-bit compare-and-swap
> > instruction, and thus is limited to x86_64. The 128-bit CAS atomically
> > updates the stack top pointer and a modification counter, which
> > protects against the ABA problem.
> >
> > In mempool_perf_autotest the lock-based stack outperforms the
> > non-blocking handler*, however:
> > - For applications with preemptible pthreads, a lock-based stack's
> >   worst-case performance (i.e. one thread being preempted while
> >   holding the spinlock) is much worse than the non-blocking stack's.
> > - Using per-thread mempool caches will largely mitigate the performance
> >   difference.
> >
> > *Test setup: x86_64 build with default config, dual-socket Xeon
> > E5-2699 v4, running on isolcpus cores with a tickless scheduler. The
> > lock-based stack's rate_persec was 1x-3.5x the non-blocking stack's.
> >
> > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > ---
> >  MAINTAINERS                                        |   4 +
> >  config/common_base                                 |   1 +
> >  doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
> >  drivers/mempool/Makefile                           |   3 +
> >  drivers/mempool/meson.build                        |   5 +
> >  drivers/mempool/nb_stack/Makefile                  |  23 ++++
> >  drivers/mempool/nb_stack/meson.build               |   4 +
> >  drivers/mempool/nb_stack/nb_lifo.h                 | 147 +++++++++++++++
> >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 ++++++++++++++
> >  .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
> >  mk/rte.app.mk                                      |   7 +-
> >  11 files changed, 326 insertions(+), 2 deletions(-)
> >  create mode 100644 drivers/mempool/nb_stack/Makefile
> >  create mode 100644 drivers/mempool/nb_stack/meson.build
> >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS
> > index 470f36b9c..5519d3323 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
> >  M: Andrew Rybchenko <arybchenko@solarflare.com>
> >  F: drivers/mempool/bucket/
> >
> > +Non-blocking stack memory pool
> > +M: Gage Eads <gage.eads@intel.com>
> > +F: drivers/mempool/nb_stack/
> > +
> >
> >  Bus Drivers
> >  -----------
> > diff --git a/config/common_base b/config/common_base
> > index 964a6956e..8a51f36b1 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
> >  #
> >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> 
> NAK,  as this applies to x86_64 only, it will break arm/ppc and even 32bit i386
> configurations.
> 

Hi Gavin,

This patch resolves that in the make and meson build files, which ensure that the library is only built for x86-64 targets:

diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
index 28c2e8360..895cf8a34 100644
--- a/drivers/mempool/Makefile
+++ b/drivers/mempool/Makefile
@@ -10,6 +10,9 @@ endif
 ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
 DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2
 endif
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
+endif

diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
new file mode 100644
index 000000000..4a699511d
--- /dev/null
+++ b/drivers/mempool/nb_stack/meson.build
@@ -0,0 +1,8 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+if arch_subdir != 'x86' or cc.sizeof('void *') == 4
+	build = false
+endif
+
+sources = files('rte_mempool_nb_stack.c')

(Note: this code was pulled from the v3 patch)

You can see successful 32-bit builds at the dpdk-test-report here: http://mails.dpdk.org/archives/test-report/2019-January/073636.html


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-17 14:11       ` Eads, Gage
@ 2019-01-17 14:20         ` Bruce Richardson
  2019-01-17 15:16           ` Eads, Gage
  0 siblings, 1 reply; 43+ messages in thread
From: Bruce Richardson @ 2019-01-17 14:20 UTC (permalink / raw)
  To: Eads, Gage
  Cc: Gavin Hu (Arm Technology China),
	dev, olivier.matz, arybchenko, Ananyev, Konstantin,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China)

On Thu, Jan 17, 2019 at 02:11:22PM +0000, Eads, Gage wrote:
> 
> 
> > -----Original Message-----
> > From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> > Sent: Thursday, January 17, 2019 2:06 AM
> > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> > <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> > <Phil.Yang@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking
> > stack mempool
> > 
> > 
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > > Sent: Wednesday, January 16, 2019 6:33 AM
> > > To: dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > bruce.richardson@intel.com; konstantin.ananyev@intel.com
> > > Subject: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking
> > > stack mempool
> > >
> > > This commit adds support for non-blocking (linked list based) stack
> > > mempool handler. The stack uses a 128-bit compare-and-swap
> > > instruction, and thus is limited to x86_64. The 128-bit CAS atomically
> > > updates the stack top pointer and a modification counter, which
> > > protects against the ABA problem.
> > >
> > > In mempool_perf_autotest the lock-based stack outperforms the non-
> > > blocking handler*, however:
> > > - For applications with preemptible pthreads, a lock-based stack's
> > >   worst-case performance (i.e. one thread being preempted while
> > >   holding the spinlock) is much worse than the non-blocking stack's.
> > > - Using per-thread mempool caches will largely mitigate the performance
> > >   difference.
> > >
> > > *Test setup: x86_64 build with default config, dual-socket Xeon
> > > E5-2699 v4, running on isolcpus cores with a tickless scheduler. The
> > > lock-based stack's rate_persec was 1x-3.5x the non-blocking stack's.
> > >
> > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > ---
> > >  MAINTAINERS                                        |   4 +
> > >  config/common_base                                 |   1 +
> > >  doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
> > >  drivers/mempool/Makefile                           |   3 +
> > >  drivers/mempool/meson.build                        |   5 +
> > >  drivers/mempool/nb_stack/Makefile                  |  23 ++++
> > >  drivers/mempool/nb_stack/meson.build               |   4 +
> > >  drivers/mempool/nb_stack/nb_lifo.h                 | 147
> > > +++++++++++++++++++++
> > >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125
> > > ++++++++++++++++++
> > >  .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
> > >  mk/rte.app.mk                                      |   7 +-
> > >  11 files changed, 326 insertions(+), 2 deletions(-)  create mode
> > > 100644 drivers/mempool/nb_stack/Makefile  create mode 100644
> > > drivers/mempool/nb_stack/meson.build
> > >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> > >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > >  create mode 100644
> > > drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS index 470f36b9c..5519d3323
> > > 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
> > >  M: Andrew Rybchenko <arybchenko@solarflare.com>
> > >  F: drivers/mempool/bucket/
> > >
> > > +Non-blocking stack memory pool
> > > +M: Gage Eads <gage.eads@intel.com>
> > > +F: drivers/mempool/nb_stack/
> > > +
> > >
> > >  Bus Drivers
> > >  -----------
> > > diff --git a/config/common_base b/config/common_base index
> > > 964a6956e..8a51f36b1 100644
> > > --- a/config/common_base
> > > +++ b/config/common_base
> > > @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n  #
> > > CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> > >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> > > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> > 
> > NAK,  as this applies to x86_64 only, it will break arm/ppc and even 32bit i386
> > configurations.
> > 
> 
> Hi Gavin,
> 
> This patch resolves that in the make and meson build files, which ensure that the library is only built for x86-64 targets:
> 
> diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
> index 28c2e8360..895cf8a34 100644
> --- a/drivers/mempool/Makefile
> +++ b/drivers/mempool/Makefile
> @@ -10,6 +10,9 @@ endif
>  ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
>  DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2
>  endif
> +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
> +endif
> 
> diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
> new file mode 100644
> index 000000000..4a699511d
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/meson.build
> @@ -0,0 +1,8 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(c) 2019 Intel Corporation
> +
> +if arch_subdir != 'x86' or cc.sizeof('void *') == 4
> +	build = false
> +endif
> +

Minor suggestion: 
Can be simplified to "build = dpdk_conf.has('RTE_ARCH_X86_64')", I believe.

/Bruce

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-17  8:49     ` Gavin Hu (Arm Technology China)
@ 2019-01-17 15:14       ` Eads, Gage
  2019-01-17 15:57         ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 43+ messages in thread
From: Eads, Gage @ 2019-01-17 15:14 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	Honnappa Nagarahalli



> -----Original Message-----
> From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> Sent: Thursday, January 17, 2019 2:49 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> 
> 
> > -----Original Message-----
> > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > Sent: Wednesday, January 16, 2019 6:33 AM
> > To: dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > bruce.richardson@intel.com; konstantin.ananyev@intel.com
> > Subject: [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64
> > only)
> >
> > This operation can be used for non-blocking algorithms, such as a
> > non-blocking stack or ring.
> >
> > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > ---
> >  .../common/include/arch/x86/rte_atomic_64.h        | 22
> > ++++++++++++++++++++++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > index fd2ec9c53..34c2addf8 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > @@ -34,6 +34,7 @@
> >  /*
> >   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
> >   * Copyright (c) 1998 Doug Rabson
> > + * Copyright (c) 2019 Intel Corporation
> >   * All rights reserved.
> >   */
> >
> > @@ -208,4 +209,25 @@ static inline void
> > rte_atomic64_clear(rte_atomic64_t *v)  }  #endif
> >
> > +static inline int
> > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t
> > *src)
> > +{
> > +uint8_t res;
> > +
> > +asm volatile (
> > +      MPLOCKED
> > +      "cmpxchg16b %[dst];"
> > +      " sete %[res]"
> > +      : [dst] "=m" (*dst),
> > +[res] "=r" (res)
> > +      : "c" (src[1]),
> > +"b" (src[0]),
> > +"m" (*dst),
> > +"d" (exp[1]),
> > +"a" (exp[0])
> > +      : "memory");
> > +
> > +return res;
> > +}
> > +
> 
> CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y can't coexist with
> RTE_FORCE_INTRINSICS=y, this should be explicitly described somewhere in the
> configuration and documentations.
> 

This patch places rte_atomic128_cmpset() outside of the RTE_FORCE_INTRINSICS ifndef, and this file is included regardless of that config flag, so it's compiled either way.

> >  #endif /* _RTE_ATOMIC_X86_64_H_ */
> > --
> > 2.13.6
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-17 14:20         ` Bruce Richardson
@ 2019-01-17 15:16           ` Eads, Gage
  2019-01-17 15:42             ` Gavin Hu (Arm Technology China)
  0 siblings, 1 reply; 43+ messages in thread
From: Eads, Gage @ 2019-01-17 15:16 UTC (permalink / raw)
  To: Richardson, Bruce
  Cc: Gavin Hu (Arm Technology China),
	dev, olivier.matz, arybchenko, Ananyev, Konstantin,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China)



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Thursday, January 17, 2019 8:21 AM
> To: Eads, Gage <gage.eads@intel.com>
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> olivier.matz@6wind.com; arybchenko@solarflare.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking
> stack mempool
> 
> On Thu, Jan 17, 2019 at 02:11:22PM +0000, Eads, Gage wrote:
> >
> >
> > > -----Original Message-----
> > > From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> > > Sent: Thursday, January 17, 2019 2:06 AM
> > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> > > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> > > <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> > > <Phil.Yang@arm.com>
> > > Subject: RE: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add
> > > non-blocking stack mempool
> > >
> > >
> > > > -----Original Message-----
> > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > > > Sent: Wednesday, January 16, 2019 6:33 AM
> > > > To: dev@dpdk.org
> > > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > > bruce.richardson@intel.com; konstantin.ananyev@intel.com
> > > > Subject: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add
> > > > non-blocking stack mempool
> > > >
> > > > This commit adds support for non-blocking (linked list based)
> > > > stack mempool handler. The stack uses a 128-bit compare-and-swap
> > > > instruction, and thus is limited to x86_64. The 128-bit CAS
> > > > atomically updates the stack top pointer and a modification
> > > > counter, which protects against the ABA problem.
> > > >
> > > > In mempool_perf_autotest the lock-based stack outperforms the non-
> > > > blocking handler*, however:
> > > > - For applications with preemptible pthreads, a lock-based stack's
> > > >   worst-case performance (i.e. one thread being preempted while
> > > >   holding the spinlock) is much worse than the non-blocking stack's.
> > > > - Using per-thread mempool caches will largely mitigate the performance
> > > >   difference.
> > > >
> > > > *Test setup: x86_64 build with default config, dual-socket Xeon
> > > > E5-2699 v4, running on isolcpus cores with a tickless scheduler.
> > > > The lock-based stack's rate_persec was 1x-3.5x the non-blocking stack's.
> > > >
> > > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > > ---
> > > >  MAINTAINERS                                        |   4 +
> > > >  config/common_base                                 |   1 +
> > > >  doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
> > > >  drivers/mempool/Makefile                           |   3 +
> > > >  drivers/mempool/meson.build                        |   5 +
> > > >  drivers/mempool/nb_stack/Makefile                  |  23 ++++
> > > >  drivers/mempool/nb_stack/meson.build               |   4 +
> > > >  drivers/mempool/nb_stack/nb_lifo.h                 | 147
> > > > +++++++++++++++++++++
> > > >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125
> > > > ++++++++++++++++++
> > > >  .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
> > > >  mk/rte.app.mk                                      |   7 +-
> > > >  11 files changed, 326 insertions(+), 2 deletions(-)  create mode
> > > > 100644 drivers/mempool/nb_stack/Makefile  create mode 100644
> > > > drivers/mempool/nb_stack/meson.build
> > > >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> > > >  create mode 100644
> > > > drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > > >  create mode 100644
> > > > drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> > > >
> > > > diff --git a/MAINTAINERS b/MAINTAINERS index 470f36b9c..5519d3323
> > > > 100644
> > > > --- a/MAINTAINERS
> > > > +++ b/MAINTAINERS
> > > > @@ -416,6 +416,10 @@ M: Artem V. Andreev
> > > > <artem.andreev@oktetlabs.ru>
> > > >  M: Andrew Rybchenko <arybchenko@solarflare.com>
> > > >  F: drivers/mempool/bucket/
> > > >
> > > > +Non-blocking stack memory pool
> > > > +M: Gage Eads <gage.eads@intel.com>
> > > > +F: drivers/mempool/nb_stack/
> > > > +
> > > >
> > > >  Bus Drivers
> > > >  -----------
> > > > diff --git a/config/common_base b/config/common_base index
> > > > 964a6956e..8a51f36b1 100644
> > > > --- a/config/common_base
> > > > +++ b/config/common_base
> > > > @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n  #
> > > > CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> > > >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> > > > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> > >
> > > NAK,  as this applies to x86_64 only, it will break arm/ppc and even
> > > 32bit i386 configurations.
> > >
> >
> > Hi Gavin,
> >
> > This patch resolves that in the make and meson build files, which ensure that
> the library is only built for x86-64 targets:
> >
> > diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile index
> > 28c2e8360..895cf8a34 100644
> > --- a/drivers/mempool/Makefile
> > +++ b/drivers/mempool/Makefile
> > @@ -10,6 +10,9 @@ endif
> >  ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
> >  DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2  endif
> > +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> > +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack endif
> >
> > diff --git a/drivers/mempool/nb_stack/meson.build
> > b/drivers/mempool/nb_stack/meson.build
> > new file mode 100644
> > index 000000000..4a699511d
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/meson.build
> > @@ -0,0 +1,8 @@
> > +# SPDX-License-Identifier: BSD-3-Clause # Copyright(c) 2019 Intel
> > +Corporation
> > +
> > +if arch_subdir != 'x86' or cc.sizeof('void *') == 4
> > +	build = false
> > +endif
> > +
> 
> Minor suggestion:
> Can be simplified to "build = dpdk_conf.has('RTE_ARCH_X86_64')", I believe.
> 
> /Bruce

Sure, I'll switch to that check in v4.

Thanks,
Gage

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [dpdk-dev] [PATCH v4 0/2] Add non-blocking stack mempool handler
  2019-01-16 15:18   ` [dpdk-dev] [PATCH v3 0/2] Add non-blocking stack mempool handler Gage Eads
  2019-01-16 15:18     ` [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
  2019-01-16 15:18     ` [dpdk-dev] [PATCH v3 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
@ 2019-01-17 15:36     ` Gage Eads
  2019-01-17 15:36       ` [dpdk-dev] [PATCH v4 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
  2019-01-17 15:36       ` [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
  2 siblings, 2 replies; 43+ messages in thread
From: Gage Eads @ 2019-01-17 15:36 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, gavin.hu

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking stack mempool handler. Note that the
non-blocking algorithm relies on a 128-bit compare-and-swap, so it is limited
to x86_64 machines.

In mempool_perf_autotest the lock-based stack outperforms the non-blocking
handler*, however:
- For applications with preemptible pthreads, a lock-based stack's
  worst-case performance (i.e. one thread being preempted while
  holding the spinlock) is much worse than the non-blocking stack's.
- Using per-thread mempool caches will largely mitigate the performance
  difference.

*Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. The lock-based stack's
rate_persec was 1x-3.5x the non-blocking stack's.

v4:
 - Simplified the meson.build x86_64 check

v3:
 - Fix two more space-after-typecast issues
 - Rework nb_stack's meson.build x86_64 check, borrowing from net/sfc/

v2:
 - Merge separate docs commit into patch #2
 - Fix two space-after-typecast issues
 - Fix alphabetical sorting for build files
 - Remove unnecessary include path from nb_stack/Makefile
 - Add a comment to nb_lifo_len() justifying its approximate behavior
 - Fix comparison with NULL
 - Remove unnecessary void * cast
 - Fix meson builds and limit them to x86_64
 - Fix missing library error for non-x86_64 builds

Gage Eads (2):
  eal: add 128-bit cmpset (x86-64 only)
  mempool/nb_stack: add non-blocking stack mempool

 MAINTAINERS                                        |   4 +
 config/common_base                                 |   1 +
 doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
 drivers/mempool/Makefile                           |   3 +
 drivers/mempool/meson.build                        |   3 +-
 drivers/mempool/nb_stack/Makefile                  |  23 ++++
 drivers/mempool/nb_stack/meson.build               |   6 +
 drivers/mempool/nb_stack/nb_lifo.h                 | 147 +++++++++++++++++++++
 drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 ++++++++++++++++++
 .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
 .../common/include/arch/x86/rte_atomic_64.h        |  22 +++
 mk/rte.app.mk                                      |   7 +-
 12 files changed, 347 insertions(+), 3 deletions(-)
 create mode 100644 drivers/mempool/nb_stack/Makefile
 create mode 100644 drivers/mempool/nb_stack/meson.build
 create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map

-- 
2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [dpdk-dev] [PATCH v4 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-17 15:36     ` [dpdk-dev] [PATCH v4 0/2] Add non-blocking stack mempool handler Gage Eads
@ 2019-01-17 15:36       ` Gage Eads
  2019-01-17 15:36       ` [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
  1 sibling, 0 replies; 43+ messages in thread
From: Gage Eads @ 2019-01-17 15:36 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, gavin.hu

This operation can be used for non-blocking algorithms, such as a
non-blocking stack or ring.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 .../common/include/arch/x86/rte_atomic_64.h        | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
index fd2ec9c53..34c2addf8 100644
--- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
+++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
@@ -34,6 +34,7 @@
 /*
  * Inspired from FreeBSD src/sys/amd64/include/atomic.h
  * Copyright (c) 1998 Doug Rabson
+ * Copyright (c) 2019 Intel Corporation
  * All rights reserved.
  */
 
@@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t *v)
 }
 #endif
 
+static inline int
+rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t *src)
+{
+	uint8_t res;
+
+	asm volatile (
+		      MPLOCKED
+		      "cmpxchg16b %[dst];"
+		      " sete %[res]"
+		      : [dst] "=m" (*dst),
+			[res] "=r" (res)
+		      : "c" (src[1]),
+			"b" (src[0]),
+			"m" (*dst),
+			"d" (exp[1]),
+			"a" (exp[0])
+		      : "memory");
+
+	return res;
+}
+
 #endif /* _RTE_ATOMIC_X86_64_H_ */
-- 
2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-17 15:36     ` [dpdk-dev] [PATCH v4 0/2] Add non-blocking stack mempool handler Gage Eads
  2019-01-17 15:36       ` [dpdk-dev] [PATCH v4 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
@ 2019-01-17 15:36       ` Gage Eads
  2019-01-18  5:05         ` Honnappa Nagarahalli
  1 sibling, 1 reply; 43+ messages in thread
From: Gage Eads @ 2019-01-17 15:36 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, gavin.hu

This commit adds support for a non-blocking (linked list based) stack mempool
handler. The stack uses a 128-bit compare-and-swap instruction, and thus is
limited to x86_64. The 128-bit CAS atomically updates the stack top pointer
and a modification counter, which protects against the ABA problem.

In mempool_perf_autotest the lock-based stack outperforms the non-blocking
handler*, however:
- For applications with preemptible pthreads, a lock-based stack's
  worst-case performance (i.e. one thread being preempted while
  holding the spinlock) is much worse than the non-blocking stack's.
- Using per-thread mempool caches will largely mitigate the performance
  difference.

*Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. The lock-based stack's
rate_persec was 1x-3.5x the non-blocking stack's.

Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
---
 MAINTAINERS                                        |   4 +
 config/common_base                                 |   1 +
 doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
 drivers/mempool/Makefile                           |   3 +
 drivers/mempool/meson.build                        |   3 +-
 drivers/mempool/nb_stack/Makefile                  |  23 ++++
 drivers/mempool/nb_stack/meson.build               |   6 +
 drivers/mempool/nb_stack/nb_lifo.h                 | 147 +++++++++++++++++++++
 drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125 ++++++++++++++++++
 .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
 mk/rte.app.mk                                      |   7 +-
 11 files changed, 325 insertions(+), 3 deletions(-)
 create mode 100644 drivers/mempool/nb_stack/Makefile
 create mode 100644 drivers/mempool/nb_stack/meson.build
 create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
 create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map

diff --git a/MAINTAINERS b/MAINTAINERS
index 470f36b9c..5519d3323 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
 M: Andrew Rybchenko <arybchenko@solarflare.com>
 F: drivers/mempool/bucket/
 
+Non-blocking stack memory pool
+M: Gage Eads <gage.eads@intel.com>
+F: drivers/mempool/nb_stack/
+
 
 Bus Drivers
 -----------
diff --git a/config/common_base b/config/common_base
index 964a6956e..8a51f36b1 100644
--- a/config/common_base
+++ b/config/common_base
@@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
 #
 CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
 CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
+CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
 CONFIG_RTE_DRIVER_MEMPOOL_RING=y
 CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
 
diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 929d76dba..9497b879c 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,6 +541,11 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
+  Alternatively, x86_64 applications can use the non-blocking stack mempool handler. When considering this handler, note that:
+
+  - it is limited to the x86_64 platform, because it uses an instruction (16-byte compare-and-swap) that is not available on other platforms.
+  - it has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of handler operations.
+
 + rte_timer
 
   Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
index 28c2e8360..895cf8a34 100644
--- a/drivers/mempool/Makefile
+++ b/drivers/mempool/Makefile
@@ -10,6 +10,9 @@ endif
 ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
 DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2
 endif
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack
+endif
 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
 DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
 DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx
diff --git a/drivers/mempool/meson.build b/drivers/mempool/meson.build
index 4527d9806..220cfaf63 100644
--- a/drivers/mempool/meson.build
+++ b/drivers/mempool/meson.build
@@ -1,7 +1,8 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
-drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
+drivers = ['bucket', 'dpaa', 'dpaa2', 'nb_stack', 'octeontx', 'ring', 'stack']
+
 std_deps = ['mempool']
 config_flag_fmt = 'RTE_LIBRTE_@0@_MEMPOOL'
 driver_name_fmt = 'rte_mempool_@0@'
diff --git a/drivers/mempool/nb_stack/Makefile b/drivers/mempool/nb_stack/Makefile
new file mode 100644
index 000000000..318b18283
--- /dev/null
+++ b/drivers/mempool/nb_stack/Makefile
@@ -0,0 +1,23 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+include $(RTE_SDK)/mk/rte.vars.mk
+
+#
+# library name
+#
+LIB = librte_mempool_nb_stack.a
+
+CFLAGS += -O3
+CFLAGS += $(WERROR_FLAGS)
+
+# Headers
+LDLIBS += -lrte_eal -lrte_mempool
+
+EXPORT_MAP := rte_mempool_nb_stack_version.map
+
+LIBABIVER := 1
+
+SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += rte_mempool_nb_stack.c
+
+include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/drivers/mempool/nb_stack/meson.build b/drivers/mempool/nb_stack/meson.build
new file mode 100644
index 000000000..7dec72242
--- /dev/null
+++ b/drivers/mempool/nb_stack/meson.build
@@ -0,0 +1,6 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2019 Intel Corporation
+
+build = dpdk_conf.has('RTE_ARCH_X86_64')
+
+sources = files('rte_mempool_nb_stack.c')
diff --git a/drivers/mempool/nb_stack/nb_lifo.h b/drivers/mempool/nb_stack/nb_lifo.h
new file mode 100644
index 000000000..ad4a3401f
--- /dev/null
+++ b/drivers/mempool/nb_stack/nb_lifo.h
@@ -0,0 +1,147 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#ifndef _NB_LIFO_H_
+#define _NB_LIFO_H_
+
+struct nb_lifo_elem {
+	void *data;
+	struct nb_lifo_elem *next;
+};
+
+struct nb_lifo_head {
+	struct nb_lifo_elem *top; /**< Stack top */
+	uint64_t cnt; /**< Modification counter */
+};
+
+struct nb_lifo {
+	volatile struct nb_lifo_head head __rte_aligned(16);
+	rte_atomic64_t len;
+} __rte_cache_aligned;
+
+static __rte_always_inline void
+nb_lifo_init(struct nb_lifo *lifo)
+{
+	memset(lifo, 0, sizeof(*lifo));
+	rte_atomic64_set(&lifo->len, 0);
+}
+
+static __rte_always_inline unsigned int
+nb_lifo_len(struct nb_lifo *lifo)
+{
+	/* nb_lifo_push() and nb_lifo_pop() do not update the list's contents
+	 * and lifo->len atomically, which can cause the list to appear shorter
+	 * than it actually is if this function is called while other threads
+	 * are modifying the list.
+	 *
+	 * However, given the inherently approximate nature of the get_count
+	 * callback -- even if the list and its size were updated atomically,
+	 * the size could change between when get_count executes and when the
+	 * value is returned to the caller -- this is acceptable.
+	 *
+	 * The lifo->len updates are placed such that the list may appear to
+	 * have fewer elements than it does, but will never appear to have more
+	 * elements. If the mempool is near-empty to the point that this is a
+	 * concern, the user should consider increasing the mempool size.
+	 */
+	return (unsigned int)rte_atomic64_read(&lifo->len);
+}
+
+static __rte_always_inline void
+nb_lifo_push(struct nb_lifo *lifo,
+	     struct nb_lifo_elem *first,
+	     struct nb_lifo_elem *last,
+	     unsigned int num)
+{
+	while (1) {
+		struct nb_lifo_head old_head, new_head;
+
+		old_head = lifo->head;
+
+		/* Swing the top pointer to the first element in the list and
+		 * make the last element point to the old top.
+		 */
+		new_head.top = first;
+		new_head.cnt = old_head.cnt + 1;
+
+		last->next = old_head.top;
+
+		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
+					 (uint64_t *)&old_head,
+					 (uint64_t *)&new_head))
+			break;
+	}
+
+	rte_atomic64_add(&lifo->len, num);
+}
+
+static __rte_always_inline void
+nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
+{
+	nb_lifo_push(lifo, elem, elem, 1);
+}
+
+static __rte_always_inline struct nb_lifo_elem *
+nb_lifo_pop(struct nb_lifo *lifo,
+	    unsigned int num,
+	    void **obj_table,
+	    struct nb_lifo_elem **last)
+{
+	struct nb_lifo_head old_head;
+
+	/* Reserve num elements, if available */
+	while (1) {
+		uint64_t len = rte_atomic64_read(&lifo->len);
+
+		/* Does the list contain enough elements? */
+		if (len < num)
+			return NULL;
+
+		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
+					len, len - num))
+			break;
+	}
+
+	/* Pop num elements */
+	while (1) {
+		struct nb_lifo_head new_head;
+		struct nb_lifo_elem *tmp;
+		unsigned int i;
+
+		old_head = lifo->head;
+
+		tmp = old_head.top;
+
+		/* Traverse the list to find the new head. A next pointer will
+		 * either point to another element or NULL; if a thread
+		 * encounters a pointer that has already been popped, the CAS
+		 * will fail.
+		 */
+		for (i = 0; i < num && tmp != NULL; i++) {
+			if (obj_table)
+				obj_table[i] = tmp->data;
+			if (last)
+				*last = tmp;
+			tmp = tmp->next;
+		}
+
+		/* If NULL was encountered, the list was modified while
+		 * traversing it. Retry.
+		 */
+		if (i != num)
+			continue;
+
+		new_head.top = tmp;
+		new_head.cnt = old_head.cnt + 1;
+
+		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
+					 (uint64_t *)&old_head,
+					 (uint64_t *)&new_head))
+			break;
+	}
+
+	return old_head.top;
+}
+
+#endif /* _NB_LIFO_H_ */
diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
new file mode 100644
index 000000000..1818a2cfa
--- /dev/null
+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2019 Intel Corporation
+ */
+
+#include <stdio.h>
+#include <rte_mempool.h>
+#include <rte_malloc.h>
+
+#include "nb_lifo.h"
+
+struct rte_mempool_nb_stack {
+	uint64_t size;
+	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers */
+	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
+};
+
+static int
+nb_stack_alloc(struct rte_mempool *mp)
+{
+	struct rte_mempool_nb_stack *s;
+	struct nb_lifo_elem *elems;
+	unsigned int n = mp->size;
+	unsigned int size, i;
+
+	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
+
+	/* Allocate our local memory structure */
+	s = rte_zmalloc_socket("mempool-nb_stack",
+			       size,
+			       RTE_CACHE_LINE_SIZE,
+			       mp->socket_id);
+	if (s == NULL) {
+		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
+		return -ENOMEM;
+	}
+
+	s->size = n;
+
+	nb_lifo_init(&s->used_lifo);
+	nb_lifo_init(&s->free_lifo);
+
+	elems = (struct nb_lifo_elem *)&s[1];
+	for (i = 0; i < n; i++)
+		nb_lifo_push_single(&s->free_lifo, &elems[i]);
+
+	mp->pool_data = s;
+
+	return 0;
+}
+
+static int
+nb_stack_enqueue(struct rte_mempool *mp, void * const *obj_table,
+		 unsigned int n)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+	struct nb_lifo_elem *first, *last, *tmp;
+	unsigned int i;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Pop n free elements */
+	first = nb_lifo_pop(&s->free_lifo, n, NULL, NULL);
+	if (unlikely(first == NULL))
+		return -ENOBUFS;
+
+	/* Prepare the list elements */
+	tmp = first;
+	for (i = 0; i < n; i++) {
+		tmp->data = obj_table[i];
+		last = tmp;
+		tmp = tmp->next;
+	}
+
+	/* Enqueue them to the used list */
+	nb_lifo_push(&s->used_lifo, first, last, n);
+
+	return 0;
+}
+
+static int
+nb_stack_dequeue(struct rte_mempool *mp, void **obj_table,
+		 unsigned int n)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+	struct nb_lifo_elem *first, *last;
+
+	if (unlikely(n == 0))
+		return 0;
+
+	/* Pop n used elements */
+	first = nb_lifo_pop(&s->used_lifo, n, obj_table, &last);
+	if (unlikely(first == NULL))
+		return -ENOENT;
+
+	/* Enqueue the list elements to the free list */
+	nb_lifo_push(&s->free_lifo, first, last, n);
+
+	return 0;
+}
+
+static unsigned
+nb_stack_get_count(const struct rte_mempool *mp)
+{
+	struct rte_mempool_nb_stack *s = mp->pool_data;
+
+	return nb_lifo_len(&s->used_lifo);
+}
+
+static void
+nb_stack_free(struct rte_mempool *mp)
+{
+	rte_free(mp->pool_data);
+}
+
+static struct rte_mempool_ops ops_nb_stack = {
+	.name = "nb_stack",
+	.alloc = nb_stack_alloc,
+	.free = nb_stack_free,
+	.enqueue = nb_stack_enqueue,
+	.dequeue = nb_stack_dequeue,
+	.get_count = nb_stack_get_count
+};
+
+MEMPOOL_REGISTER_OPS(ops_nb_stack);
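[Editor's note: the element-recycling scheme in the enqueue/dequeue handlers above (all list elements are carved out once in nb_stack_alloc; every object move is a pop from one LIFO and a push onto the other, so nothing is allocated or freed on the data path) can be sketched as follows. This is a hypothetical, array-backed, non-concurrent model, not the driver's implementation.]

```c
#include <stddef.h>

#define POOL 4

/* "used" holds object pointers; "free" just counts spare elements.
 * Invariant: used_n + free_n == POOL at all times. */
struct pool {
	void *used[POOL];
	size_t used_n;
	size_t free_n;
};

static void pool_init(struct pool *p) { p->used_n = 0; p->free_n = POOL; }

static int enqueue(struct pool *p, void *obj)
{
	if (p->free_n == 0)
		return -1;              /* no spare element: -ENOBUFS */
	p->free_n--;                    /* pop from the free LIFO */
	p->used[p->used_n++] = obj;     /* push onto the used LIFO */
	return 0;
}

static void *dequeue(struct pool *p)
{
	if (p->used_n == 0)
		return NULL;            /* pool empty: -ENOENT */
	p->free_n++;                    /* recycle the element */
	return p->used[--p->used_n];    /* pop from the used LIFO */
}
```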
diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
new file mode 100644
index 000000000..fc8c95e91
--- /dev/null
+++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
@@ -0,0 +1,4 @@
+DPDK_19.05 {
+
+	local: *;
+};
diff --git a/mk/rte.app.mk b/mk/rte.app.mk
index 02e8b6f05..d4b4aaaf6 100644
--- a/mk/rte.app.mk
+++ b/mk/rte.app.mk
@@ -131,8 +131,11 @@ endif
 ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
 # plugins (link only if static libraries)
 
-_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET) += -lrte_mempool_bucket
-_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -lrte_mempool_stack
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET)   += -lrte_mempool_bucket
+ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += -lrte_mempool_nb_stack
+endif
+_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)    += -lrte_mempool_stack
 ifeq ($(CONFIG_RTE_LIBRTE_DPAA_BUS),y)
 _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA_MEMPOOL)   += -lrte_mempool_dpaa
 endif
-- 
2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-17 15:16           ` Eads, Gage
@ 2019-01-17 15:42             ` Gavin Hu (Arm Technology China)
  2019-01-17 20:41               ` Eads, Gage
  0 siblings, 1 reply; 43+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-01-17 15:42 UTC (permalink / raw)
  To: Eads, Gage, Richardson, Bruce
  Cc: dev, olivier.matz, arybchenko, Ananyev, Konstantin,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China)



> -----Original Message-----
> From: Eads, Gage <gage.eads@intel.com>
> Sent: Thursday, January 17, 2019 11:16 PM
> To: Richardson, Bruce <bruce.richardson@intel.com>
> Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org; olivier.matz@6wind.com; arybchenko@solarflare.com;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm
> Technology China) <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology
> China) <Phil.Yang@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-
> blocking stack mempool
>
>
>
> > -----Original Message-----
> > From: Richardson, Bruce
> > Sent: Thursday, January 17, 2019 8:21 AM
> > To: Eads, Gage <gage.eads@intel.com>
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org;
> > olivier.matz@6wind.com; arybchenko@solarflare.com; Ananyev,
> Konstantin
> > <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology
> China)
> > <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> > <Phil.Yang@arm.com>
> > Subject: Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-
> blocking
> > stack mempool
> >
> > On Thu, Jan 17, 2019 at 02:11:22PM +0000, Eads, Gage wrote:
> > >
> > >
> > > > -----Original Message-----
> > > > From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> > > > Sent: Thursday, January 17, 2019 2:06 AM
> > > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> > > > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> > > > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology
> China)
> > > > <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> > > > <Phil.Yang@arm.com>
> > > > Subject: RE: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add
> > > > non-blocking stack mempool
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > > > > Sent: Wednesday, January 16, 2019 6:33 AM
> > > > > To: dev@dpdk.org
> > > > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > > > bruce.richardson@intel.com; konstantin.ananyev@intel.com
> > > > > Subject: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add
> > > > > non-blocking stack mempool
> > > > >
> > > > > This commit adds support for non-blocking (linked list based)
> > > > > stack mempool handler. The stack uses a 128-bit compare-and-
> swap
> > > > > instruction, and thus is limited to x86_64. The 128-bit CAS
> > > > > atomically updates the stack top pointer and a modification
> > > > > counter, which protects against the ABA problem.
> > > > >
> > > > > In mempool_perf_autotest the lock-based stack outperforms the
> non-
> > > > > blocking handler*, however:
> > > > > - For applications with preemptible pthreads, a lock-based stack's
> > > > >   worst-case performance (i.e. one thread being preempted while
> > > > >   holding the spinlock) is much worse than the non-blocking stack's.
> > > > > - Using per-thread mempool caches will largely mitigate the
> performance
> > > > >   difference.
> > > > >
> > > > > *Test setup: x86_64 build with default config, dual-socket Xeon
> > > > > E5-2699 v4, running on isolcpus cores with a tickless scheduler.
> > > > > The lock-based stack's rate_persec was 1x-3.5x the non-blocking
> stack's.
> > > > >
> > > > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > > > ---
> > > > >  MAINTAINERS                                        |   4 +
> > > > >  config/common_base                                 |   1 +
> > > > >  doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
> > > > >  drivers/mempool/Makefile                           |   3 +
> > > > >  drivers/mempool/meson.build                        |   5 +
> > > > >  drivers/mempool/nb_stack/Makefile                  |  23 ++++
> > > > >  drivers/mempool/nb_stack/meson.build               |   4 +
> > > > >  drivers/mempool/nb_stack/nb_lifo.h                 | 147
> > > > > +++++++++++++++++++++
> > > > >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125
> > > > > ++++++++++++++++++
> > > > >  .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
> > > > >  mk/rte.app.mk                                      |   7 +-
> > > > >  11 files changed, 326 insertions(+), 2 deletions(-)  create mode
> > > > > 100644 drivers/mempool/nb_stack/Makefile  create mode 100644
> > > > > drivers/mempool/nb_stack/meson.build
> > > > >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> > > > >  create mode 100644
> > > > > drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > > > >  create mode 100644
> > > > > drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> > > > >
> > > > > diff --git a/MAINTAINERS b/MAINTAINERS index
> 470f36b9c..5519d3323
> > > > > 100644
> > > > > --- a/MAINTAINERS
> > > > > +++ b/MAINTAINERS
> > > > > @@ -416,6 +416,10 @@ M: Artem V. Andreev
> > > > > <artem.andreev@oktetlabs.ru>
> > > > >  M: Andrew Rybchenko <arybchenko@solarflare.com>
> > > > >  F: drivers/mempool/bucket/
> > > > >
> > > > > +Non-blocking stack memory pool
> > > > > +M: Gage Eads <gage.eads@intel.com>
> > > > > +F: drivers/mempool/nb_stack/
> > > > > +
> > > > >
> > > > >  Bus Drivers
> > > > >  -----------
> > > > > diff --git a/config/common_base b/config/common_base index
> > > > > 964a6956e..8a51f36b1 100644
> > > > > --- a/config/common_base
> > > > > +++ b/config/common_base
> > > > > @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
> #
> > > > > CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> > > > >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> > > > > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> > > >
> > > > NAK,  as this applies to x86_64 only, it will break arm/ppc and even
> > > > 32bit i386 configurations.
> > > >
> > >
> > > Hi Gavin,
> > >
> > > This patch resolves that in the make and meson build files, which
> ensure that
> > the library is only built for x86-64 targets:

Looking at the Makefile and meson.build changes, the driver will be compiled out for arm/ppc/i386, so the build at least works.
But having this entry in the arm/ppc/i386 configurations is strange, since they have no such implementation.
Why not put it into defconfig_x86_64-native-linuxapp-icc/gcc/clang to limit its scope?

> > >
> > > diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
> index
> > > 28c2e8360..895cf8a34 100644
> > > --- a/drivers/mempool/Makefile
> > > +++ b/drivers/mempool/Makefile
> > > @@ -10,6 +10,9 @@ endif
> > >  ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
> > >  DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2  endif
> > > +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> > > +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack endif
> > >
> > > diff --git a/drivers/mempool/nb_stack/meson.build
> > > b/drivers/mempool/nb_stack/meson.build
> > > new file mode 100644
> > > index 000000000..4a699511d
> > > --- /dev/null
> > > +++ b/drivers/mempool/nb_stack/meson.build
> > > @@ -0,0 +1,8 @@
> > > +# SPDX-License-Identifier: BSD-3-Clause # Copyright(c) 2019 Intel
> > > +Corporation
> > > +
> > > +if arch_subdir != 'x86' or cc.sizeof('void *') == 4
> > > +build = false
> > > +endif
> > > +
> >
> > Minor suggestion:
> > Can be simplified to "build = dpdk_conf.has('RTE_ARCH_X86_64')", I
> believe.
> >
> > /Bruce
>
> Sure, I'll switch to that check in v4.
>
> Thanks,
> Gage
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-16 15:18     ` [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
@ 2019-01-17 15:45       ` Honnappa Nagarahalli
  2019-01-17 23:03         ` Eads, Gage
  0 siblings, 1 reply; 43+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-17 15:45 UTC (permalink / raw)
  To: Gage Eads, dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	nd, Honnappa Nagarahalli, nd

> Subject: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> This operation can be used for non-blocking algorithms, such as a non-
> blocking stack or ring.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  .../common/include/arch/x86/rte_atomic_64.h        | 22
> ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> index fd2ec9c53..34c2addf8 100644
> --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
Since this is a 128b operation, should there be a new file created with the name rte_atomic_128.h?

> @@ -34,6 +34,7 @@
>  /*
>   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
>   * Copyright (c) 1998 Doug Rabson
> + * Copyright (c) 2019 Intel Corporation
>   * All rights reserved.
>   */
> 
> @@ -208,4 +209,25 @@ static inline void rte_atomic64_clear(rte_atomic64_t
> *v)  }  #endif
> 
> +static inline int
> +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t
> +*src) {
The API name suggests it is a 128b operation. Shouldn't 'dst', 'exp' and 'src' be pointers to 128b values (__int128)? Or we could define our own data type.
Since it is a new API, can we define it with memory orderings, which would be more conducive to architectures with relaxed memory ordering? You can refer to [1] and [2] for guidance.
If this is an external API, it requires the 'experimental' tag.

1. https://github.com/ARM-software/progress64/blob/master/src/lockfree/aarch64.h#L63
2. https://github.com/ARM-software/progress64/blob/master/src/lockfree/x86-64.h#L34
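[Editor's note: a sketch of what an ordering-aware signature could look like, shown with C11 stdatomic on 64 bits for brevity. The name and signature below are hypothetical; the 128-bit variant would take an __int128-based type and, on x86-64, compile down to lock cmpxchg16b.]

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical ordering-aware cmpset: like GCC's __atomic builtins,
 * the caller picks the success/failure memory orderings instead of
 * getting the implied full barrier of the __sync builtins. On failure,
 * *exp is updated with the observed value, saving a reload in retry
 * loops. */
static inline bool
cmpset_explicit(_Atomic uint64_t *dst, uint64_t *exp, uint64_t src,
		memory_order success, memory_order failure)
{
	return atomic_compare_exchange_strong_explicit(dst, exp, src,
						       success, failure);
}
```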

> +	uint8_t res;
> +
> +	asm volatile (
> +		      MPLOCKED
> +		      "cmpxchg16b %[dst];"
> +		      " sete %[res]"
> +		      : [dst] "=m" (*dst),
> +			[res] "=r" (res)
> +		      : "c" (src[1]),
> +			"b" (src[0]),
> +			"m" (*dst),
> +			"d" (exp[1]),
> +			"a" (exp[0])
> +		      : "memory");
> +
> +	return res;
> +}
> +
>  #endif /* _RTE_ATOMIC_X86_64_H_ */
> --
> 2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-17 15:14       ` Eads, Gage
@ 2019-01-17 15:57         ` Gavin Hu (Arm Technology China)
  0 siblings, 0 replies; 43+ messages in thread
From: Gavin Hu (Arm Technology China) @ 2019-01-17 15:57 UTC (permalink / raw)
  To: Eads, Gage, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	Honnappa Nagarahalli, nd



> -----Original Message-----
> From: Eads, Gage <gage.eads@intel.com>
> Sent: Thursday, January 17, 2019 11:14 PM
> To: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64
> only)
> 
> 
> 
> > -----Original Message-----
> > From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> > Sent: Thursday, January 17, 2019 2:49 AM
> > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64
> only)
> >
> >
> >
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > > Sent: Wednesday, January 16, 2019 6:33 AM
> > > To: dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > bruce.richardson@intel.com; konstantin.ananyev@intel.com
> > > Subject: [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64
> > > only)
> > >
> > > This operation can be used for non-blocking algorithms, such as a
> > > non-blocking stack or ring.
> > >
> > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > ---
> > >  .../common/include/arch/x86/rte_atomic_64.h        | 22
> > > ++++++++++++++++++++++
> > >  1 file changed, 22 insertions(+)
> > >
> > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > index fd2ec9c53..34c2addf8 100644
> > > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > @@ -34,6 +34,7 @@
> > >  /*
> > >   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
> > >   * Copyright (c) 1998 Doug Rabson
> > > + * Copyright (c) 2019 Intel Corporation
> > >   * All rights reserved.
> > >   */
> > >
> > > @@ -208,4 +209,25 @@ static inline void
> > > rte_atomic64_clear(rte_atomic64_t *v)  }  #endif
> > >
> > > +static inline int
> > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t
> > > *src)
> > > +{
> > > +uint8_t res;
> > > +
> > > +asm volatile (
> > > +      MPLOCKED
> > > +      "cmpxchg16b %[dst];"
> > > +      " sete %[res]"
> > > +      : [dst] "=m" (*dst),
> > > +[res] "=r" (res)
> > > +      : "c" (src[1]),
> > > +"b" (src[0]),
> > > +"m" (*dst),
> > > +"d" (exp[1]),
> > > +"a" (exp[0])
> > > +      : "memory");
> > > +
> > > +return res;
> > > +}
> > > +
> >
> > CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y can't coexist with
> > RTE_FORCE_INTRINSICS=y, this should be explicitly described somewhere
> in the
> > configuration and documentations.
> >
> 
> This patch places rte_atomic128_cmpset() outside of the
> RTE_FORCE_INTRINSICS ifndef, and this file is included regardless of that
> config flag, so it's compiled either way.
> 

Acked-by: Gavin Hu <gavin.hu@arm.com>

> > >  #endif /* _RTE_ATOMIC_X86_64_H_ */
> > > --
> > > 2.13.6
> >

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-17 15:42             ` Gavin Hu (Arm Technology China)
@ 2019-01-17 20:41               ` Eads, Gage
  0 siblings, 0 replies; 43+ messages in thread
From: Eads, Gage @ 2019-01-17 20:41 UTC (permalink / raw)
  To: Gavin Hu (Arm Technology China), Richardson, Bruce
  Cc: dev, olivier.matz, arybchenko, Ananyev, Konstantin,
	Honnappa Nagarahalli, Ruifeng Wang (Arm Technology China),
	Phil Yang (Arm Technology China)



> -----Original Message-----
> From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> Sent: Thursday, January 17, 2019 9:42 AM
> To: Eads, Gage <gage.eads@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: dev@dpdk.org; olivier.matz@6wind.com; arybchenko@solarflare.com;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> <Phil.Yang@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking
> stack mempool
> 
> 
> 
> > -----Original Message-----
> > From: Eads, Gage <gage.eads@intel.com>
> > Sent: Thursday, January 17, 2019 11:16 PM
> > To: Richardson, Bruce <bruce.richardson@intel.com>
> > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; dev@dpdk.org;
> > olivier.matz@6wind.com; arybchenko@solarflare.com; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology China)
> > <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology
> > China) <Phil.Yang@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-
> > blocking stack mempool
> >
> >
> >
> > > -----Original Message-----
> > > From: Richardson, Bruce
> > > Sent: Thursday, January 17, 2019 8:21 AM
> > > To: Eads, Gage <gage.eads@intel.com>
> > > Cc: Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>;
> > dev@dpdk.org;
> > > olivier.matz@6wind.com; arybchenko@solarflare.com; Ananyev,
> > Konstantin
> > > <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> > > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology
> > China)
> > > <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> > > <Phil.Yang@arm.com>
> > > Subject: Re: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-
> > blocking
> > > stack mempool
> > >
> > > On Thu, Jan 17, 2019 at 02:11:22PM +0000, Eads, Gage wrote:
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Gavin Hu (Arm Technology China) [mailto:Gavin.Hu@arm.com]
> > > > > Sent: Thursday, January 17, 2019 2:06 AM
> > > > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > > > Richardson, Bruce <bruce.richardson@intel.com>; Ananyev,
> > > > > Konstantin <konstantin.ananyev@intel.com>; Honnappa Nagarahalli
> > > > > <Honnappa.Nagarahalli@arm.com>; Ruifeng Wang (Arm Technology
> > China)
> > > > > <Ruifeng.Wang@arm.com>; Phil Yang (Arm Technology China)
> > > > > <Phil.Yang@arm.com>
> > > > > Subject: RE: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add
> > > > > non-blocking stack mempool
> > > > >
> > > > >
> > > > > > -----Original Message-----
> > > > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > > > > > Sent: Wednesday, January 16, 2019 6:33 AM
> > > > > > To: dev@dpdk.org
> > > > > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > > > > bruce.richardson@intel.com; konstantin.ananyev@intel.com
> > > > > > Subject: [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add
> > > > > > non-blocking stack mempool
> > > > > >
> > > > > > This commit adds support for non-blocking (linked list based)
> > > > > > stack mempool handler. The stack uses a 128-bit compare-and-
> > swap
> > > > > > instruction, and thus is limited to x86_64. The 128-bit CAS
> > > > > > atomically updates the stack top pointer and a modification
> > > > > > counter, which protects against the ABA problem.
> > > > > >
> > > > > > In mempool_perf_autotest the lock-based stack outperforms the
> > non-
> > > > > > blocking handler*, however:
> > > > > > - For applications with preemptible pthreads, a lock-based stack's
> > > > > >   worst-case performance (i.e. one thread being preempted while
> > > > > >   holding the spinlock) is much worse than the non-blocking stack's.
> > > > > > - Using per-thread mempool caches will largely mitigate the
> > performance
> > > > > >   difference.
> > > > > >
> > > > > > *Test setup: x86_64 build with default config, dual-socket
> > > > > > Xeon
> > > > > > E5-2699 v4, running on isolcpus cores with a tickless scheduler.
> > > > > > The lock-based stack's rate_persec was 1x-3.5x the
> > > > > > non-blocking
> > stack's.
> > > > > >
> > > > > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > > > > ---
> > > > > >  MAINTAINERS                                        |   4 +
> > > > > >  config/common_base                                 |   1 +
> > > > > >  doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
> > > > > >  drivers/mempool/Makefile                           |   3 +
> > > > > >  drivers/mempool/meson.build                        |   5 +
> > > > > >  drivers/mempool/nb_stack/Makefile                  |  23 ++++
> > > > > >  drivers/mempool/nb_stack/meson.build               |   4 +
> > > > > >  drivers/mempool/nb_stack/nb_lifo.h                 | 147
> > > > > > +++++++++++++++++++++
> > > > > >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125
> > > > > > ++++++++++++++++++
> > > > > >  .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
> > > > > >  mk/rte.app.mk                                      |   7 +-
> > > > > >  11 files changed, 326 insertions(+), 2 deletions(-)  create
> > > > > > mode
> > > > > > 100644 drivers/mempool/nb_stack/Makefile  create mode 100644
> > > > > > drivers/mempool/nb_stack/meson.build
> > > > > >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> > > > > >  create mode 100644
> > > > > > drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > > > > >  create mode 100644
> > > > > > drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> > > > > >
> > > > > > diff --git a/MAINTAINERS b/MAINTAINERS index
> > 470f36b9c..5519d3323
> > > > > > 100644
> > > > > > --- a/MAINTAINERS
> > > > > > +++ b/MAINTAINERS
> > > > > > @@ -416,6 +416,10 @@ M: Artem V. Andreev
> > > > > > <artem.andreev@oktetlabs.ru>
> > > > > >  M: Andrew Rybchenko <arybchenko@solarflare.com>
> > > > > >  F: drivers/mempool/bucket/
> > > > > >
> > > > > > +Non-blocking stack memory pool
> > > > > > +M: Gage Eads <gage.eads@intel.com>
> > > > > > +F: drivers/mempool/nb_stack/
> > > > > > +
> > > > > >
> > > > > >  Bus Drivers
> > > > > >  -----------
> > > > > > diff --git a/config/common_base b/config/common_base index
> > > > > > 964a6956e..8a51f36b1 100644
> > > > > > --- a/config/common_base
> > > > > > +++ b/config/common_base
> > > > > > @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n
> > #
> > > > > > CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> > > > > >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> > > > > > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> > > > >
> > > > > NAK,  as this applies to x86_64 only, it will break arm/ppc and
> > > > > even 32bit i386 configurations.
> > > > >
> > > >
> > > > Hi Gavin,
> > > >
> > > > This patch resolves that in the make and meson build files, which
> > ensure that
> > > the library is only built for x86-64 targets:
> 
> Looking down to the changes with Makefile and meson.build, it will be compiled
> out for arm/ppc/i386. That works at least.
> But having this entry in the arm/ppc/i386 configurations is very strange, since
> they have no such implementations.
> Why not put it into defconfig_x86_64-native-linuxapp-icc/gcc/clang to limit the
> scope?
> 

Certainly, that's reasonable -- it simply slipped my mind. I'll address this in the next version.

Thanks,
Gage


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-17 15:45       ` Honnappa Nagarahalli
@ 2019-01-17 23:03         ` Eads, Gage
  2019-01-18  5:27           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 43+ messages in thread
From: Eads, Gage @ 2019-01-17 23:03 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin, nd, nd



> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, January 17, 2019 9:45 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> > Subject: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64
> > only)
> >
> > This operation can be used for non-blocking algorithms, such as a non-
> > blocking stack or ring.
> >
> > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > ---
> >  .../common/include/arch/x86/rte_atomic_64.h        | 22
> > ++++++++++++++++++++++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > index fd2ec9c53..34c2addf8 100644
> > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> Since this is a 128b operation should there be a new file created with the name
> rte_atomic_128.h?
> 
> > @@ -34,6 +34,7 @@
> >  /*
> >   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
> >   * Copyright (c) 1998 Doug Rabson
> > + * Copyright (c) 2019 Intel Corporation
> >   * All rights reserved.
> >   */
> >
> > @@ -208,4 +209,25 @@ static inline void
> > rte_atomic64_clear(rte_atomic64_t
> > *v)  }  #endif
> >
> > +static inline int
> > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp, uint64_t
> > +*src) {
> The API name suggests it is a 128b operation. 'dst', 'exp' and 'src' should be
> pointers to 128b (__int128)? Or we could define our own data type.

I agree; I'm not a big fan of the 64b pointers here. I avoided __int128 originally because it fails to compile with -pedantic, but on second thought (and with your suggestion of a separate data type), we can resolve that with this typedef:

typedef struct {
        RTE_STD_C11 __int128 val;
} rte_int128_t;
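[Editor's note: for illustration, the two-word view that the patch's cmpxchg16b asm relies on -- exp[0]/src[0] bound to rax/rbx as the low word, exp[1]/src[1] bound to rdx/rcx as the high word -- follows from the little-endian layout of a 128-bit value on x86-64. A minimal sketch, with hypothetical names, assuming GCC's __int128 on a little-endian x86-64 target:]

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical 128-bit wrapper, mirroring the rte_int128_t idea. */
typedef struct {
	__int128 val;
} int128_sketch;

/* Build a 128-bit value from high/low 64-bit halves. */
static int128_sketch
make128(uint64_t hi, uint64_t lo)
{
	int128_sketch r;
	r.val = ((__int128)hi << 64) | lo;
	return r;
}

/* View the value as two 64-bit words, as the asm's exp[]/src[] arrays
 * do. On little-endian x86-64, word 0 is the low half. */
static void
split_halves(int128_sketch v, uint64_t out[2])
{
	memcpy(out, &v.val, sizeof(v.val));
}
```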

> Since, it is a new API, can we define it with memory orderings which will be more
> conducive to relaxed memory ordering based architectures? You can refer to [1]
> and [2] for guidance.

I certainly see the value in controlling the operation's memory ordering, like in the __atomic intrinsics, but I'm not sure this patchset is the right place to address that. I see that work going one of a couple of ways:
1. Expand the existing rte_atomicN_* interfaces with additional arguments. In that case, I'd prefer this be done in a separate patchset that addresses all the atomic operations, not just cmpset, so the interface changes are chosen according to the needs of the full set of atomic operations. If this approach is taken then there's no need to solve this while rte_atomic128_cmpset is experimental, since all the other functions are non-experimental anyway.

- Or -

2. Don't modify the existing rte_atomicN_* interfaces (or their strongly ordered behavior), and instead create new versions of them that take additional arguments. In this case, we can implement rte_atomic128_cmpset() as is and create a more flexible version in a later patchset.

Either way, I think the current interface (w.r.t. memory ordering options) can work and still leaves us in a good position for future changes/improvements.

> If this an external API, it requires 'experimental' tag.

Good catch -- will fix.

> 
> 1. https://github.com/ARM-
> software/progress64/blob/master/src/lockfree/aarch64.h#L63

I didn't know about aarch64's CASP instruction -- very cool! 

> 2. https://github.com/ARM-
> software/progress64/blob/master/src/lockfree/x86-64.h#L34
> 
> > +	uint8_t res;
> > +
> > +	asm volatile (
> > +		      MPLOCKED
> > +		      "cmpxchg16b %[dst];"
> > +		      " sete %[res]"
> > +		      : [dst] "=m" (*dst),
> > +			[res] "=r" (res)
> > +		      : "c" (src[1]),
> > +			"b" (src[0]),
> > +			"m" (*dst),
> > +			"d" (exp[1]),
> > +			"a" (exp[0])
> > +		      : "memory");
> > +
> > +	return res;
> > +}
> > +
> >  #endif /* _RTE_ATOMIC_X86_64_H_ */
> > --
> > 2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-17 15:36       ` [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
@ 2019-01-18  5:05         ` Honnappa Nagarahalli
  2019-01-18 20:09           ` Eads, Gage
  2019-01-19  0:00           ` Eads, Gage
  0 siblings, 2 replies; 43+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-18  5:05 UTC (permalink / raw)
  To: Gage Eads, dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	Gavin Hu (Arm Technology China),
	nd, Honnappa Nagarahalli, nd

Hi Gage,
     Thank you for your contribution on non-blocking data structures. I think they are important to extend DPDK into additional use cases.

I am wondering if it makes sense to decouple the NB stack data structure from the mempool driver (similar to rte_ring)? I see that the stack-based mempool implements the stack data structure in the driver. But the NB stack is not such a trivial data structure, and it might be useful to applications or other use cases as well.

I also suggest that we use the C11 __atomic_xxx APIs for memory operations. The rte_atomic64_xxx APIs use the __sync_xxx builtins, which do not provide the capability to express memory orderings.

Please find few comments inline.

> 
> This commit adds support for non-blocking (linked list based) stack
> mempool handler. The stack uses a 128-bit compare-and-swap instruction,
> and thus is limited to x86_64. The 128-bit CAS atomically updates the stack
> top pointer and a modification counter, which protects against the ABA
> problem.
> 
> In mempool_perf_autotest the lock-based stack outperforms the non-
> blocking handler*, however:
> - For applications with preemptible pthreads, a lock-based stack's
>   worst-case performance (i.e. one thread being preempted while
>   holding the spinlock) is much worse than the non-blocking stack's.
> - Using per-thread mempool caches will largely mitigate the performance
>   difference.
> 
> *Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
> running on isolcpus cores with a tickless scheduler. The lock-based stack's
> rate_persec was 1x-3.5x the non-blocking stack's.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
> ---
>  MAINTAINERS                                        |   4 +
>  config/common_base                                 |   1 +
>  doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
>  drivers/mempool/Makefile                           |   3 +
>  drivers/mempool/meson.build                        |   3 +-
>  drivers/mempool/nb_stack/Makefile                  |  23 ++++
>  drivers/mempool/nb_stack/meson.build               |   6 +
>  drivers/mempool/nb_stack/nb_lifo.h                 | 147
> +++++++++++++++++++++
>  drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125
> ++++++++++++++++++
>  .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
>  mk/rte.app.mk                                      |   7 +-
>  11 files changed, 325 insertions(+), 3 deletions(-)  create mode 100644
> drivers/mempool/nb_stack/Makefile  create mode 100644
> drivers/mempool/nb_stack/meson.build
>  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
>  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
>  create mode 100644
> drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> 
> diff --git a/MAINTAINERS b/MAINTAINERS
> index 470f36b9c..5519d3323 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
>  M: Andrew Rybchenko <arybchenko@solarflare.com>
>  F: drivers/mempool/bucket/
> 
> +Non-blocking stack memory pool
> +M: Gage Eads <gage.eads@intel.com>
> +F: drivers/mempool/nb_stack/
> +
> 
>  Bus Drivers
>  -----------
> diff --git a/config/common_base b/config/common_base index
> 964a6956e..8a51f36b1 100644
> --- a/config/common_base
> +++ b/config/common_base
> @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n  #
> CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
>  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
>  CONFIG_RTE_DRIVER_MEMPOOL_RING=y
>  CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
> 
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst
> b/doc/guides/prog_guide/env_abstraction_layer.rst
> index 929d76dba..9497b879c 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -541,6 +541,11 @@ Known Issues
> 
>    5. It MUST not be used by multi-producer/consumer pthreads, whose
> scheduling policies are SCHED_FIFO or SCHED_RR.
> 
> +  Alternatively, x86_64 applications can use the non-blocking stack
> mempool handler. When considering this handler, note that:
> +
> +  - it is limited to the x86_64 platform, because it uses an instruction (16-
> byte compare-and-swap) that is not available on other platforms.
                                                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The Arm architecture supports similar instructions. I suggest simplifying this statement to indicate that 'the nb_stack feature is currently available only for x86_64 platforms'.

> +  - it has worse average-case performance than the non-preemptive
> rte_ring, but software caching (e.g. the mempool cache) can mitigate this by
> reducing the number of handler operations.
> +
>  + rte_timer
> 
>    Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed.
> However, resetting/stopping the timer from a non-EAL pthread is allowed.
> diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile index
> 28c2e8360..895cf8a34 100644
> --- a/drivers/mempool/Makefile
> +++ b/drivers/mempool/Makefile
> @@ -10,6 +10,9 @@ endif
>  ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
>  DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2  endif
> +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack endif
>  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
>  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
>  DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx diff --git
> a/drivers/mempool/meson.build b/drivers/mempool/meson.build index
> 4527d9806..220cfaf63 100644
> --- a/drivers/mempool/meson.build
> +++ b/drivers/mempool/meson.build
> @@ -1,7 +1,8 @@
>  # SPDX-License-Identifier: BSD-3-Clause  # Copyright(c) 2017 Intel
> Corporation
> 
> -drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
> +drivers = ['bucket', 'dpaa', 'dpaa2', 'nb_stack', 'octeontx', 'ring',
> +'stack']
> +
>  std_deps = ['mempool']
>  config_flag_fmt = 'RTE_LIBRTE_@0@_MEMPOOL'
>  driver_name_fmt = 'rte_mempool_@0@'
> diff --git a/drivers/mempool/nb_stack/Makefile
> b/drivers/mempool/nb_stack/Makefile
> new file mode 100644
> index 000000000..318b18283
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/Makefile
> @@ -0,0 +1,23 @@
> +# SPDX-License-Identifier: BSD-3-Clause # Copyright(c) 2019 Intel
> +Corporation
> +
> +include $(RTE_SDK)/mk/rte.vars.mk
> +
> +#
> +# library name
> +#
> +LIB = librte_mempool_nb_stack.a
> +
> +CFLAGS += -O3
> +CFLAGS += $(WERROR_FLAGS)
> +
> +# Headers
> +LDLIBS += -lrte_eal -lrte_mempool
> +
> +EXPORT_MAP := rte_mempool_nb_stack_version.map
> +
> +LIBABIVER := 1
> +
> +SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) +=
> rte_mempool_nb_stack.c
> +
> +include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/drivers/mempool/nb_stack/meson.build
> b/drivers/mempool/nb_stack/meson.build
> new file mode 100644
> index 000000000..7dec72242
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/meson.build
> @@ -0,0 +1,6 @@
> +# SPDX-License-Identifier: BSD-3-Clause # Copyright(c) 2019 Intel
> +Corporation
> +
> +build = dpdk_conf.has('RTE_ARCH_X86_64')
> +
> +sources = files('rte_mempool_nb_stack.c')
> diff --git a/drivers/mempool/nb_stack/nb_lifo.h
> b/drivers/mempool/nb_stack/nb_lifo.h
> new file mode 100644
> index 000000000..ad4a3401f
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/nb_lifo.h
> @@ -0,0 +1,147 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#ifndef _NB_LIFO_H_
> +#define _NB_LIFO_H_
> +
> +struct nb_lifo_elem {
> +	void *data;
> +	struct nb_lifo_elem *next;
> +};
> +
> +struct nb_lifo_head {
> +	struct nb_lifo_elem *top; /**< Stack top */
> +	uint64_t cnt; /**< Modification counter */ };
Minor comment: mentioning the ABA problem in the comment for 'cnt' would be helpful.

> +
> +struct nb_lifo {
> +	volatile struct nb_lifo_head head __rte_aligned(16);
> +	rte_atomic64_t len;
> +} __rte_cache_aligned;
> +
> +static __rte_always_inline void
> +nb_lifo_init(struct nb_lifo *lifo)
> +{
> +	memset(lifo, 0, sizeof(*lifo));
> +	rte_atomic64_set(&lifo->len, 0);
> +}
> +
> +static __rte_always_inline unsigned int nb_lifo_len(struct nb_lifo
> +*lifo) {
> +	/* nb_lifo_push() and nb_lifo_pop() do not update the list's
> contents
> +	 * and lifo->len atomically, which can cause the list to appear
> shorter
> +	 * than it actually is if this function is called while other threads
> +	 * are modifying the list.
> +	 *
> +	 * However, given the inherently approximate nature of the
> get_count
> +	 * callback -- even if the list and its size were updated atomically,
> +	 * the size could change between when get_count executes and
> when the
> +	 * value is returned to the caller -- this is acceptable.
> +	 *
> +	 * The lifo->len updates are placed such that the list may appear to
> +	 * have fewer elements than it does, but will never appear to have
> more
> +	 * elements. If the mempool is near-empty to the point that this is a
> +	 * concern, the user should consider increasing the mempool size.
> +	 */
> +	return (unsigned int)rte_atomic64_read(&lifo->len);
> +}
> +
> +static __rte_always_inline void
> +nb_lifo_push(struct nb_lifo *lifo,
> +	     struct nb_lifo_elem *first,
> +	     struct nb_lifo_elem *last,
> +	     unsigned int num)
> +{
> +	while (1) {
> +		struct nb_lifo_head old_head, new_head;
> +
> +		old_head = lifo->head;
> +
> +		/* Swing the top pointer to the first element in the list and
> +		 * make the last element point to the old top.
> +		 */
> +		new_head.top = first;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		last->next = old_head.top;
> +
> +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> +					 (uint64_t *)&old_head,
> +					 (uint64_t *)&new_head))
> +			break;
> +	}
Minor comment: this can be a do-while loop (for example, similar to the one in __rte_ring_move_prod_head).

> +
> +	rte_atomic64_add(&lifo->len, num);
> +}
> +
> +static __rte_always_inline void
> +nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem) {
> +	nb_lifo_push(lifo, elem, elem, 1);
> +}
> +
> +static __rte_always_inline struct nb_lifo_elem * nb_lifo_pop(struct
> +nb_lifo *lifo,
> +	    unsigned int num,
> +	    void **obj_table,
> +	    struct nb_lifo_elem **last)
> +{
> +	struct nb_lifo_head old_head;
> +
> +	/* Reserve num elements, if available */
> +	while (1) {
> +		uint64_t len = rte_atomic64_read(&lifo->len);
> +
> +		/* Does the list contain enough elements? */
> +		if (len < num)
> +			return NULL;
> +
> +		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
> +					len, len - num))
> +			break;
> +	}
> +
> +	/* Pop num elements */
> +	while (1) {
> +		struct nb_lifo_head new_head;
> +		struct nb_lifo_elem *tmp;
> +		unsigned int i;
> +
> +		old_head = lifo->head;
> +
> +		tmp = old_head.top;
> +
> +		/* Traverse the list to find the new head. A next pointer will
> +		 * either point to another element or NULL; if a thread
> +		 * encounters a pointer that has already been popped, the
> CAS
> +		 * will fail.
> +		 */
> +		for (i = 0; i < num && tmp != NULL; i++) {
> +			if (obj_table)
This 'if' check can be moved outside the for loop. Maybe use RTE_ASSERT at the beginning of the function?

> +				obj_table[i] = tmp->data;
> +			if (last)
> +				*last = tmp;
> +			tmp = tmp->next;
> +		}
> +
> +		/* If NULL was encountered, the list was modified while
> +		 * traversing it. Retry.
> +		 */
> +		if (i != num)
> +			continue;
> +
> +		new_head.top = tmp;
> +		new_head.cnt = old_head.cnt + 1;
> +
> +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> +					 (uint64_t *)&old_head,
> +					 (uint64_t *)&new_head))
> +			break;
> +	}
> +
> +	return old_head.top;
> +}
> +
> +#endif /* _NB_LIFO_H_ */
> diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> new file mode 100644
> index 000000000..1818a2cfa
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> @@ -0,0 +1,125 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2019 Intel Corporation
> + */
> +
> +#include <stdio.h>
> +#include <rte_mempool.h>
> +#include <rte_malloc.h>
> +
> +#include "nb_lifo.h"
> +
> +struct rte_mempool_nb_stack {
> +	uint64_t size;
> +	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers  */
> +	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements
> */
> +};
> +
> +static int
> +nb_stack_alloc(struct rte_mempool *mp)
> +{
> +	struct rte_mempool_nb_stack *s;
> +	struct nb_lifo_elem *elems;
> +	unsigned int n = mp->size;
> +	unsigned int size, i;
> +
> +	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
IMO, the allocation of the stack elements can be moved into the nb_lifo_init API; that would make the NB stack code more modular.

> +
> +	/* Allocate our local memory structure */
> +	s = rte_zmalloc_socket("mempool-nb_stack",
> +			       size,
> +			       RTE_CACHE_LINE_SIZE,
> +			       mp->socket_id);
> +	if (s == NULL) {
> +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
> +		return -ENOMEM;
> +	}
> +
> +	s->size = n;
> +
> +	nb_lifo_init(&s->used_lifo);
> +	nb_lifo_init(&s->free_lifo);
> +
> +	elems = (struct nb_lifo_elem *)&s[1];
> +	for (i = 0; i < n; i++)
> +		nb_lifo_push_single(&s->free_lifo, &elems[i]);
This can also be moved into the nb_lifo_init API.

> +
> +	mp->pool_data = s;
> +
> +	return 0;
> +}
> +
> +static int
> +nb_stack_enqueue(struct rte_mempool *mp, void * const *obj_table,
> +		 unsigned int n)
> +{
> +	struct rte_mempool_nb_stack *s = mp->pool_data;
> +	struct nb_lifo_elem *first, *last, *tmp;
> +	unsigned int i;
> +
> +	if (unlikely(n == 0))
> +		return 0;
> +
> +	/* Pop n free elements */
> +	first = nb_lifo_pop(&s->free_lifo, n, NULL, NULL);
> +	if (unlikely(first == NULL))
> +		return -ENOBUFS;
> +
> +	/* Prepare the list elements */
> +	tmp = first;
> +	for (i = 0; i < n; i++) {
> +		tmp->data = obj_table[i];
> +		last = tmp;
> +		tmp = tmp->next;
> +	}
> +
> +	/* Enqueue them to the used list */
> +	nb_lifo_push(&s->used_lifo, first, last, n);
> +
> +	return 0;
> +}
> +
> +static int
> +nb_stack_dequeue(struct rte_mempool *mp, void **obj_table,
> +		 unsigned int n)
> +{
> +	struct rte_mempool_nb_stack *s = mp->pool_data;
> +	struct nb_lifo_elem *first, *last;
> +
> +	if (unlikely(n == 0))
> +		return 0;
> +
> +	/* Pop n used elements */
> +	first = nb_lifo_pop(&s->used_lifo, n, obj_table, &last);
> +	if (unlikely(first == NULL))
> +		return -ENOENT;
> +
> +	/* Enqueue the list elements to the free list */
> +	nb_lifo_push(&s->free_lifo, first, last, n);
> +
> +	return 0;
> +}
> +
> +static unsigned
> +nb_stack_get_count(const struct rte_mempool *mp) {
> +	struct rte_mempool_nb_stack *s = mp->pool_data;
> +
> +	return nb_lifo_len(&s->used_lifo);
> +}
> +
> +static void
> +nb_stack_free(struct rte_mempool *mp)
> +{
> +	rte_free(mp->pool_data);
> +}
> +
> +static struct rte_mempool_ops ops_nb_stack = {
> +	.name = "nb_stack",
> +	.alloc = nb_stack_alloc,
> +	.free = nb_stack_free,
> +	.enqueue = nb_stack_enqueue,
> +	.dequeue = nb_stack_dequeue,
> +	.get_count = nb_stack_get_count
> +};
> +
> +MEMPOOL_REGISTER_OPS(ops_nb_stack);
> diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> new file mode 100644
> index 000000000..fc8c95e91
> --- /dev/null
> +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> @@ -0,0 +1,4 @@
> +DPDK_19.05 {
> +
> +	local: *;
> +};
> diff --git a/mk/rte.app.mk b/mk/rte.app.mk index 02e8b6f05..d4b4aaaf6
> 100644
> --- a/mk/rte.app.mk
> +++ b/mk/rte.app.mk
> @@ -131,8 +131,11 @@ endif
>  ifeq ($(CONFIG_RTE_BUILD_SHARED_LIB),n)
>  # plugins (link only if static libraries)
> 
> -_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET) += -
> lrte_mempool_bucket
> -_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)  += -
> lrte_mempool_stack
> +_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_BUCKET)   += -
> lrte_mempool_bucket
> +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> +_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += -
> lrte_mempool_nb_stack
> +endif
> +_LDLIBS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK)    += -
> lrte_mempool_stack
>  ifeq ($(CONFIG_RTE_LIBRTE_DPAA_BUS),y)
>  _LDLIBS-$(CONFIG_RTE_LIBRTE_DPAA_MEMPOOL)   += -lrte_mempool_dpaa
>  endif
> --
> 2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-17 23:03         ` Eads, Gage
@ 2019-01-18  5:27           ` Honnappa Nagarahalli
  2019-01-18 22:01             ` Eads, Gage
  0 siblings, 1 reply; 43+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-18  5:27 UTC (permalink / raw)
  To: Eads, Gage, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin, nd, nd

> > >
> > > This operation can be used for non-blocking algorithms, such as a
> > > non- blocking stack or ring.
> > >
> > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > ---
> > >  .../common/include/arch/x86/rte_atomic_64.h        | 22
> > > ++++++++++++++++++++++
> > >  1 file changed, 22 insertions(+)
> > >
> > > diff --git a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > index fd2ec9c53..34c2addf8 100644
> > > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > Since this is a 128b operation should there be a new file created with
> > the name rte_atomic_128.h?
> >
> > > @@ -34,6 +34,7 @@
> > >  /*
> > >   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
> > >   * Copyright (c) 1998 Doug Rabson
> > > + * Copyright (c) 2019 Intel Corporation
> > >   * All rights reserved.
> > >   */
> > >
> > > @@ -208,4 +209,25 @@ static inline void
> > > rte_atomic64_clear(rte_atomic64_t
> > > *v)  }  #endif
> > >
> > > +static inline int
> > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > +uint64_t
> > > +*src) {
> > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > should be pointers to 128b (__int128)? Or we could define our own data
> type.
> 
> I agree, I'm not a big fan of the 64b pointers here. I avoided __int128
> originally because it fails to compile with -pedantic, but on second thought
> (and with your suggestion of a separate data type), we can resolve that with
> this typedef:
> 
> typedef struct {
>         RTE_STD_C11 __int128 val;
> } rte_int128_t;
ok

> 
> > Since, it is a new API, can we define it with memory orderings which
> > will be more conducive to relaxed memory ordering based architectures?
> > You can refer to [1] and [2] for guidance.
> 
> I certainly see the value in controlling the operation's memory ordering, like in
> the __atomic intrinsics, but I'm not sure this patchset is the right place to
> address that. I see that work going a couple ways:
> 1. Expand the existing rte_atomicN_* interfaces with additional arguments. In
> that case, I'd prefer this be done in a separate patchset that addresses all the
> atomic operations, not just cmpset, so the interface changes are chosen
> according to the needs of the full set of atomic operations. If this approach is
> taken then there's no need to solve this while rte_atomic128_cmpset is
> experimental, since all the other functions are non-experimental anyway.
> 
> - Or -
> 
> 2. Don't modify the existing rte_atomicN_* interfaces (or their strongly
> ordered behavior), and instead create new versions of them that take
> additional arguments. In this case, we can implement rte_atomic128_cmpset()
> as is and create a more flexible version in a later patchset.
> 
> Either way, I think the current interface (w.r.t. memory ordering options) can
> work and still leaves us in a good position for future changes/improvements.
> 
I do not see the need to modify/extend the existing rte_atomicN_* APIs, as the corresponding __atomic intrinsics serve as replacements. I expect that at some point the DPDK code base will not be using the rte_atomicN_* APIs at all.
However, the __atomic intrinsics do not support 128-bit wide parameters, hence DPDK needs to define its own API. Since this is the first API in that regard, I prefer that we start with a signature that resembles the __atomic intrinsics, which have proven to provide the best flexibility across all the platforms supported by DPDK.

> > If this an external API, it requires 'experimental' tag.
> 
> Good catch -- will fix.
> 
> >
> > 1. https://github.com/ARM-
> > software/progress64/blob/master/src/lockfree/aarch64.h#L63
> 
> I didn't know about aarch64's CASP instruction -- very cool!
> 
> > 2. https://github.com/ARM-
> > software/progress64/blob/master/src/lockfree/x86-64.h#L34
> >
> > > +	uint8_t res;
> > > +
> > > +	asm volatile (
> > > +		      MPLOCKED
> > > +		      "cmpxchg16b %[dst];"
> > > +		      " sete %[res]"
> > > +		      : [dst] "=m" (*dst),
> > > +			[res] "=r" (res)
> > > +		      : "c" (src[1]),
> > > +			"b" (src[0]),
> > > +			"m" (*dst),
> > > +			"d" (exp[1]),
> > > +			"a" (exp[0])
> > > +		      : "memory");
> > > +
> > > +	return res;
> > > +}
> > > +
> > >  #endif /* _RTE_ATOMIC_X86_64_H_ */
> > > --
> > > 2.13.6

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-18  5:05         ` Honnappa Nagarahalli
@ 2019-01-18 20:09           ` Eads, Gage
  2019-01-19  0:00           ` Eads, Gage
  1 sibling, 0 replies; 43+ messages in thread
From: Eads, Gage @ 2019-01-18 20:09 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	nd, nd



> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, January 17, 2019 11:05 PM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking
> stack mempool
> 
> Hi Gage,
>      Thank you for your contribution on non-blocking data structures. I think they
> are important to extend DPDK into additional use cases.
> 

Glad to hear it. Be sure to check out my non-blocking ring patchset as well, if you haven't already: http://mails.dpdk.org/archives/dev/2019-January/123774.html

> I am wondering if it makes sense to decouple the NB stack data structure from
> mempool driver (similar to rte_ring)? I see that stack based mempool
> implements the stack data structure in the driver. But, NB stack might not be
> such a trivial data structure. It might be useful for the applications or other use
> cases as well.
> 

I agree -- and you're not the first to suggest this :).

I'm going to defer that work to a later patchset; creating a new lib/ directory requires tech board approval (IIRC), which would unnecessarily delay getting this mempool handler merged.

> I also suggest that we use C11 __atomic_xxx APIs for memory operations. The
> rte_atomic64_xxx APIs use __sync_xxx APIs which do not provide the capability
> to express memory orderings.
> 

Ok, I will add those (dependent on RTE_USE_C11_MEM_MODEL).

> Please find few comments inline.
> 
> >
> > This commit adds support for non-blocking (linked list based) stack
> > mempool handler. The stack uses a 128-bit compare-and-swap
> > instruction, and thus is limited to x86_64. The 128-bit CAS atomically
> > updates the stack top pointer and a modification counter, which
> > protects against the ABA problem.
> >
> > In mempool_perf_autotest the lock-based stack outperforms the non-
> > blocking handler*, however:
> > - For applications with preemptible pthreads, a lock-based stack's
> >   worst-case performance (i.e. one thread being preempted while
> >   holding the spinlock) is much worse than the non-blocking stack's.
> > - Using per-thread mempool caches will largely mitigate the performance
> >   difference.
> >
> > *Test setup: x86_64 build with default config, dual-socket Xeon
> > E5-2699 v4, running on isolcpus cores with a tickless scheduler. The
> > lock-based stack's rate_persec was 1x-3.5x the non-blocking stack's.
> >
> > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
> > ---
> >  MAINTAINERS                                        |   4 +
> >  config/common_base                                 |   1 +
> >  doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
> >  drivers/mempool/Makefile                           |   3 +
> >  drivers/mempool/meson.build                        |   3 +-
> >  drivers/mempool/nb_stack/Makefile                  |  23 ++++
> >  drivers/mempool/nb_stack/meson.build               |   6 +
> >  drivers/mempool/nb_stack/nb_lifo.h                 | 147
> > +++++++++++++++++++++
> >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125
> > ++++++++++++++++++
> >  .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
> >  mk/rte.app.mk                                      |   7 +-
> >  11 files changed, 325 insertions(+), 3 deletions(-)  create mode
> > 100644 drivers/mempool/nb_stack/Makefile  create mode 100644
> > drivers/mempool/nb_stack/meson.build
> >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> >  create mode 100644
> > drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> >
> > diff --git a/MAINTAINERS b/MAINTAINERS index 470f36b9c..5519d3323
> > 100644
> > --- a/MAINTAINERS
> > +++ b/MAINTAINERS
> > @@ -416,6 +416,10 @@ M: Artem V. Andreev <artem.andreev@oktetlabs.ru>
> >  M: Andrew Rybchenko <arybchenko@solarflare.com>
> >  F: drivers/mempool/bucket/
> >
> > +Non-blocking stack memory pool
> > +M: Gage Eads <gage.eads@intel.com>
> > +F: drivers/mempool/nb_stack/
> > +
> >
> >  Bus Drivers
> >  -----------
> > diff --git a/config/common_base b/config/common_base index
> > 964a6956e..8a51f36b1 100644
> > --- a/config/common_base
> > +++ b/config/common_base
> > @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n  #
> > CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> >  CONFIG_RTE_DRIVER_MEMPOOL_RING=y
> >  CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
> >
> > diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst
> > b/doc/guides/prog_guide/env_abstraction_layer.rst
> > index 929d76dba..9497b879c 100644
> > --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> > +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> > @@ -541,6 +541,11 @@ Known Issues
> >
> >    5. It MUST not be used by multi-producer/consumer pthreads, whose
> > scheduling policies are SCHED_FIFO or SCHED_RR.
> >
> > +  Alternatively, x86_64 applications can use the non-blocking stack
> > mempool handler. When considering this handler, note that:
> > +
> > +  - it is limited to the x86_64 platform, because it uses an
> > + instruction (16-
> > byte compare-and-swap) that is not available on other platforms.
>                                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> The Arm architecture supports similar instructions. I suggest simplifying this
> statement to indicate that 'the nb_stack feature is currently available only
> for x86_64 platforms'
> 

Will do.

> > +  - it has worse average-case performance than the non-preemptive
> > rte_ring, but software caching (e.g. the mempool cache) can mitigate
> > this by reducing the number of handler operations.
> > +
> >  + rte_timer
> >
> >    Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed.
> > However, resetting/stopping the timer from a non-EAL pthread is allowed.
> > diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile index
> > 28c2e8360..895cf8a34 100644
> > --- a/drivers/mempool/Makefile
> > +++ b/drivers/mempool/Makefile
> > @@ -10,6 +10,9 @@ endif
> >  ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
> >  DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2  endif
> > +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> > +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack endif
> >  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
> >  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
> >  DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx diff --git
> > a/drivers/mempool/meson.build b/drivers/mempool/meson.build index
> > 4527d9806..220cfaf63 100644
> > --- a/drivers/mempool/meson.build
> > +++ b/drivers/mempool/meson.build
> > @@ -1,7 +1,8 @@
> >  # SPDX-License-Identifier: BSD-3-Clause  # Copyright(c) 2017 Intel
> > Corporation
> >
> > -drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
> > +drivers = ['bucket', 'dpaa', 'dpaa2', 'nb_stack', 'octeontx', 'ring',
> > +'stack']
> > +
> >  std_deps = ['mempool']
> >  config_flag_fmt = 'RTE_LIBRTE_@0@_MEMPOOL'
> >  driver_name_fmt = 'rte_mempool_@0@'
> > diff --git a/drivers/mempool/nb_stack/Makefile
> > b/drivers/mempool/nb_stack/Makefile
> > new file mode 100644
> > index 000000000..318b18283
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/Makefile
> > @@ -0,0 +1,23 @@
> > +# SPDX-License-Identifier: BSD-3-Clause # Copyright(c) 2019 Intel
> > +Corporation
> > +
> > +include $(RTE_SDK)/mk/rte.vars.mk
> > +
> > +#
> > +# library name
> > +#
> > +LIB = librte_mempool_nb_stack.a
> > +
> > +CFLAGS += -O3
> > +CFLAGS += $(WERROR_FLAGS)
> > +
> > +# Headers
> > +LDLIBS += -lrte_eal -lrte_mempool
> > +
> > +EXPORT_MAP := rte_mempool_nb_stack_version.map
> > +
> > +LIBABIVER := 1
> > +
> > +SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) +=
> > rte_mempool_nb_stack.c
> > +
> > +include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/drivers/mempool/nb_stack/meson.build
> > b/drivers/mempool/nb_stack/meson.build
> > new file mode 100644
> > index 000000000..7dec72242
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/meson.build
> > @@ -0,0 +1,6 @@
> > +# SPDX-License-Identifier: BSD-3-Clause # Copyright(c) 2019 Intel
> > +Corporation
> > +
> > +build = dpdk_conf.has('RTE_ARCH_X86_64')
> > +
> > +sources = files('rte_mempool_nb_stack.c')
> > diff --git a/drivers/mempool/nb_stack/nb_lifo.h
> > b/drivers/mempool/nb_stack/nb_lifo.h
> > new file mode 100644
> > index 000000000..ad4a3401f
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/nb_lifo.h
> > @@ -0,0 +1,147 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2019 Intel Corporation  */
> > +
> > +#ifndef _NB_LIFO_H_
> > +#define _NB_LIFO_H_
> > +
> > +struct nb_lifo_elem {
> > +	void *data;
> > +	struct nb_lifo_elem *next;
> > +};
> > +
> > +struct nb_lifo_head {
> > +	struct nb_lifo_elem *top; /**< Stack top */
> > +	uint64_t cnt; /**< Modification counter */ };
> Minor comment: mentioning the ABA problem in the comment for 'cnt' would be
> helpful.
> 

Sure.

> > +
> > +struct nb_lifo {
> > +	volatile struct nb_lifo_head head __rte_aligned(16);
> > +	rte_atomic64_t len;
> > +} __rte_cache_aligned;
> > +
> > +static __rte_always_inline void
> > +nb_lifo_init(struct nb_lifo *lifo)
> > +{
> > +	memset(lifo, 0, sizeof(*lifo));
> > +	rte_atomic64_set(&lifo->len, 0);
> > +}
> > +
> > +static __rte_always_inline unsigned int
> > +nb_lifo_len(struct nb_lifo *lifo)
> > +{
> > +	/* nb_lifo_push() and nb_lifo_pop() do not update the list's contents
> > +	 * and lifo->len atomically, which can cause the list to appear shorter
> > +	 * than it actually is if this function is called while other threads
> > +	 * are modifying the list.
> > +	 *
> > +	 * However, given the inherently approximate nature of the get_count
> > +	 * callback -- even if the list and its size were updated atomically,
> > +	 * the size could change between when get_count executes and when the
> > +	 * value is returned to the caller -- this is acceptable.
> > +	 *
> > +	 * The lifo->len updates are placed such that the list may appear to
> > +	 * have fewer elements than it does, but will never appear to have more
> > +	 * elements. If the mempool is near-empty to the point that this is a
> > +	 * concern, the user should consider increasing the mempool size.
> > +	 */
> > +	return (unsigned int)rte_atomic64_read(&lifo->len);
> > +}
> > +
> > +static __rte_always_inline void
> > +nb_lifo_push(struct nb_lifo *lifo,
> > +	     struct nb_lifo_elem *first,
> > +	     struct nb_lifo_elem *last,
> > +	     unsigned int num)
> > +{
> > +	while (1) {
> > +		struct nb_lifo_head old_head, new_head;
> > +
> > +		old_head = lifo->head;
> > +
> > +		/* Swing the top pointer to the first element in the list and
> > +		 * make the last element point to the old top.
> > +		 */
> > +		new_head.top = first;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		last->next = old_head.top;
> > +
> > +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> > +					 (uint64_t *)&old_head,
> > +					 (uint64_t *)&new_head))
> > +			break;
> > +	}
> Minor comment, this can be a do-while loop (for ex: similar to the one in
> __rte_ring_move_prod_head)
> 

Sure.

> > +
> > +	rte_atomic64_add(&lifo->len, num);
> > +}
> > +
> > +static __rte_always_inline void
> > +nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
> > +{
> > +	nb_lifo_push(lifo, elem, elem, 1);
> > +}
> > +
> > +static __rte_always_inline struct nb_lifo_elem *
> > +nb_lifo_pop(struct nb_lifo *lifo,
> > +	    unsigned int num,
> > +	    void **obj_table,
> > +	    struct nb_lifo_elem **last)
> > +{
> > +	struct nb_lifo_head old_head;
> > +
> > +	/* Reserve num elements, if available */
> > +	while (1) {
> > +		uint64_t len = rte_atomic64_read(&lifo->len);
> > +
> > +		/* Does the list contain enough elements? */
> > +		if (len < num)
> > +			return NULL;
> > +
> > +		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
> > +					len, len - num))
> > +			break;
> > +	}
> > +
> > +	/* Pop num elements */
> > +	while (1) {
> > +		struct nb_lifo_head new_head;
> > +		struct nb_lifo_elem *tmp;
> > +		unsigned int i;
> > +
> > +		old_head = lifo->head;
> > +
> > +		tmp = old_head.top;
> > +
> > +		/* Traverse the list to find the new head. A next pointer will
> > +		 * either point to another element or NULL; if a thread
> > +		 * encounters a pointer that has already been popped, the CAS
> > +		 * will fail.
> > +		 */
> > +		for (i = 0; i < num && tmp != NULL; i++) {
> > +			if (obj_table)
> This 'if' check can be outside the for loop. May be use RTE_ASSERT in the
> beginning of the function?
> 

A NULL obj_table pointer isn't an error -- nb_stack_enqueue() calls this function with NULL because it doesn't need the popped elements added to a table. When the compiler inlines this function into nb_stack_enqueue(), it can use constant propagation to optimize away the if-statement.

I don't think that's possible for the other caller, nb_stack_dequeue, though, unless we add a NULL pointer check to the beginning of that function. Then it would be guaranteed that obj_table is non-NULL, and the compiler can optimize away the if-statement. I'll add that.

> > +				obj_table[i] = tmp->data;
> > +			if (last)
> > +				*last = tmp;
> > +			tmp = tmp->next;
> > +		}
> > +
> > +		/* If NULL was encountered, the list was modified while
> > +		 * traversing it. Retry.
> > +		 */
> > +		if (i != num)
> > +			continue;
> > +
> > +		new_head.top = tmp;
> > +		new_head.cnt = old_head.cnt + 1;
> > +
> > +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> > +					 (uint64_t *)&old_head,
> > +					 (uint64_t *)&new_head))
> > +			break;
> > +	}
> > +
> > +	return old_head.top;
> > +}
> > +
> > +#endif /* _NB_LIFO_H_ */
> > diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > new file mode 100644
> > index 000000000..1818a2cfa
> > --- /dev/null
> > +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > @@ -0,0 +1,125 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright(c) 2019 Intel Corporation
> > + */
> > +
> > +#include <stdio.h>
> > +#include <rte_mempool.h>
> > +#include <rte_malloc.h>
> > +
> > +#include "nb_lifo.h"
> > +
> > +struct rte_mempool_nb_stack {
> > +	uint64_t size;
> > +	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers  */
> > +	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
> > +};
> > +
> > +static int
> > +nb_stack_alloc(struct rte_mempool *mp)
> > +{
> > +	struct rte_mempool_nb_stack *s;
> > +	struct nb_lifo_elem *elems;
> > +	unsigned int n = mp->size;
> > +	unsigned int size, i;
> > +
> > +	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
> IMO, the allocation of the stack elements can be moved under nb_lifo_init API,
> it would make the nb stack code modular.
> 

(see below)

> > +
> > +	/* Allocate our local memory structure */
> > +	s = rte_zmalloc_socket("mempool-nb_stack",
> > +			       size,
> > +			       RTE_CACHE_LINE_SIZE,
> > +			       mp->socket_id);
> > +	if (s == NULL) {
> > +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
> > +		return -ENOMEM;
> > +	}
> > +
> > +	s->size = n;
> > +
> > +	nb_lifo_init(&s->used_lifo);
> > +	nb_lifo_init(&s->free_lifo);
> > +
> > +	elems = (struct nb_lifo_elem *)&s[1];
> > +	for (i = 0; i < n; i++)
> > +		nb_lifo_push_single(&s->free_lifo, &elems[i]);
> This also can be added to nb_lifo_init API.
> 

Sure, good suggestions. I'll address this.

Appreciate the feedback!

Thanks,
Gage

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-18  5:27           ` Honnappa Nagarahalli
@ 2019-01-18 22:01             ` Eads, Gage
  2019-01-22 20:30               ` Honnappa Nagarahalli
  0 siblings, 1 reply; 43+ messages in thread
From: Eads, Gage @ 2019-01-18 22:01 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin, nd, nd



> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Thursday, January 17, 2019 11:28 PM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> > > >
> > > > This operation can be used for non-blocking algorithms, such as a
> > > > non- blocking stack or ring.
> > > >
> > > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > > ---
> > > >  .../common/include/arch/x86/rte_atomic_64.h        | 22
> > > > ++++++++++++++++++++++
> > > >  1 file changed, 22 insertions(+)
> > > >
> > > > diff --git
> > > > a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > > b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > > index fd2ec9c53..34c2addf8 100644
> > > > --- a/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > > +++ b/lib/librte_eal/common/include/arch/x86/rte_atomic_64.h
> > > Since this is a 128b operation should there be a new file created
> > > with the name rte_atomic_128.h?
> > >
> > > > @@ -34,6 +34,7 @@
> > > >  /*
> > > >   * Inspired from FreeBSD src/sys/amd64/include/atomic.h
> > > >   * Copyright (c) 1998 Doug Rabson
> > > > + * Copyright (c) 2019 Intel Corporation
> > > >   * All rights reserved.
> > > >   */
> > > >
> > > > @@ -208,4 +209,25 @@ static inline void
> > > > rte_atomic64_clear(rte_atomic64_t
> > > > *v)  }  #endif
> > > >
> > > > +static inline int
> > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > +uint64_t
> > > > +*src) {
> > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > should be pointers to 128b (__int128)? Or we could define our own
> > > data
> > type.
> >
> > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > __int128 originally because it fails to compile with -pedantic, but on
> > second thought (and with your suggestion of a separate data type), we
> > can resolve that with this typedef:
> >
> > typedef struct {
> >         RTE_STD_C11 __int128 val;
> > } rte_int128_t;
> ok
> 
> >
> > > Since, it is a new API, can we define it with memory orderings which
> > > will be more conducive to relaxed memory ordering based architectures?
> > > You can refer to [1] and [2] for guidance.
> >
> > I certainly see the value in controlling the operation's memory
> > ordering, like in the __atomic intrinsics, but I'm not sure this
> > patchset is the right place to address that. I see that work going a couple
> ways:
> > 1. Expand the existing rte_atomicN_* interfaces with additional
> > arguments. In that case, I'd prefer this be done in a separate
> > patchset that addresses all the atomic operations, not just cmpset, so
> > the interface changes are chosen according to the needs of the full
> > set of atomic operations. If this approach is taken then there's no
> > need to solve this while rte_atomic128_cmpset is experimental, since all the
> other functions are non-experimental anyway.
> >
> > - Or -
> >
> > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > strongly ordered behavior), and instead create new versions of them
> > that take additional arguments. In this case, we can implement
> > rte_atomic128_cmpset() as is and create a more flexible version in a later
> patchset.
> >
> > Either way, I think the current interface (w.r.t. memory ordering
> > options) can work and still leaves us in a good position for future
> changes/improvements.
> >
> I do not see the need to modify/extend the existing rte_atomicN_* APIs as the
> corresponding __atomic intrinsics serve as replacements. I expect that at some
> point, DPDK code base will not be using rte_atomicN_* APIs.
> However, __atomic intrinsics do not support 128b wide parameters. Hence

I don't think that's correct. From the GCC docs:

"16-byte integral types are also allowed if `__int128' (see __int128) is supported by the architecture."

This works with x86-64 -- I assume aarch64 also, but haven't confirmed.

Source: https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-Builtins.html

> DPDK needs to write its own. Since this is the first API in that regard, I prefer that
> we start with a signature that resembles __atomic intrinsics which have been
> proven to provide best flexibility for all the platforms supported by DPDK.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-18  5:05         ` Honnappa Nagarahalli
  2019-01-18 20:09           ` Eads, Gage
@ 2019-01-19  0:00           ` Eads, Gage
  2019-01-19  0:15             ` Thomas Monjalon
  1 sibling, 1 reply; 43+ messages in thread
From: Eads, Gage @ 2019-01-19  0:00 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	Gavin Hu (Arm Technology China),
	nd, nd



> -----Original Message-----
> From: Eads, Gage
> Sent: Friday, January 18, 2019 2:10 PM
> To: 'Honnappa Nagarahalli' <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; nd <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking
> stack mempool
> 
> 
> 
> > -----Original Message-----
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Thursday, January 17, 2019 11:05 PM
> > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; Gavin Hu (Arm Technology China)
> > <Gavin.Hu@arm.com>; nd <nd@arm.com>; Honnappa Nagarahalli
> > <Honnappa.Nagarahalli@arm.com>; nd <nd@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add
> > non-blocking stack mempool
> >
> > Hi Gage,
> >      Thank you for your contribution on non-blocking data structures.
> > I think they are important to extend DPDK into additional use cases.
> >
> 
> Glad to hear it. Be sure to check out my non-blocking ring patchset as well, if
> you haven't already: http://mails.dpdk.org/archives/dev/2019-
> January/123774.html
> 
> > I am wondering if it makes sense to decouple the NB stack data
> > structure from mempool driver (similar to rte_ring)? I see that stack
> > based mempool implements the stack data structure in the driver. But,
> > NB stack might not be such a trivial data structure. It might be
> > useful for the applications or other use cases as well.
> >
> 
> I agree -- and you're not the first to suggest this :).
> 
> I'm going to defer that work to a later patchset; creating a new lib/ directory
> requires tech board approval (IIRC), which would unnecessarily slow down this
> mempool handler from getting merged.
> 
> > I also suggest that we use C11 __atomic_xxx APIs for memory
> > operations. The rte_atomic64_xxx APIs use __sync_xxx APIs which do not
> > provide the capability to express memory orderings.
> >
> 
> Ok, I will add those (dependent on RTE_USE_C11_MEM_MODEL).
> 
> > Please find few comments inline.
> >
> > >
> > > This commit adds support for non-blocking (linked list based) stack
> > > mempool handler. The stack uses a 128-bit compare-and-swap
> > > instruction, and thus is limited to x86_64. The 128-bit CAS
> > > atomically updates the stack top pointer and a modification counter,
> > > which protects against the ABA problem.
> > >
> > > In mempool_perf_autotest the lock-based stack outperforms the non-
> > > blocking handler*, however:
> > > - For applications with preemptible pthreads, a lock-based stack's
> > >   worst-case performance (i.e. one thread being preempted while
> > >   holding the spinlock) is much worse than the non-blocking stack's.
> > > - Using per-thread mempool caches will largely mitigate the performance
> > >   difference.
> > >
> > > *Test setup: x86_64 build with default config, dual-socket Xeon
> > > E5-2699 v4, running on isolcpus cores with a tickless scheduler. The
> > > lock-based stack's rate_persec was 1x-3.5x the non-blocking stack's.
> > >
> > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
> > > ---
> > >  MAINTAINERS                                        |   4 +
> > >  config/common_base                                 |   1 +
> > >  doc/guides/prog_guide/env_abstraction_layer.rst    |   5 +
> > >  drivers/mempool/Makefile                           |   3 +
> > >  drivers/mempool/meson.build                        |   3 +-
> > >  drivers/mempool/nb_stack/Makefile                  |  23 ++++
> > >  drivers/mempool/nb_stack/meson.build               |   6 +
> > >  drivers/mempool/nb_stack/nb_lifo.h                 | 147
> > > +++++++++++++++++++++
> > >  drivers/mempool/nb_stack/rte_mempool_nb_stack.c    | 125
> > > ++++++++++++++++++
> > >  .../nb_stack/rte_mempool_nb_stack_version.map      |   4 +
> > >  mk/rte.app.mk                                      |   7 +-
> > >  11 files changed, 325 insertions(+), 3 deletions(-)  create mode
> > > 100644 drivers/mempool/nb_stack/Makefile  create mode 100644
> > > drivers/mempool/nb_stack/meson.build
> > >  create mode 100644 drivers/mempool/nb_stack/nb_lifo.h
> > >  create mode 100644 drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > >  create mode 100644
> > > drivers/mempool/nb_stack/rte_mempool_nb_stack_version.map
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS index 470f36b9c..5519d3323
> > > 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -416,6 +416,10 @@ M: Artem V. Andreev
> > > <artem.andreev@oktetlabs.ru>
> > >  M: Andrew Rybchenko <arybchenko@solarflare.com>
> > >  F: drivers/mempool/bucket/
> > >
> > > +Non-blocking stack memory pool
> > > +M: Gage Eads <gage.eads@intel.com>
> > > +F: drivers/mempool/nb_stack/
> > > +
> > >
> > >  Bus Drivers
> > >  -----------
> > > diff --git a/config/common_base b/config/common_base index
> > > 964a6956e..8a51f36b1 100644
> > > --- a/config/common_base
> > > +++ b/config/common_base
> > > @@ -726,6 +726,7 @@ CONFIG_RTE_LIBRTE_MEMPOOL_DEBUG=n  #
> > > CONFIG_RTE_DRIVER_MEMPOOL_BUCKET=y
> > >  CONFIG_RTE_DRIVER_MEMPOOL_BUCKET_SIZE_KB=64
> > > +CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK=y
> > >  CONFIG_RTE_DRIVER_MEMPOOL_RING=y
> > >  CONFIG_RTE_DRIVER_MEMPOOL_STACK=y
> > >
> > > diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst
> > > b/doc/guides/prog_guide/env_abstraction_layer.rst
> > > index 929d76dba..9497b879c 100644
> > > --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> > > +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> > > @@ -541,6 +541,11 @@ Known Issues
> > >
> > >    5. It MUST not be used by multi-producer/consumer pthreads, whose
> > > scheduling policies are SCHED_FIFO or SCHED_RR.
> > >
> > > +  Alternatively, x86_64 applications can use the non-blocking stack
> > > mempool handler. When considering this handler, note that:
> > > +
> > > +  - it is limited to the x86_64 platform, because it uses an
> > > + instruction (16-
> > > byte compare-and-swap) that is not available on other platforms.
> >
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ Arm architecture supports similar
> > instructions. I suggest to simplify this statement to indicate that 'nb_stack
> feature is available for x86_64 platforms currently'
> >
> 
> Will do.
> 
> > > +  - it has worse average-case performance than the non-preemptive
> > > rte_ring, but software caching (e.g. the mempool cache) can mitigate
> > > this by reducing the number of handler operations.
> > > +
> > >  + rte_timer
> > >
> > >    Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed.
> > > However, resetting/stopping the timer from a non-EAL pthread is allowed.
> > > diff --git a/drivers/mempool/Makefile b/drivers/mempool/Makefile
> > > index
> > > 28c2e8360..895cf8a34 100644
> > > --- a/drivers/mempool/Makefile
> > > +++ b/drivers/mempool/Makefile
> > > @@ -10,6 +10,9 @@ endif
> > >  ifeq ($(CONFIG_RTE_EAL_VFIO)$(CONFIG_RTE_LIBRTE_FSLMC_BUS),yy)
> > >  DIRS-$(CONFIG_RTE_LIBRTE_DPAA2_MEMPOOL) += dpaa2  endif
> > > +ifeq ($(CONFIG_RTE_ARCH_X86_64),y)
> > > +DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) += nb_stack endif
> > >  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_RING) += ring
> > >  DIRS-$(CONFIG_RTE_DRIVER_MEMPOOL_STACK) += stack
> > >  DIRS-$(CONFIG_RTE_LIBRTE_OCTEONTX_MEMPOOL) += octeontx diff --git
> > > a/drivers/mempool/meson.build b/drivers/mempool/meson.build index
> > > 4527d9806..220cfaf63 100644
> > > --- a/drivers/mempool/meson.build
> > > +++ b/drivers/mempool/meson.build
> > > @@ -1,7 +1,8 @@
> > >  # SPDX-License-Identifier: BSD-3-Clause  # Copyright(c) 2017 Intel
> > > Corporation
> > >
> > > -drivers = ['bucket', 'dpaa', 'dpaa2', 'octeontx', 'ring', 'stack']
> > > +drivers = ['bucket', 'dpaa', 'dpaa2', 'nb_stack', 'octeontx',
> > > +'ring', 'stack']
> > > +
> > >  std_deps = ['mempool']
> > >  config_flag_fmt = 'RTE_LIBRTE_@0@_MEMPOOL'
> > >  driver_name_fmt = 'rte_mempool_@0@'
> > > diff --git a/drivers/mempool/nb_stack/Makefile
> > > b/drivers/mempool/nb_stack/Makefile
> > > new file mode 100644
> > > index 000000000..318b18283
> > > --- /dev/null
> > > +++ b/drivers/mempool/nb_stack/Makefile
> > > @@ -0,0 +1,23 @@
> > > +# SPDX-License-Identifier: BSD-3-Clause # Copyright(c) 2019 Intel
> > > +Corporation
> > > +
> > > +include $(RTE_SDK)/mk/rte.vars.mk
> > > +
> > > +#
> > > +# library name
> > > +#
> > > +LIB = librte_mempool_nb_stack.a
> > > +
> > > +CFLAGS += -O3
> > > +CFLAGS += $(WERROR_FLAGS)
> > > +
> > > +# Headers
> > > +LDLIBS += -lrte_eal -lrte_mempool
> > > +
> > > +EXPORT_MAP := rte_mempool_nb_stack_version.map
> > > +
> > > +LIBABIVER := 1
> > > +
> > > +SRCS-$(CONFIG_RTE_DRIVER_MEMPOOL_NB_STACK) +=
> > > rte_mempool_nb_stack.c
> > > +
> > > +include $(RTE_SDK)/mk/rte.lib.mk
> > > diff --git a/drivers/mempool/nb_stack/meson.build
> > > b/drivers/mempool/nb_stack/meson.build
> > > new file mode 100644
> > > index 000000000..7dec72242
> > > --- /dev/null
> > > +++ b/drivers/mempool/nb_stack/meson.build
> > > @@ -0,0 +1,6 @@
> > > +# SPDX-License-Identifier: BSD-3-Clause # Copyright(c) 2019 Intel
> > > +Corporation
> > > +
> > > +build = dpdk_conf.has('RTE_ARCH_X86_64')
> > > +
> > > +sources = files('rte_mempool_nb_stack.c')
> > > diff --git a/drivers/mempool/nb_stack/nb_lifo.h
> > > b/drivers/mempool/nb_stack/nb_lifo.h
> > > new file mode 100644
> > > index 000000000..ad4a3401f
> > > --- /dev/null
> > > +++ b/drivers/mempool/nb_stack/nb_lifo.h
> > > @@ -0,0 +1,147 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2019 Intel Corporation  */
> > > +
> > > +#ifndef _NB_LIFO_H_
> > > +#define _NB_LIFO_H_
> > > +
> > > +struct nb_lifo_elem {
> > > +	void *data;
> > > +	struct nb_lifo_elem *next;
> > > +};
> > > +
> > > +struct nb_lifo_head {
> > > +	struct nb_lifo_elem *top; /**< Stack top */
> > > +	uint64_t cnt; /**< Modification counter */
> > > +};
> > Minor comment, mentioning ABA problem in the comments for 'cnt' will
> > be helpful.
> >
> 
> Sure.
> 
> > > +
> > > +struct nb_lifo {
> > > +	volatile struct nb_lifo_head head __rte_aligned(16);
> > > +	rte_atomic64_t len;
> > > +} __rte_cache_aligned;
> > > +
> > > +static __rte_always_inline void
> > > +nb_lifo_init(struct nb_lifo *lifo)
> > > +{
> > > +	memset(lifo, 0, sizeof(*lifo));
> > > +	rte_atomic64_set(&lifo->len, 0);
> > > +}
> > > +
> > > +static __rte_always_inline unsigned int
> > > +nb_lifo_len(struct nb_lifo *lifo)
> > > +{
> > > +	/* nb_lifo_push() and nb_lifo_pop() do not update the list's contents
> > > +	 * and lifo->len atomically, which can cause the list to appear shorter
> > > +	 * than it actually is if this function is called while other threads
> > > +	 * are modifying the list.
> > > +	 *
> > > +	 * However, given the inherently approximate nature of the get_count
> > > +	 * callback -- even if the list and its size were updated atomically,
> > > +	 * the size could change between when get_count executes and when the
> > > +	 * value is returned to the caller -- this is acceptable.
> > > +	 *
> > > +	 * The lifo->len updates are placed such that the list may appear to
> > > +	 * have fewer elements than it does, but will never appear to have more
> > > +	 * elements. If the mempool is near-empty to the point that this is a
> > > +	 * concern, the user should consider increasing the mempool size.
> > > +	 */
> > > +	return (unsigned int)rte_atomic64_read(&lifo->len);
> > > +}
> > > +
> > > +static __rte_always_inline void
> > > +nb_lifo_push(struct nb_lifo *lifo,
> > > +	     struct nb_lifo_elem *first,
> > > +	     struct nb_lifo_elem *last,
> > > +	     unsigned int num)
> > > +{
> > > +	while (1) {
> > > +		struct nb_lifo_head old_head, new_head;
> > > +
> > > +		old_head = lifo->head;
> > > +
> > > +		/* Swing the top pointer to the first element in the list and
> > > +		 * make the last element point to the old top.
> > > +		 */
> > > +		new_head.top = first;
> > > +		new_head.cnt = old_head.cnt + 1;
> > > +
> > > +		last->next = old_head.top;
> > > +
> > > +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> > > +					 (uint64_t *)&old_head,
> > > +					 (uint64_t *)&new_head))
> > > +			break;
> > > +	}
> > Minor comment, this can be a do-while loop (for ex: similar to the one
> > in
> > __rte_ring_move_prod_head)
> >
> 
> Sure.
> 
> > > +
> > > +	rte_atomic64_add(&lifo->len, num);
> > > +}
> > > +
> > > +static __rte_always_inline void
> > > +nb_lifo_push_single(struct nb_lifo *lifo, struct nb_lifo_elem *elem)
> > > +{
> > > +	nb_lifo_push(lifo, elem, elem, 1);
> > > +}
> > > +
> > > +static __rte_always_inline struct nb_lifo_elem *
> > > +nb_lifo_pop(struct nb_lifo *lifo,
> > > +	    unsigned int num,
> > > +	    void **obj_table,
> > > +	    struct nb_lifo_elem **last)
> > > +{
> > > +	struct nb_lifo_head old_head;
> > > +
> > > +	/* Reserve num elements, if available */
> > > +	while (1) {
> > > +		uint64_t len = rte_atomic64_read(&lifo->len);
> > > +
> > > +		/* Does the list contain enough elements? */
> > > +		if (len < num)
> > > +			return NULL;
> > > +
> > > +		if (rte_atomic64_cmpset((volatile uint64_t *)&lifo->len,
> > > +					len, len - num))
> > > +			break;
> > > +	}
> > > +
> > > +	/* Pop num elements */
> > > +	while (1) {
> > > +		struct nb_lifo_head new_head;
> > > +		struct nb_lifo_elem *tmp;
> > > +		unsigned int i;
> > > +
> > > +		old_head = lifo->head;
> > > +
> > > +		tmp = old_head.top;
> > > +
> > > +		/* Traverse the list to find the new head. A next pointer will
> > > +		 * either point to another element or NULL; if a thread
> > > +		 * encounters a pointer that has already been popped, the CAS
> > > +		 * will fail.
> > > +		 */
> > > +		for (i = 0; i < num && tmp != NULL; i++) {
> > > +			if (obj_table)
> > This 'if' check can be outside the for loop. May be use RTE_ASSERT in
> > the beginning of the function?
> >
> 
> A NULL obj_table pointer isn't an error -- nb_stack_enqueue() calls this function
> with NULL because it doesn't need the popped elements added to a table. When
> the compiler inlines this function into nb_stack_enqueue(), it can use constant
> propagation to optimize away the if-statement.
> 
> I don't think that's possible for the other caller, nb_stack_dequeue, though,
> unless we add a NULL pointer check to the beginning of that function. Then it
> would be guaranteed that obj_table is non-NULL, and the compiler can optimize
> away the if-statement. I'll add that.
> 
> > > +				obj_table[i] = tmp->data;
> > > +			if (last)
> > > +				*last = tmp;
> > > +			tmp = tmp->next;
> > > +		}
> > > +
> > > +		/* If NULL was encountered, the list was modified while
> > > +		 * traversing it. Retry.
> > > +		 */
> > > +		if (i != num)
> > > +			continue;
> > > +
> > > +		new_head.top = tmp;
> > > +		new_head.cnt = old_head.cnt + 1;
> > > +
> > > +		if (rte_atomic128_cmpset((volatile uint64_t *)&lifo->head,
> > > +					 (uint64_t *)&old_head,
> > > +					 (uint64_t *)&new_head))
> > > +			break;
> > > +	}
> > > +
> > > +	return old_head.top;
> > > +}
> > > +
> > > +#endif /* _NB_LIFO_H_ */
> > > diff --git a/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > > b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > > new file mode 100644
> > > index 000000000..1818a2cfa
> > > --- /dev/null
> > > +++ b/drivers/mempool/nb_stack/rte_mempool_nb_stack.c
> > > @@ -0,0 +1,125 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright(c) 2019 Intel Corporation  */
> > > +
> > > +#include <stdio.h>
> > > +#include <rte_mempool.h>
> > > +#include <rte_malloc.h>
> > > +
> > > +#include "nb_lifo.h"
> > > +
> > > +struct rte_mempool_nb_stack {
> > > +	uint64_t size;
> > > +	struct nb_lifo used_lifo; /**< LIFO containing mempool pointers  */
> > > +	struct nb_lifo free_lifo; /**< LIFO containing unused LIFO elements */
> > > +};
> > > +
> > > +static int
> > > +nb_stack_alloc(struct rte_mempool *mp)
> > > +{
> > > +	struct rte_mempool_nb_stack *s;
> > > +	struct nb_lifo_elem *elems;
> > > +	unsigned int n = mp->size;
> > > +	unsigned int size, i;
> > > +
> > > +	size = sizeof(*s) + n * sizeof(struct nb_lifo_elem);
> > IMO, the allocation of the stack elements can be moved under
> > nb_lifo_init API, it would make the nb stack code modular.
> >
> 
> (see below)
> 
> > > +
> > > +	/* Allocate our local memory structure */
> > > +	s = rte_zmalloc_socket("mempool-nb_stack",
> > > +			       size,
> > > +			       RTE_CACHE_LINE_SIZE,
> > > +			       mp->socket_id);
> > > +	if (s == NULL) {
> > > +		RTE_LOG(ERR, MEMPOOL, "Cannot allocate nb_stack!\n");
> > > +		return -ENOMEM;
> > > +	}
> > > +
> > > +	s->size = n;
> > > +
> > > +	nb_lifo_init(&s->used_lifo);
> > > +	nb_lifo_init(&s->free_lifo);
> > > +
> > > +	elems = (struct nb_lifo_elem *)&s[1];
> > > +	for (i = 0; i < n; i++)
> > > +		nb_lifo_push_single(&s->free_lifo, &elems[i]);
> > This also can be added to nb_lifo_init API.
> >
> 
> Sure, good suggestions. I'll address this.
> 

On second thought, moving this push code into nb_lifo_init() doesn't work, since we push the elements onto one LIFO but not the other.

I'll think on it, but I'm not going to spend too many cycles -- modularizing nb_lifo can be deferred to the patchset that moves it to a separate library.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-19  0:00           ` Eads, Gage
@ 2019-01-19  0:15             ` Thomas Monjalon
  2019-01-22 18:24               ` Eads, Gage
  0 siblings, 1 reply; 43+ messages in thread
From: Thomas Monjalon @ 2019-01-19  0:15 UTC (permalink / raw)
  To: Eads, Gage
  Cc: dev, Honnappa Nagarahalli, olivier.matz, arybchenko, Richardson,
	Bruce, Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	nd

19/01/2019 01:00, Eads, Gage:
> > > I am wondering if it makes sense to decouple the NB stack data
> > > structure from mempool driver (similar to rte_ring)? I see that stack
> > > based mempool implements the stack data structure in the driver. But,
> > > NB stack might not be such a trivial data structure. It might be
> > > useful for the applications or other use cases as well.
> > >
> > 
> > I agree -- and you're not the first to suggest this :).
> > 
> > I'm going to defer that work to a later patchset; creating a new lib/ directory
> > requires tech board approval (IIRC), which would unnecessarily slow down this
> > mempool handler from getting merged.

You have time. This patch cannot go in 19.02.
You just need to be ready for 19.05-rc1 (more than 2 months).

[...]
> modularizing nb_lifo can be deferred to the patchset that moves it to a separate library.

Please introduce it in the right place from the beginning.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool
  2019-01-19  0:15             ` Thomas Monjalon
@ 2019-01-22 18:24               ` Eads, Gage
  0 siblings, 0 replies; 43+ messages in thread
From: Eads, Gage @ 2019-01-22 18:24 UTC (permalink / raw)
  To: Thomas Monjalon
  Cc: dev, Honnappa Nagarahalli, olivier.matz, arybchenko, Richardson,
	Bruce, Ananyev, Konstantin, Gavin Hu (Arm Technology China),
	nd



> -----Original Message-----
> From: Thomas Monjalon [mailto:thomas@monjalon.net]
> Sent: Friday, January 18, 2019 6:15 PM
> To: Eads, Gage <gage.eads@intel.com>
> Cc: dev@dpdk.org; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; nd <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking
> stack mempool
> 
> 19/01/2019 01:00, Eads, Gage:
> > > > I am wondering if it makes sense to decouple the NB stack data
> > > > structure from mempool driver (similar to rte_ring)? I see that
> > > > stack based mempool implements the stack data structure in the
> > > > driver. But, NB stack might not be such a trivial data structure.
> > > > It might be useful for the applications or other use cases as well.
> > > >
> > >
> > > I agree -- and you're not the first to suggest this :).
> > >
> > > I'm going to defer that work to a later patchset; creating a new
> > > lib/ directory requires tech board approval (IIRC), which would
> > > unnecessarily slow down this mempool handler from getting merged.
> 
> You have time. This patch cannot go in 19.02.
> You just need to be ready for 19.05-rc1 (more than 2 months).
> 

Understood; I was concerned the process could take longer. The code itself isn't too complicated, but I'm not sure how much time it'll take to reach consensus on the interface. On the other hand, it may go quickly if we model it after the ring API -- my current thinking -- which folks seem to generally like.

Anyway, I'll rework this into a standalone stack library.

Thanks,
Gage

> [...]
> > modularizing nb_lifo can be deferred to the patchset that moves it to a
> separate library.
> 
> Please introduce it in the right place from the beginning.
> 

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-18 22:01             ` Eads, Gage
@ 2019-01-22 20:30               ` Honnappa Nagarahalli
  2019-01-22 22:25                 ` Eads, Gage
  0 siblings, 1 reply; 43+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-22 20:30 UTC (permalink / raw)
  To: Eads, Gage, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	nd, chaozhu, jerinj, hemant.agrawal, nd

Added other platform owners.

> > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > *v)  }  #endif
> > > > >
> > > > > +static inline int
> > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > > +uint64_t
> > > > > +*src) {
> > > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > > should be pointers to 128b (__int128)? Or we could define our own
> > > > data
> > > type.
> > >
> > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > __int128 originally because it fails to compile with -pedantic, but
> > > on second thought (and with your suggestion of a separate data
> > > type), we can resolve that with this typedef:
> > >
> > > typedef struct {
> > >         RTE_STD_C11 __int128 val;
> > > } rte_int128_t;
> > ok
> >
> > >
> > > > Since, it is a new API, can we define it with memory orderings
> > > > which will be more conducive to relaxed memory ordering based
> architectures?
> > > > You can refer to [1] and [2] for guidance.
> > >
> > > I certainly see the value in controlling the operation's memory
> > > ordering, like in the __atomic intrinsics, but I'm not sure this
> > > patchset is the right place to address that. I see that work going a
> > > couple
> > ways:
> > > 1. Expand the existing rte_atomicN_* interfaces with additional
> > > arguments. In that case, I'd prefer this be done in a separate
> > > patchset that addresses all the atomic operations, not just cmpset,
> > > so the interface changes are chosen according to the needs of the
> > > full set of atomic operations. If this approach is taken then
> > > there's no need to solve this while rte_atomic128_cmpset is
> > > experimental, since all the
> > other functions are non-experimental anyway.
> > >
> > > - Or -
> > >
> > > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > > strongly ordered behavior), and instead create new versions of them
> > > that take additional arguments. In this case, we can implement
> > > rte_atomic128_cmpset() as is and create a more flexible version in a
> > > later
> > patchset.
> > >
> > > Either way, I think the current interface (w.r.t. memory ordering
> > > options) can work and still leaves us in a good position for future
> > changes/improvements.
> > >
> > I do not see the need to modify/extend the existing rte_atomicN_* APIs
> > as the corresponding __atomic intrinsics serve as replacements. I
> > expect that at some point, DPDK code base will not be using
> rte_atomicN_* APIs.
> > However, __atomic intrinsics do not support 128b wide parameters.
> > Hence
> 
> I don't think that's correct. From the GCC docs:
> 
> "16-byte integral types are also allowed if `__int128' (see __int128) is
> supported by the architecture."
> 
> This works with x86-64 -- I assume aarch64 also, but haven't confirmed.
> 
> Source: https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> Builtins.html
(The following is based on my reading, not on experiments.) My understanding is that __atomic_load_n/store_n were implemented using a compare-and-swap HW instruction (due to the lack of HW 128b atomic load and store instructions). This introduced the side effect of a store/load, respectively, which the user does not expect. As suggested in [1], it looks like GCC delegated the implementation to libatomic, which (it seems) uses locks to implement the 128b __atomic intrinsics (this needs to be verified).

If __atomic intrinsics, for any of the supported platforms, do not have an optimal implementation, I prefer to add a DPDK API as an abstraction. A given platform can choose to implement such an API using __atomic intrinsics if it wants. The DPDK API can be similar to __atomic_compare_exchange_n.

[1] https://patchwork.ozlabs.org/patch/721686/
[2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html

> 
> > DPDK needs to write its own. Since this is the first API in that
> > regard, I prefer that we start with a signature that resembles
> > __atomic intrinsics which have been proven to provide best flexibility for
> all the platforms supported by DPDK.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-22 20:30               ` Honnappa Nagarahalli
@ 2019-01-22 22:25                 ` Eads, Gage
  2019-01-24  5:21                   ` Honnappa Nagarahalli
  0 siblings, 1 reply; 43+ messages in thread
From: Eads, Gage @ 2019-01-22 22:25 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	nd, chaozhu, jerinj, hemant.agrawal, nd



> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Tuesday, January 22, 2019 2:31 PM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; nd <nd@arm.com>;
> chaozhu@linux.vnet.ibm.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> Added other platform owners.
> 
> > > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > > *v)  }  #endif
> > > > > >
> > > > > > +static inline int
> > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > > > +uint64_t
> > > > > > +*src) {
> > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > > > should be pointers to 128b (__int128)? Or we could define our
> > > > > own data
> > > > type.
> > > >
> > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > __int128 originally because it fails to compile with -pedantic,
> > > > but on second thought (and with your suggestion of a separate data
> > > > type), we can resolve that with this typedef:
> > > >
> > > > typedef struct {
> > > >         RTE_STD_C11 __int128 val;
> > > > } rte_int128_t;
> > > ok
> > >
> > > >
> > > > > Since, it is a new API, can we define it with memory orderings
> > > > > which will be more conducive to relaxed memory ordering based
> > architectures?
> > > > > You can refer to [1] and [2] for guidance.
> > > >
> > > > I certainly see the value in controlling the operation's memory
> > > > ordering, like in the __atomic intrinsics, but I'm not sure this
> > > > patchset is the right place to address that. I see that work going
> > > > a couple
> > > ways:
> > > > 1. Expand the existing rte_atomicN_* interfaces with additional
> > > > arguments. In that case, I'd prefer this be done in a separate
> > > > patchset that addresses all the atomic operations, not just
> > > > cmpset, so the interface changes are chosen according to the needs
> > > > of the full set of atomic operations. If this approach is taken
> > > > then there's no need to solve this while rte_atomic128_cmpset is
> > > > experimental, since all the
> > > other functions are non-experimental anyway.
> > > >
> > > > - Or -
> > > >
> > > > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > > > strongly ordered behavior), and instead create new versions of
> > > > them that take additional arguments. In this case, we can
> > > > implement
> > > > rte_atomic128_cmpset() as is and create a more flexible version in
> > > > a later
> > > patchset.
> > > >
> > > > Either way, I think the current interface (w.r.t. memory ordering
> > > > options) can work and still leaves us in a good position for
> > > > future
> > > changes/improvements.
> > > >
> > > I do not see the need to modify/extend the existing rte_atomicN_*
> > > APIs as the corresponding __atomic intrinsics serve as replacements.
> > > I expect that at some point, DPDK code base will not be using
> > rte_atomicN_* APIs.
> > > However, __atomic intrinsics do not support 128b wide parameters.
> > > Hence
> >
> > I don't think that's correct. From the GCC docs:
> >
> > "16-byte integral types are also allowed if `__int128' (see __int128)
> > is supported by the architecture."
> >
> > This works with x86-64 -- I assume aarch64 also, but haven't confirmed.
> >
> > Source: https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> > Builtins.html
> (The following is based on my reading, not on experiments.) My understanding
> is that __atomic_load_n/store_n were implemented using a compare-and-swap
> HW instruction (due to the lack of HW 128b atomic load and store
> instructions). This introduced the side effect of a store/load,
> respectively, which the user does not expect. As suggested in [1], it looks
> like GCC delegated the implementation to libatomic, which (it seems) uses
> locks to implement the 128b __atomic intrinsics (this needs to be verified).
>
> If __atomic intrinsics, for any of the supported platforms, do not have an
> optimal implementation, I prefer to add a DPDK API as an abstraction. A given
> platform can choose to implement such an API using __atomic intrinsics if it
> wants. The DPDK API can be similar to __atomic_compare_exchange_n.
> 

Certainly. From the linked discussions, I see how this would affect the design of (hypothetical functions) rte_atomic128_read() and rte_atomic128_set(), but I don't see anything that suggests (for the architectures being discussed) that __atomic_compare_exchange_n is suboptimal.

> [1] https://patchwork.ozlabs.org/patch/721686/
> [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> 
> >
> > > DPDK needs to write its own. Since this is the first API in that
> > > regard, I prefer that we start with a signature that resembles
> > > __atomic intrinsics which have been proven to provide best
> > > flexibility for
> > all the platforms supported by DPDK.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-22 22:25                 ` Eads, Gage
@ 2019-01-24  5:21                   ` Honnappa Nagarahalli
  2019-01-25 17:19                     ` Eads, Gage
  0 siblings, 1 reply; 43+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-24  5:21 UTC (permalink / raw)
  To: Eads, Gage, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	nd, chaozhu, jerinj, hemant.agrawal, nd

> >
> > Added other platform owners.
> >
> > > > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > > > *v)  }  #endif
> > > > > > >
> > > > > > > +static inline int
> > > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t *exp,
> > > > > > > +uint64_t
> > > > > > > +*src) {
> > > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > > > > should be pointers to 128b (__int128)? Or we could define our
> > > > > > own data
> > > > > type.
> > > > >
> > > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > > __int128 originally because it fails to compile with -pedantic,
> > > > > but on second thought (and with your suggestion of a separate
> > > > > data type), we can resolve that with this typedef:
> > > > >
> > > > > typedef struct {
> > > > >         RTE_STD_C11 __int128 val; } rte_int128_t;
> > > > ok
> > > >
> > > > >
> > > > > > Since, it is a new API, can we define it with memory orderings
> > > > > > which will be more conducive to relaxed memory ordering based
> > > architectures?
> > > > > > You can refer to [1] and [2] for guidance.
> > > > >
> > > > > I certainly see the value in controlling the operation's memory
> > > > > ordering, like in the __atomic intrinsics, but I'm not sure this
> > > > > patchset is the right place to address that. I see that work
> > > > > going a couple
> > > > ways:
> > > > > 1. Expand the existing rte_atomicN_* interfaces with additional
> > > > > arguments. In that case, I'd prefer this be done in a separate
> > > > > patchset that addresses all the atomic operations, not just
> > > > > cmpset, so the interface changes are chosen according to the
> > > > > needs of the full set of atomic operations. If this approach is
> > > > > taken then there's no need to solve this while
> > > > > rte_atomic128_cmpset is experimental, since all the
> > > > other functions are non-experimental anyway.
> > > > >
> > > > > - Or -
> > > > >
> > > > > 2. Don't modify the existing rte_atomicN_* interfaces (or their
> > > > > strongly ordered behavior), and instead create new versions of
> > > > > them that take additional arguments. In this case, we can
> > > > > implement
> > > > > rte_atomic128_cmpset() as is and create a more flexible version
> > > > > in a later
> > > > patchset.
> > > > >
> > > > > Either way, I think the current interface (w.r.t. memory
> > > > > ordering
> > > > > options) can work and still leaves us in a good position for
> > > > > future
> > > > changes/improvements.
> > > > >
> > > > I do not see the need to modify/extend the existing rte_atomicN_*
> > > > APIs as the corresponding __atomic intrinsics serve as replacements.
> > > > I expect that at some point, DPDK code base will not be using
> > > rte_atomicN_* APIs.
> > > > However, __atomic intrinsics do not support 128b wide parameters.
> > > > Hence
> > >
> > > I don't think that's correct. From the GCC docs:
> > >
> > > "16-byte integral types are also allowed if `__int128' (see
> > > __int128) is supported by the architecture."
> > >
> > > This works with x86-64 -- I assume aarch64 also, but haven't confirmed.
> > >
> > > Source:
> > > https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> > > Builtins.html
> > (The following is based on my reading, not on experiments.) My
> > understanding is that __atomic_load_n/store_n were implemented using a
> > compare-and-swap HW instruction (due to the lack of HW 128b atomic load
> > and store instructions). This introduced the side effect of a
> > store/load, respectively, which the user does not expect. As suggested
> > in [1], it looks like GCC delegated the implementation to libatomic,
> > which (it seems) uses locks to implement the 128b __atomic intrinsics
> > (this needs to be verified).
> >
> > If __atomic intrinsics, for any of the supported platforms, do not
> > have an optimal implementation, I prefer to add a DPDK API as an
> > abstraction. A given platform can choose to implement such an API
> > using __atomic intrinsics if it wants. The DPDK API can be similar to
> __atomic_compare_exchange_n.
> >
> 
> Certainly. From the linked discussions, I see how this would affect the design
> of (hypothetical functions) rte_atomic128_read() and rte_atomic128_set(),
> but I don't see anything that suggests (for the architectures being discussed)
> that __atomic_compare_exchange_n is suboptimal.
I wrote some code and generated the assembly to verify what is happening. On aarch64, this call is delegated to libatomic, and libatomic needs to be linked. In the generated assembly, it is clear that it uses locks (a pthread mutex) to provide atomicity. For 32b and 64b, the compiler generates the expected inline assembly. Both '__atomic_always_lock_free' and '__atomic_is_lock_free' return 0, indicating that the 128b __atomic intrinsics are not lock free. (gcc 8.2)

Out of curiosity, I also ran similar experiments on x86 (CPU E5-2660 v4). Even with the -mcx16 flag, the call is delegated to libatomic. I see the 'lock cmpxchg16b' in the generated assembly, but '__atomic_always_lock_free' and '__atomic_is_lock_free' return 0, indicating that the 128b __atomic intrinsics are NOT lock free with gcc 7.3.0. However, with gcc 5.4.0, '__atomic_is_lock_free' returns 1. I found more discussion at [3]. (However, I am not an expert on x86.)

[3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878

These problems do not exist with 32b and 64b.

> 
> > [1] https://patchwork.ozlabs.org/patch/721686/
> > [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> >
> > >
> > > > DPDK needs to write its own. Since this is the first API in that
> > > > regard, I prefer that we start with a signature that resembles
> > > > __atomic intrinsics which have been proven to provide best
> > > > flexibility for
> > > all the platforms supported by DPDK.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
  2019-01-24  5:21                   ` Honnappa Nagarahalli
@ 2019-01-25 17:19                     ` Eads, Gage
  0 siblings, 0 replies; 43+ messages in thread
From: Eads, Gage @ 2019-01-25 17:19 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	nd, chaozhu, jerinj, hemant.agrawal, nd



> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, January 23, 2019 11:22 PM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; nd <nd@arm.com>;
> chaozhu@linux.vnet.ibm.com; jerinj@marvell.com; hemant.agrawal@nxp.com;
> nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only)
> 
> > >
> > > Added other platform owners.
> > >
> > > > > > > > @@ -208,4 +209,25 @@ static inline void
> > > > > > > > rte_atomic64_clear(rte_atomic64_t
> > > > > > > > *v)  }  #endif
> > > > > > > >
> > > > > > > > +static inline int
> > > > > > > > +rte_atomic128_cmpset(volatile uint64_t *dst, uint64_t
> > > > > > > > +*exp, uint64_t
> > > > > > > > +*src) {
> > > > > > > The API name suggests it is a 128b operation. 'dst', 'exp' and 'src'
> > > > > > > should be pointers to 128b (__int128)? Or we could define
> > > > > > > our own data
> > > > > > type.
> > > > > >
> > > > > > I agree, I'm not a big fan of the 64b pointers here. I avoided
> > > > > > __int128 originally because it fails to compile with
> > > > > > -pedantic, but on second thought (and with your suggestion of
> > > > > > a separate data type), we can resolve that with this typedef:
> > > > > >
> > > > > > typedef struct {
> > > > > >         RTE_STD_C11 __int128 val; } rte_int128_t;
> > > > > ok
> > > > >
> > > > > >
> > > > > > > Since, it is a new API, can we define it with memory
> > > > > > > orderings which will be more conducive to relaxed memory
> > > > > > > ordering based
> > > > architectures?
> > > > > > > You can refer to [1] and [2] for guidance.
> > > > > >
> > > > > > I certainly see the value in controlling the operation's
> > > > > > memory ordering, like in the __atomic intrinsics, but I'm not
> > > > > > sure this patchset is the right place to address that. I see
> > > > > > that work going a couple
> > > > > ways:
> > > > > > 1. Expand the existing rte_atomicN_* interfaces with
> > > > > > additional arguments. In that case, I'd prefer this be done in
> > > > > > a separate patchset that addresses all the atomic operations,
> > > > > > not just cmpset, so the interface changes are chosen according
> > > > > > to the needs of the full set of atomic operations. If this
> > > > > > approach is taken then there's no need to solve this while
> > > > > > rte_atomic128_cmpset is experimental, since all the
> > > > > other functions are non-experimental anyway.
> > > > > >
> > > > > > - Or -
> > > > > >
> > > > > > 2. Don't modify the existing rte_atomicN_* interfaces (or
> > > > > > their strongly ordered behavior), and instead create new
> > > > > > versions of them that take additional arguments. In this case,
> > > > > > we can implement
> > > > > > rte_atomic128_cmpset() as is and create a more flexible
> > > > > > version in a later
> > > > > patchset.
> > > > > >
> > > > > > Either way, I think the current interface (w.r.t. memory
> > > > > > ordering
> > > > > > options) can work and still leaves us in a good position for
> > > > > > future
> > > > > changes/improvements.
> > > > > >
> > > > > I do not see the need to modify/extend the existing
> > > > > rte_atomicN_* APIs as the corresponding __atomic intrinsics serve as
> replacements.
> > > > > I expect that at some point, DPDK code base will not be using
> > > > rte_atomicN_* APIs.
> > > > > However, __atomic intrinsics do not support 128b wide parameters.
> > > > > Hence
> > > >
> > > > I don't think that's correct. From the GCC docs:
> > > >
> > > > "16-byte integral types are also allowed if `__int128' (see
> > > > __int128) is supported by the architecture."
> > > >
> > > > This works with x86-64 -- I assume aarch64 also, but haven't confirmed.
> > > >
> > > > Source:
> > > > https://gcc.gnu.org/onlinedocs/gcc-4.7.0/gcc/_005f_005fatomic-
> > > > Builtins.html
> > > (The following is based on my reading, not on experiments.) My
> > > understanding is that __atomic_load_n/store_n were implemented using
> > > a compare-and-swap HW instruction (due to the lack of HW 128b atomic
> > > load and store instructions). This introduced the side effect of a
> > > store/load, respectively, which the user does not expect. As
> > > suggested in [1], it looks like GCC delegated the implementation to
> > > libatomic, which (it seems) uses locks to implement the 128b __atomic
> > > intrinsics (this needs to be verified).
> > >
> > > If __atomic intrinsics, for any of the supported platforms, do not
> > > have an optimal implementation, I prefer to add a DPDK API as an
> > > abstraction. A given platform can choose to implement such an API
> > > using __atomic intrinsics if it wants. The DPDK API can be similar
> > > to
> > __atomic_compare_exchange_n.
> > >
> >
> > Certainly. From the linked discussions, I see how this would affect
> > the design of (hypothetical functions) rte_atomic128_read() and
> > rte_atomic128_set(), but I don't see anything that suggests (for the
> > architectures being discussed) that __atomic_compare_exchange_n is
> suboptimal.
> I wrote some code and generated the assembly to verify what is happening. On
> aarch64, this call is delegated to libatomic, and libatomic needs to be
> linked. In the generated assembly, it is clear that it uses locks (a pthread
> mutex) to provide atomicity. For 32b and 64b, the compiler generates the
> expected inline assembly. Both '__atomic_always_lock_free' and
> '__atomic_is_lock_free' return 0, indicating that the 128b __atomic
> intrinsics are not lock free. (gcc 8.2)
> 
> Out of curiosity, I also ran similar experiments on x86 (CPU E5-2660 v4).
> Even with the -mcx16 flag, the call is delegated to libatomic. I see the
> 'lock cmpxchg16b' in the generated assembly, but '__atomic_always_lock_free'
> and '__atomic_is_lock_free' return 0, indicating that the 128b __atomic
> intrinsics are NOT lock free with gcc 7.3.0. However, with gcc 5.4.0,
> '__atomic_is_lock_free' returns 1. I found more discussion at [3].
> (However, I am not an expert on x86.)
> 
> [3] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80878
> 
> These problems do not exist with 32b and 64b.
> 

Thanks for investigating this. If GCC doesn't generate optimal code for aarch64 (i.e. LDXP+STXP or CASP), I don't think we have a choice but to use our own implementation for 128-bit atomics, and it makes sense to model the interface after the __atomic intrinsics, as you suggested.

> >
> > > [1] https://patchwork.ozlabs.org/patch/721686/
> > > [2] https://gcc.gnu.org/ml/gcc/2017-01/msg00167.html
> > >
> > > >
> > > > > DPDK needs to write its own. Since this is the first API in that
> > > > > regard, I prefer that we start with a signature that resembles
> > > > > __atomic intrinsics which have been proven to provide best
> > > > > flexibility for
> > > > all the platforms supported by DPDK.

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2019-01-25 17:19 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-10 20:55 [dpdk-dev] [PATCH 0/3] Add non-blocking stack mempool handler Gage Eads
2019-01-10 20:55 ` [dpdk-dev] [PATCH 1/3] eal: add 128-bit cmpset (x86-64 only) Gage Eads
2019-01-13 12:18   ` Andrew Rybchenko
2019-01-14  4:29     ` Varghese, Vipin
2019-01-14 15:46       ` Eads, Gage
2019-01-16  4:34         ` Varghese, Vipin
2019-01-14 15:43     ` Eads, Gage
2019-01-10 20:55 ` [dpdk-dev] [PATCH 2/3] mempool/nb_stack: add non-blocking stack mempool Gage Eads
2019-01-13 13:31   ` Andrew Rybchenko
2019-01-14 16:22     ` Eads, Gage
2019-01-10 20:55 ` [dpdk-dev] [PATCH 3/3] doc: add NB stack comment to EAL "known issues" Gage Eads
2019-01-15 22:32 ` [dpdk-dev] [PATCH v2 0/2] Add non-blocking stack mempool handler Gage Eads
2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
2019-01-17  8:49     ` Gavin Hu (Arm Technology China)
2019-01-17 15:14       ` Eads, Gage
2019-01-17 15:57         ` Gavin Hu (Arm Technology China)
2019-01-15 22:32   ` [dpdk-dev] [PATCH v2 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
2019-01-16  7:13     ` Andrew Rybchenko
2019-01-17  8:06     ` Gavin Hu (Arm Technology China)
2019-01-17 14:11       ` Eads, Gage
2019-01-17 14:20         ` Bruce Richardson
2019-01-17 15:16           ` Eads, Gage
2019-01-17 15:42             ` Gavin Hu (Arm Technology China)
2019-01-17 20:41               ` Eads, Gage
2019-01-16 15:18   ` [dpdk-dev] [PATCH v3 0/2] Add non-blocking stack mempool handler Gage Eads
2019-01-16 15:18     ` [dpdk-dev] [PATCH v3 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
2019-01-17 15:45       ` Honnappa Nagarahalli
2019-01-17 23:03         ` Eads, Gage
2019-01-18  5:27           ` Honnappa Nagarahalli
2019-01-18 22:01             ` Eads, Gage
2019-01-22 20:30               ` Honnappa Nagarahalli
2019-01-22 22:25                 ` Eads, Gage
2019-01-24  5:21                   ` Honnappa Nagarahalli
2019-01-25 17:19                     ` Eads, Gage
2019-01-16 15:18     ` [dpdk-dev] [PATCH v3 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
2019-01-17 15:36     ` [dpdk-dev] [PATCH v4 0/2] Add non-blocking stack mempool handler Gage Eads
2019-01-17 15:36       ` [dpdk-dev] [PATCH v4 1/2] eal: add 128-bit cmpset (x86-64 only) Gage Eads
2019-01-17 15:36       ` [dpdk-dev] [PATCH v4 2/2] mempool/nb_stack: add non-blocking stack mempool Gage Eads
2019-01-18  5:05         ` Honnappa Nagarahalli
2019-01-18 20:09           ` Eads, Gage
2019-01-19  0:00           ` Eads, Gage
2019-01-19  0:15             ` Thomas Monjalon
2019-01-22 18:24               ` Eads, Gage
