DPDK patches and discussions
* [dpdk-dev] [PATCH 0/6] Add non-blocking ring
@ 2019-01-10 21:01 Gage Eads
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size Gage Eads
                   ` (6 more replies)
  0 siblings, 7 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-10 21:01 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking ring, on top of which a mempool can run.
Crucially, the non-blocking algorithm relies on a 128-bit compare-and-swap, so
it is limited to x86_64 machines.

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations and a
dequeue of n pointers uses 2. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the non-blocking ring's.
- Software caching can mitigate the average-case performance penalty of
  ring-based algorithms. For example, a non-blocking ring based mempool (a
  likely use case for this ring) can use per-thread caching.

The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-of-use,
existing ring enqueue/dequeue functions work with both "regular" and
non-blocking rings.

This patchset also adds non-blocking versions of ring_autotest and
ring_perf_autotest, and a non-blocking ring based mempool.

This patchset makes ABI changes, and thus an ABI update announcement and
deprecation cycle are required.

This patchset depends on the non-blocking stack patchset[1].

[1] http://mails.dpdk.org/archives/dev/2019-January/122923.html

Gage Eads (6):
  ring: change head and tail to pointer-width size
  ring: add a non-blocking implementation
  test_ring: add non-blocking ring autotest
  test_ring_perf: add non-blocking ring perf test
  mempool/ring: add non-blocking ring handlers
  doc: add NB ring comment to EAL "known issues"

 doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
 drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
 lib/librte_eventdev/rte_event_ring.h            |   6 +-
 lib/librte_ring/rte_ring.c                      |  53 ++-
 lib/librte_ring/rte_ring.h                      | 555 ++++++++++++++++++++++--
 lib/librte_ring/rte_ring_generic.h              |  16 +-
 lib/librte_ring/rte_ring_version.map            |   7 +
 test/test/test_ring.c                           |  57 ++-
 test/test/test_ring_perf.c                      |  19 +-
 9 files changed, 689 insertions(+), 84 deletions(-)

-- 
2.13.6


* [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-10 21:01 [dpdk-dev] [PATCH 0/6] Add non-blocking ring Gage Eads
@ 2019-01-10 21:01 ` Gage Eads
  2019-01-11  4:38   ` Stephen Hemminger
                     ` (2 more replies)
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 2/6] ring: add a non-blocking implementation Gage Eads
                   ` (5 subsequent siblings)
  6 siblings, 3 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-10 21:01 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

For 64-bit architectures, doubling the head and tail index widths greatly
increases the time it takes for them to wrap around (with current CPU
speeds, it won't happen within the author's lifetime). This is important in
avoiding the ABA problem -- in which a thread mistakes reading the same
tail index in two accesses to mean that the ring was not modified in the
intervening time -- in the upcoming non-blocking ring implementation. Using
a 64-bit index makes the possibility of this occurring effectively zero.
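
As a rough worked example of the wrap-around claim above (a sketch only: the
rate of one billion index increments per second is an assumed figure, not a
measurement from this patch, and seconds_to_wrap() is a hypothetical helper):

```c
#include <assert.h>

/* Hypothetical helper: seconds until an index of the given bit width wraps,
 * assuming it is incremented at a constant rate (increments per second).
 */
static double
seconds_to_wrap(unsigned int bits, double increments_per_sec)
{
	double states = 1.0;
	unsigned int i;

	assert(increments_per_sec > 0.0);

	/* 2^bits distinct index values before the index repeats */
	for (i = 0; i < bits; i++)
		states *= 2.0;

	return states / increments_per_sec;
}
```

At the assumed rate, seconds_to_wrap(32, 1e9) is a little over 4 seconds,
while seconds_to_wrap(64, 1e9) is about 1.8e10 seconds, i.e. several
centuries.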

I tested this commit's performance impact with an x86_64 build on a
dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change made
no significant difference -- the few differences appear to be system noise.
(The test ran on isolcpus cores using a tickless scheduler, but some
variation was still observed.) Each test was run three times and the results
were averaged:

                                  | 64b head/tail cycle cost minus
             Test                 |     32b head/tail cycle cost
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | 0.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 1.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | -1.00
SC empty dequeue                  | 0.01
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | -0.36
MP/MC bulk enq/dequeue (size 8)   | 0.99
SP/SC bulk enq/dequeue (size 32)  | -0.40
MP/MC bulk enq/dequeue (size 32)  | -0.57

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | -0.49
MP/MC bulk enq/dequeue (size 8)   | 0.19
SP/SC bulk enq/dequeue (size 32)  | -0.28
MP/MC bulk enq/dequeue (size 32)  | -0.62

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | 3.25
MP/MC bulk enq/dequeue (size 8)   | 1.87
SP/SC bulk enq/dequeue (size 32)  | -0.44
MP/MC bulk enq/dequeue (size 32)  | -1.10

An earlier version of this patch changed the head and tail indexes to
uint64_t, but that caused a performance drop on 32-bit builds. With
uintptr_t, no performance difference is observed on an i686 build.
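
The pointer-width index and the size-generic compare-and-swap can be
illustrated with a minimal sketch. The struct and function names below are
hypothetical stand-ins, not the rte_ring definitions, and capacity checks are
omitted. __sync_bool_compare_and_swap is a GCC/Clang built-in, so this
assumes one of those compilers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for struct rte_ring_headtail (illustration only). */
struct headtail_sketch {
	volatile uintptr_t head;
	volatile uintptr_t tail;
};

/* Reserve n slots by advancing head; returns the old head value.
 * The built-in is generic over the operand size, so the same line compiles
 * whether uintptr_t is 32 or 64 bits wide.
 */
static uintptr_t
reserve_slots(struct headtail_sketch *ht, unsigned int n)
{
	uintptr_t old_head, new_head;

	assert(ht != NULL);

	do {
		old_head = ht->head;
		new_head = old_head + n;
	} while (!__sync_bool_compare_and_swap(&ht->head, old_head, new_head));

	return old_head;
}
```

On a 32-bit build the index stays 32 bits wide, which is consistent with the
i686 observation above; on a 64-bit build it widens without any source
change.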

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_eventdev/rte_event_ring.h |  6 +++---
 lib/librte_ring/rte_ring.c           | 10 +++++-----
 lib/librte_ring/rte_ring.h           | 20 ++++++++++----------
 lib/librte_ring/rte_ring_generic.h   | 16 +++++++++-------
 4 files changed, 27 insertions(+), 25 deletions(-)

diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h
index 827a3209e..eae70f904 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2016-2017 Intel Corporation
+ * Copyright(c) 2016-2019 Intel Corporation
  */
 
 /**
@@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
 		const struct rte_event *events,
 		unsigned int n, uint16_t *free_space)
 {
-	uint32_t prod_head, prod_next;
+	uintptr_t prod_head, prod_next;
 	uint32_t free_entries;
 
 	n = __rte_ring_move_prod_head(&r->r, r->r.prod.single, n,
@@ -129,7 +129,7 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
 		struct rte_event *events,
 		unsigned int n, uint16_t *available)
 {
-	uint32_t cons_head, cons_next;
+	uintptr_t cons_head, cons_next;
 	uint32_t entries;
 
 	n = __rte_ring_move_cons_head(&r->r, r->r.cons.single, n,
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d215acecc..b15ee0eb3 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2015 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007,2008 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -227,10 +227,10 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
 	fprintf(f, "  flags=%x\n", r->flags);
 	fprintf(f, "  size=%"PRIu32"\n", r->size);
 	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
-	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
-	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
-	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
-	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	fprintf(f, "  ct=%"PRIuPTR"\n", r->cons.tail);
+	fprintf(f, "  ch=%"PRIuPTR"\n", r->cons.head);
+	fprintf(f, "  pt=%"PRIuPTR"\n", r->prod.tail);
+	fprintf(f, "  ph=%"PRIuPTR"\n", r->prod.head);
 	fprintf(f, "  used=%u\n", rte_ring_count(r));
 	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
 }
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index af5444a9f..12af64e13 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -65,8 +65,8 @@ struct rte_memzone; /* forward declaration, so as not to require memzone.h */
 
 /* structure to hold a pair of head/tail values and other metadata */
 struct rte_ring_headtail {
-	volatile uint32_t head;  /**< Prod/consumer head. */
-	volatile uint32_t tail;  /**< Prod/consumer tail. */
+	volatile uintptr_t head;  /**< Prod/consumer head. */
+	volatile uintptr_t tail;  /**< Prod/consumer tail. */
 	uint32_t single;         /**< True if single prod/cons */
 };
 
@@ -242,7 +242,7 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #define ENQUEUE_PTRS(r, ring_start, prod_head, obj_table, n, obj_type) do { \
 	unsigned int i; \
 	const uint32_t size = (r)->size; \
-	uint32_t idx = prod_head & (r)->mask; \
+	uintptr_t idx = prod_head & (r)->mask; \
 	obj_type *ring = (obj_type *)ring_start; \
 	if (likely(idx + n < size)) { \
 		for (i = 0; i < (n & ((~(unsigned)0x3))); i+=4, idx+=4) { \
@@ -272,7 +272,7 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
  * single and multi consumer dequeue functions */
 #define DEQUEUE_PTRS(r, ring_start, cons_head, obj_table, n, obj_type) do { \
 	unsigned int i; \
-	uint32_t idx = cons_head & (r)->mask; \
+	uintptr_t idx = cons_head & (r)->mask; \
 	const uint32_t size = (r)->size; \
 	obj_type *ring = (obj_type *)ring_start; \
 	if (likely(idx + n < size)) { \
@@ -338,7 +338,7 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
 		 unsigned int n, enum rte_ring_queue_behavior behavior,
 		 unsigned int is_sp, unsigned int *free_space)
 {
-	uint32_t prod_head, prod_next;
+	uintptr_t prod_head, prod_next;
 	uint32_t free_entries;
 
 	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
@@ -380,7 +380,7 @@ __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
 		 unsigned int n, enum rte_ring_queue_behavior behavior,
 		 unsigned int is_sc, unsigned int *available)
 {
-	uint32_t cons_head, cons_next;
+	uintptr_t cons_head, cons_next;
 	uint32_t entries;
 
 	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
@@ -681,9 +681,9 @@ rte_ring_dequeue(struct rte_ring *r, void **obj_p)
 static inline unsigned
 rte_ring_count(const struct rte_ring *r)
 {
-	uint32_t prod_tail = r->prod.tail;
-	uint32_t cons_tail = r->cons.tail;
-	uint32_t count = (prod_tail - cons_tail) & r->mask;
+	uintptr_t prod_tail = r->prod.tail;
+	uintptr_t cons_tail = r->cons.tail;
+	uintptr_t count = (prod_tail - cons_tail) & r->mask;
 	return (count > r->capacity) ? r->capacity : count;
 }
 
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..3fd1150f6 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -11,7 +11,7 @@
 #define _RTE_RING_GENERIC_H_
 
 static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
+update_tail(struct rte_ring_headtail *ht, uintptr_t old_val, uintptr_t new_val,
 		uint32_t single, uint32_t enqueue)
 {
 	if (enqueue)
@@ -55,7 +55,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 static __rte_always_inline unsigned int
 __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		unsigned int n, enum rte_ring_queue_behavior behavior,
-		uint32_t *old_head, uint32_t *new_head,
+		uintptr_t *old_head, uintptr_t *new_head,
 		uint32_t *free_entries)
 {
 	const uint32_t capacity = r->capacity;
@@ -93,7 +93,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
-			success = rte_atomic32_cmpset(&r->prod.head,
+			/* Built-in used to handle variable-sized head index. */
+			success = __sync_bool_compare_and_swap(&r->prod.head,
 					*old_head, *new_head);
 	} while (unlikely(success == 0));
 	return n;
@@ -125,7 +126,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 static __rte_always_inline unsigned int
 __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 		unsigned int n, enum rte_ring_queue_behavior behavior,
-		uint32_t *old_head, uint32_t *new_head,
+		uintptr_t *old_head, uintptr_t *new_head,
 		uint32_t *entries)
 {
 	unsigned int max = n;
@@ -161,8 +162,9 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
-			success = rte_atomic32_cmpset(&r->cons.head, *old_head,
-					*new_head);
+			/* Built-in used to handle variable-sized head index. */
+			success = __sync_bool_compare_and_swap(&r->cons.head,
+					*old_head, *new_head);
 	} while (unlikely(success == 0));
 	return n;
 }
-- 
2.13.6


* [dpdk-dev] [PATCH 2/6] ring: add a non-blocking implementation
  2019-01-10 21:01 [dpdk-dev] [PATCH 0/6] Add non-blocking ring Gage Eads
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size Gage Eads
@ 2019-01-10 21:01 ` Gage Eads
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 3/6] test_ring: add non-blocking ring autotest Gage Eads
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-10 21:01 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

This commit adds support for non-blocking circular ring enqueue and dequeue
functions. The ring uses a 128-bit compare-and-swap instruction, and thus
is limited to x86_64.

The algorithm is based on the original rte ring (derived from FreeBSD's
bufring.h) and inspired by Michael and Scott's non-blocking concurrent
queue. Importantly, it adds a modification counter to each ring entry to
ensure only one thread can write to an unused entry.

-----
Algorithm:

Multi-producer non-blocking enqueue:
1. Move the producer head index 'n' locations forward, effectively
   reserving 'n' locations.
2. For each pointer:
 a. Read the producer tail index, then ring[tail]. If ring[tail]'s
    modification counter isn't 'tail', retry.
 b. Construct the new entry: {pointer, tail + ring size}
 c. Compare-and-swap the old entry with the new. If unsuccessful, the
    next loop iteration will try to enqueue this pointer again.
 d. Compare-and-swap the tail index with 'tail + 1', whether or not step 2c
    succeeded. This guarantees threads can make forward progress.

Multi-consumer non-blocking dequeue:
1. Move the consumer head index 'n' locations forward, effectively
   reserving 'n' pointers to be dequeued.
2. Copy 'n' pointers into the caller's object table (ignoring the
   modification counter), starting from ring[tail], then compare-and-swap
   the tail index with 'tail + n'.  If unsuccessful, repeat step 2.
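
The multi-producer enqueue steps above can be sketched in miniature. This is
an illustration under stated assumptions, not the patch's code: every name
below is hypothetical, a 32-bit payload and a 32-bit modification counter are
packed into one 64-bit word so that a 64-bit compare-and-swap stands in for
the 128-bit one, and no consumer is modeled, so the ring never drains.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RING_SIZE 8u /* must be a power of two */
#define SKETCH_RING_MASK (SKETCH_RING_SIZE - 1u)

/* Hypothetical miniature ring; each entry is {cnt:32 | payload:32}. */
struct sketch_ring {
	volatile uint32_t prod_head;
	volatile uint32_t prod_tail;
	volatile uint64_t entry[SKETCH_RING_SIZE];
};

static uint64_t
sketch_pack(uint32_t payload, uint32_t cnt)
{
	return ((uint64_t)cnt << 32) | payload;
}

/* Entry i starts with modification counter i, matching the first tail
 * value that will target it.
 */
static void
sketch_ring_init(struct sketch_ring *r)
{
	uint32_t i;

	r->prod_head = 0;
	r->prod_tail = 0;
	for (i = 0; i < SKETCH_RING_SIZE; i++)
		r->entry[i] = sketch_pack(0, i);
}

static unsigned int
sketch_enqueue_mp(struct sketch_ring *r, const uint32_t *objs, unsigned int n)
{
	uint32_t head, tail;
	unsigned int i;

	assert(r != NULL && objs != NULL);

	/* Step 1: reserve n locations by moving the producer head. */
	do {
		head = r->prod_head;
		if (n > SKETCH_RING_SIZE - head)
			return 0; /* no room (no consumer in this sketch) */
	} while (!__sync_bool_compare_and_swap(&r->prod_head, head, head + n));

	for (i = 0; i < n; /* i advances only when the entry CAS succeeds */) {
		volatile uint64_t *slot;
		uint64_t old_entry, new_entry;

		/* Step 2a: read the tail, then the entry it points at. */
		tail = r->prod_tail;
		slot = &r->entry[tail & SKETCH_RING_MASK];
		old_entry = *slot;
		if ((uint32_t)(old_entry >> 32) != tail)
			continue; /* entry already written; retry */

		/* Step 2b: new entry is {pointer, tail + ring size}. */
		new_entry = sketch_pack(objs[i], tail + SKETCH_RING_SIZE);

		/* Step 2c: publish the entry if nobody else got there first. */
		if (__sync_bool_compare_and_swap(slot, old_entry, new_entry))
			i++;

		/* Step 2d: advance the tail whether or not 2c succeeded. */
		__sync_bool_compare_and_swap(&r->prod_tail, tail, tail + 1);
	}

	return n;
}
```

After sketch_ring_init(), enqueueing three values leaves both producer
indexes at 3, and each written entry carries a counter equal to its old tail
value plus the ring size.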

-----
Discussion:

There are two cases where the ABA problem is mitigated:
1. Enqueueing a pointer to the ring: without a modification counter
   tied to the tail index, the index could become stale by the time the
   enqueue happens, causing it to overwrite valid data. Tying the
   counter to the tail index gives us an expected value (as opposed to,
   say, a monotonically incrementing counter).

   Since the counter will eventually wrap, there is potential for the ABA
   problem. However, using a 64-bit counter makes this likelihood
   effectively zero.

2. Updating a tail index: the ABA problem can occur if the thread is
   preempted and the tail index wraps around. However, using 64-bit indexes
   makes this likelihood effectively zero.

With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations
and a dequeue of n pointers uses 2. This algorithm has worse average-case
performance than the regular rte ring (particularly a highly-contended ring
with large bulk accesses), however:
- For applications with preemptible pthreads, the regular rte ring's
  worst-case performance (i.e. one thread being preempted in the
  update_tail() critical section) is much worse than the non-blocking
  ring's.
- Software caching can mitigate the average-case performance penalty of
  ring-based algorithms. For example, a non-blocking ring based mempool (a
  likely use case for this ring) can use per-thread caching.
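
The uncontended CAS counts quoted above follow from the algorithm: an enqueue
performs one head CAS plus an entry CAS and a tail CAS per pointer, and a
dequeue performs one head CAS and one tail CAS. A small sketch of that
arithmetic (the function names are hypothetical, for illustration only):

```c
#include <assert.h>

/* Uncontended CAS count for a non-blocking enqueue of n pointers:
 * 1 (head) + n * (1 entry CAS + 1 tail CAS).
 */
static unsigned int
nb_enqueue_cas_ops(unsigned int n)
{
	return 1 + 2 * n;
}

/* Uncontended CAS count for a non-blocking dequeue of n pointers:
 * 1 (head) + 1 (tail), independent of n.
 */
static unsigned int
nb_dequeue_cas_ops(unsigned int n)
{
	(void)n;
	return 2;
}
```

For a bulk size of 32 this gives 65 CAS operations per enqueue against 2 per
dequeue, which is why large bulk accesses are singled out above.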

The non-blocking ring is enabled via a new flag, RING_F_NB. Because the
ring's memsize is now a function of its flags (the non-blocking ring
requires 128b for each entry), this commit adds a new argument ('flags') to
rte_ring_get_memsize().
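
A rough sketch of how the entry size changes the memory footprint. This is a
simplified stand-in rather than the function in this patch: the names are
hypothetical, the header size is passed in as a parameter, a 64-byte cache
line is assumed, and -1 is returned instead of -EINVAL.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CACHE_LINE 64u /* assumed cache line size */

/* Bytes needed for a ring header of hdr_sz bytes plus count entries,
 * rounded up to a cache line. Non-blocking entries hold a pointer and a
 * counter, so they are twice as large as regular entries.
 */
static long
sketch_ring_memsize(size_t hdr_sz, unsigned int count, int non_blocking)
{
	size_t elt_sz, sz;

	/* count must be a non-zero power of 2 */
	if (count == 0 || (count & (count - 1)) != 0)
		return -1;

	elt_sz = non_blocking ? 2 * sizeof(void *) : sizeof(void *);
	sz = hdr_sz + (size_t)count * elt_sz;
	sz = (sz + SKETCH_CACHE_LINE - 1) & ~(size_t)(SKETCH_CACHE_LINE - 1);

	return (long)sz;
}
```

With a hypothetical 128-byte header and 1024 entries, the non-blocking ring
needs 1024 * sizeof(void *) more bytes than the regular one.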

For ease-of-use, existing ring enqueue and dequeue functions work on both
regular and non-blocking rings. This introduces an additional branch in
the datapath, but this should be a highly predictable branch.
ring_perf_autotest shows a negligible performance impact; it's hard to
distinguish a real difference versus system noise.

                                  | ring_perf_autotest cycles with branch -
             Test                 |   ring_perf_autotest cycles without
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | -4.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 0.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | 0.00
SC empty dequeue                  | 1.00
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | 0.49
MP/MC bulk enq/dequeue (size 8)   | 0.08
SP/SC bulk enq/dequeue (size 32)  | 0.07
MP/MC bulk enq/dequeue (size 32)  | 0.09

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | 0.19
MP/MC bulk enq/dequeue (size 8)   | -0.37
SP/SC bulk enq/dequeue (size 32)  | 0.09
MP/MC bulk enq/dequeue (size 32)  | -0.05

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | -1.96
MP/MC bulk enq/dequeue (size 8)   | 0.88
SP/SC bulk enq/dequeue (size 32)  | 0.10
MP/MC bulk enq/dequeue (size 32)  | 0.46

Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. Each test was run three
times and the results were averaged.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.c           |  43 ++-
 lib/librte_ring/rte_ring.h           | 535 +++++++++++++++++++++++++++++++++--
 lib/librte_ring/rte_ring_version.map |   7 +
 3 files changed, 554 insertions(+), 31 deletions(-)

diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index b15ee0eb3..bd1282eac 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -45,9 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_v1902(unsigned int count, unsigned int flags)
 {
-	ssize_t sz;
+	ssize_t sz, elt_sz;
 
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
@@ -57,10 +57,23 @@ rte_ring_get_memsize(unsigned count)
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	elt_sz = (flags & RING_F_NB) ? 2 * sizeof(void *) : sizeof(void *);
+
+	sz = sizeof(struct rte_ring) + count * elt_sz;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
+BIND_DEFAULT_SYMBOL(rte_ring_get_memsize, _v1902, 19.02);
+MAP_STATIC_SYMBOL(ssize_t rte_ring_get_memsize(unsigned int count,
+					       unsigned int flags),
+		  rte_ring_get_memsize_v1902);
+
+ssize_t
+rte_ring_get_memsize_v20(unsigned int count)
+{
+	return rte_ring_get_memsize_v1902(count, 0);
+}
+VERSION_SYMBOL(rte_ring_get_memsize, _v20, 2.0);
 
 int
 rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
@@ -103,6 +116,20 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	r->prod.head = r->cons.head = 0;
 	r->prod.tail = r->cons.tail = 0;
 
+	if (flags & RING_F_NB) {
+		uint64_t i;
+
+		for (i = 0; i < r->size; i++) {
+			struct nb_ring_entry *ring_ptr, *base;
+
+			base = ((struct nb_ring_entry *) &r[1]);
+
+			ring_ptr = &base[i & r->mask];
+
+			ring_ptr->cnt = i;
+		}
+	}
+
 	return 0;
 }
 
@@ -123,11 +150,19 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 
 	ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
 
+#if !defined(RTE_ARCH_X86_64)
+	if (flags & RING_F_NB) {
+		printf("RING_F_NB is only supported on x86-64 platforms\n");
+		rte_errno = EINVAL;
+		return NULL;
+	}
+#endif
+
 	/* for an exact size ring, round up from count to a power of two */
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize(count, flags);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 12af64e13..95bcdc4db 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -117,6 +117,18 @@ struct rte_ring {
  */
 #define RING_F_EXACT_SZ 0x0004
 #define RTE_RING_SZ_MASK  (0x7fffffffU) /**< Ring size mask */
+/**
+ * The ring uses non-blocking enqueue and dequeue functions. These functions
+ * do not have the "non-preemptive" constraint of a regular rte ring, and thus
+ * are suited for applications using preemptible pthreads. However, the
+ * non-blocking functions have worse average-case performance than their
+ * regular rte ring counterparts. When used as the handler for a mempool,
+ * per-thread caching can mitigate the performance difference by reducing the
+ * number (and contention) of ring accesses.
+ *
+ * This flag is only supported on x86_64 platforms.
+ */
+#define RING_F_NB 0x0008
 
 /* @internal defines for passing to the enqueue dequeue worker functions */
 #define __IS_SP 1
@@ -134,11 +146,15 @@ struct rte_ring {
  *
  * @param count
  *   The number of elements in the ring (must be a power of 2).
+ * @param flags
+ *   The flags the ring will be created with.
  * @return
  *   - The memory size needed for the ring on success.
  *   - -EINVAL if count is not a power of 2.
  */
-ssize_t rte_ring_get_memsize(unsigned count);
+ssize_t rte_ring_get_memsize(unsigned int count, unsigned int flags);
+ssize_t rte_ring_get_memsize_v20(unsigned int count);
+ssize_t rte_ring_get_memsize_v1902(unsigned int count, unsigned int flags);
 
 /**
  * Initialize a ring structure.
@@ -171,6 +187,10 @@ ssize_t rte_ring_get_memsize(unsigned count);
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
+ *      non-blocking variants of the dequeue and enqueue functions.
  * @return
  *   0 on success, or a negative value on error.
  */
@@ -206,12 +226,17 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
+ *      non-blocking variants of the dequeue and enqueue functions.
  * @return
  *   On success, the pointer to the new allocated ring. NULL on error with
  *    rte_errno set appropriately. Possible errno values include:
  *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
  *    - E_RTE_SECONDARY - function was called from a secondary process instance
- *    - EINVAL - count provided is not a power of 2
+ *    - EINVAL - count provided is not a power of 2, or RING_F_NB is used on an
+ *      unsupported platform
  *    - ENOSPC - the maximum number of memzones has already been allocated
  *    - EEXIST - a memzone with the same name already exists
  *    - ENOMEM - no appropriate memory area found in which to create memzone
@@ -267,6 +292,50 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual enqueue of pointers on the ring.
+ * Used only by the single-producer non-blocking enqueue function, but
+ * out-lined here for code readability.
+ */
+#define ENQUEUE_PTRS_NB(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uintptr_t idx = prod_head & (r)->mask; \
+	uintptr_t new_cnt = prod_head + size; \
+	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
+			ring[idx].ptr = obj_table[i]; \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx + 1].ptr = obj_table[i + 1]; \
+			ring[idx + 1].cnt = new_cnt + i + 1;  \
+			ring[idx + 2].ptr = obj_table[i + 2]; \
+			ring[idx + 2].cnt = new_cnt + i + 2;  \
+			ring[idx + 3].ptr = obj_table[i + 3]; \
+			ring[idx + 3].cnt = new_cnt + i + 3;  \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) { \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+		for (idx = 0; i < n; i++, idx++) {    \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+	} \
+} while (0)
+
 /* the actual copy of pointers on the ring to obj_table.
  * Placed here since identical code needed in both
  * single and multi consumer dequeue functions */
@@ -298,6 +367,39 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual copy of pointers on the ring to obj_table.
+ * Placed here since identical code needed in both
+ * single and multi consumer non-blocking dequeue functions.
+ */
+#define DEQUEUE_PTRS_NB(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uintptr_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
+			obj_table[i] = ring[idx].ptr; \
+			obj_table[i + 1] = ring[idx + 1].ptr; \
+			obj_table[i + 2] = ring[idx + 2].ptr; \
+			obj_table[i + 3] = ring[idx + 3].ptr; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 2: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 1: \
+			obj_table[i++] = ring[idx++].ptr; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+	} \
+} while (0)
+
+
 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
  * There are 2 choices for the users
@@ -313,6 +415,314 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #include "rte_ring_generic.h"
 #endif
 
+/* @internal 128-bit structure used by the non-blocking ring */
+struct nb_ring_entry {
+	void *ptr; /**< Data pointer */
+	uint64_t cnt; /**< Modification counter */
+};
+
+/* The non-blocking ring algorithm is based on the original rte ring (derived
+ * from FreeBSD's bufring.h) and inspired by Michael and Scott's non-blocking
+ * concurrent queue.
+ */
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uintptr_t head, next;
+	uint32_t free_entries;
+
+	n = __rte_ring_move_prod_head(r, 1, n, behavior,
+				      &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	r->prod.tail += n;
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#ifdef RTE_ARCH_X86_64
+	uintptr_t head, next, tail;
+	uint32_t free_entries;
+	unsigned int i;
+
+	n = __rte_ring_move_prod_head(r, 0, n, behavior,
+				      &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
+		struct nb_ring_entry old_value, new_value;
+		struct nb_ring_entry *ring_ptr;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		tail = r->prod.tail;
+
+		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r->mask];
+
+		old_value = *ring_ptr;
+
+		/* If the tail entry's modification counter doesn't match the
+		 * producer tail index, it's already been updated.
+		 */
+		if ((old_value.cnt) != tail)
+			continue;
+
+		/* Prepare the new entry. The cnt field mitigates the ABA
+		 * problem on the ring write.
+		 */
+		new_value.ptr = obj_table[i];
+		new_value.cnt = tail + r->size;
+
+		if (rte_atomic128_cmpset((volatile void *)ring_ptr,
+					 (uint64_t *)&old_value,
+					 (uint64_t *)&new_value))
+			i++;
+
+		/* Every thread attempts the cmpset, so they don't have to wait
+		 * for the thread that successfully enqueued to the ring.
+		 * Using a 64-bit tail mitigates the ABA problem here.
+		 *
+		 * Built-in used to handle variable-sized tail index.
+		 */
+		__sync_bool_compare_and_swap(&r->prod.tail, tail, tail + 1);
+	}
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+#else
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+	return 0;
+#endif
+}
+
+/**
+ * @internal Enqueue several objects on the non-blocking ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of free space in the ring after the enqueue has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue(struct rte_ring *r, void * const *obj_table,
+			 unsigned int n, enum rte_ring_queue_behavior behavior,
+			 unsigned int is_sp, unsigned int *free_space)
+{
+	if (is_sp)
+		return __rte_ring_do_nb_enqueue_sp(r, obj_table, n,
+						   behavior, free_space);
+	else
+		return __rte_ring_do_nb_enqueue_mp(r, obj_table, n,
+						   behavior, free_space);
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t head, next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, 1, n, behavior,
+				      &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	r->cons.tail += n;
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t head, next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, 0, n, behavior,
+				      &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	while (1) {
+		uintptr_t tail = r->cons.tail;
+
+		/* Dequeue from the cons tail onwards. If multiple threads read
+		 * the same pointers, the thread that successfully performs the
+		 * CAS will keep them and the other(s) will retry.
+		 */
+		DEQUEUE_PTRS_NB(r, &r[1], tail, obj_table, n);
+
+		next = tail + n;
+
+		/* Built-in used to handle variable-sized tail index. */
+		if (__sync_bool_compare_and_swap(&r->cons.tail, tail, next)) {
+			/* There is potential for the ABA problem here, but
+			 * that is mitigated by the large (64-bit) tail.
+			 */
+			break;
+		}
+	}
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * @internal Dequeue several objects from the non-blocking ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue(struct rte_ring *r, void **obj_table,
+		 unsigned int n, enum rte_ring_queue_behavior behavior,
+		 unsigned int is_sc, unsigned int *available)
+{
+	if (is_sc)
+		return __rte_ring_do_nb_dequeue_sc(r, obj_table, n,
+						   behavior, available);
+	else
+		return __rte_ring_do_nb_dequeue_mc(r, obj_table, n,
+						   behavior, available);
+}
+
 /**
  * @internal Enqueue several objects on the ring
  *
@@ -420,8 +830,14 @@ static __rte_always_inline unsigned int
 rte_ring_mp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MP,
+					     free_space);
 }
 
 /**
@@ -443,8 +859,14 @@ static __rte_always_inline unsigned int
 rte_ring_sp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SP,
+					     free_space);
 }
 
 /**
@@ -470,8 +892,14 @@ static __rte_always_inline unsigned int
 rte_ring_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->prod.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -554,8 +982,14 @@ static __rte_always_inline unsigned int
 rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MC,
+					     available);
 }
 
 /**
@@ -578,8 +1012,14 @@ static __rte_always_inline unsigned int
 rte_ring_sc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SC,
+					     available);
 }
 
 /**
@@ -605,8 +1045,14 @@ static __rte_always_inline unsigned int
 rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
 		unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-				r->cons.single, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->cons.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->cons.single, available);
 }
 
 /**
@@ -803,8 +1249,14 @@ static __rte_always_inline unsigned
 rte_ring_mp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MP, free_space);
 }
 
 /**
@@ -826,8 +1278,14 @@ static __rte_always_inline unsigned
 rte_ring_sp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SP, free_space);
 }
 
 /**
@@ -853,8 +1311,14 @@ static __rte_always_inline unsigned
 rte_ring_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_VARIABLE,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->prod.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -881,8 +1345,14 @@ static __rte_always_inline unsigned
 rte_ring_mc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MC, available);
 }
 
 /**
@@ -906,8 +1376,14 @@ static __rte_always_inline unsigned
 rte_ring_sc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SC, available);
 }
 
 /**
@@ -933,9 +1409,14 @@ static __rte_always_inline unsigned
 rte_ring_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-				RTE_RING_QUEUE_VARIABLE,
-				r->cons.single, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->cons.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->cons.single, available);
 }
 
 #ifdef __cplusplus
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index d935efd0d..8969467af 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -17,3 +17,10 @@ DPDK_2.2 {
 	rte_ring_free;
 
 } DPDK_2.0;
+
+DPDK_19.05 {
+	global:
+
+	rte_ring_get_memsize;
+
+} DPDK_2.2;
-- 
2.13.6
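
Reader's note, not part of the patch: the per-slot counter check in
__rte_ring_do_nb_enqueue_mp() can be modelled in portable C11. The sketch below
is a simplified stand-in under stated assumptions, not the patch's
implementation. It packs a 32-bit counter and a 32-bit value into one 64-bit
word so that a plain 64-bit CAS can stand in for rte_atomic128_cmpset, it
enqueues one value per call, and it skips the free-space reservation done by
__rte_ring_move_prod_head(). All demo_* names are hypothetical. With 32-bit
indexes the model is exposed to the wrap-around ABA hazard that the series
avoids by widening the indexes.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define DEMO_SIZE 8u                  /* hypothetical; must be a power of two */
#define DEMO_MASK (DEMO_SIZE - 1u)

struct demo_ring {
	/* Each slot: high 32 bits = modification counter, low 32 bits = value. */
	_Atomic uint64_t slot[DEMO_SIZE];
	_Atomic uint32_t prod_tail;
};

static uint64_t demo_pack(uint32_t cnt, uint32_t val)
{
	return ((uint64_t)cnt << 32) | val;
}

/* Slot i starts with cnt == i, so the first lap of producers matches it. */
static void demo_init(struct demo_ring *r)
{
	for (uint32_t i = 0; i < DEMO_SIZE; i++)
		atomic_init(&r->slot[i], demo_pack(i, 0));
	atomic_init(&r->prod_tail, 0);
}

/* Enqueue one value. Assumes the caller already knows there is room. */
static void demo_enqueue(struct demo_ring *r, uint32_t val)
{
	for (;;) {
		uint32_t tail = atomic_load(&r->prod_tail);
		_Atomic uint64_t *slot = &r->slot[tail & DEMO_MASK];
		uint64_t old = atomic_load(slot);

		/* A counter that differs from tail means the slot was already
		 * written for this tail value; reload the tail and retry, as
		 * the patch does.
		 */
		if ((uint32_t)(old >> 32) != tail)
			continue;

		/* The new counter is what a producer one lap later expects. */
		uint64_t fresh = demo_pack(tail + DEMO_SIZE, val);
		int won = atomic_compare_exchange_strong(slot, &old, fresh);

		/* Winner or loser, try to move the tail forward so that no
		 * thread has to wait for the one that wrote the slot.
		 */
		uint32_t expected = tail;
		atomic_compare_exchange_strong(&r->prod_tail, &expected,
					       tail + 1);

		if (won)
			return;
	}
}
```

After demo_init(), two calls to demo_enqueue() leave prod_tail at 2, and slots
0 and 1 carry the counters DEMO_SIZE and DEMO_SIZE + 1, which are the tail
values at which those slots come up again on the next lap.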

* [dpdk-dev] [PATCH 3/6] test_ring: add non-blocking ring autotest
  2019-01-10 21:01 [dpdk-dev] [PATCH 0/6] Add non-blocking ring Gage Eads
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size Gage Eads
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 2/6] ring: add a non-blocking implementation Gage Eads
@ 2019-01-10 21:01 ` Gage Eads
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 4/6] test_ring_perf: add non-blocking ring perf test Gage Eads
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-10 21:01 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

ring_nb_autotest re-uses the ring_autotest code by wrapping its top-level
function with one that takes a 'flags' argument.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring.c | 57 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/test/test/test_ring.c b/test/test/test_ring.c
index aaf1e70ad..ff410d978 100644
--- a/test/test/test_ring.c
+++ b/test/test/test_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <string.h>
@@ -601,18 +601,20 @@ test_ring_burst_basic(struct rte_ring *r)
  * it will always fail to create ring with a wrong ring size number in this function
  */
 static int
-test_ring_creation_with_wrong_size(void)
+test_ring_creation_with_wrong_size(unsigned int flags)
 {
 	struct rte_ring * rp = NULL;
 
 	/* Test if ring size is not power of 2 */
-	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1,
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
 
 	/* Test if ring size is exceeding the limit */
-	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1), SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1),
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
@@ -623,11 +625,11 @@ test_ring_creation_with_wrong_size(void)
  * it tests if it would always fail to create ring with an used ring name
  */
 static int
-test_ring_creation_with_an_used_name(void)
+test_ring_creation_with_an_used_name(unsigned int flags)
 {
 	struct rte_ring * rp;
 
-	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (NULL != rp)
 		return -1;
 
@@ -639,10 +641,10 @@ test_ring_creation_with_an_used_name(void)
  * function to fail correctly
  */
 static int
-test_create_count_odd(void)
+test_create_count_odd(unsigned int flags)
 {
 	struct rte_ring *r = rte_ring_create("test_ring_count",
-			4097, SOCKET_ID_ANY, 0 );
+			4097, SOCKET_ID_ANY, flags);
 	if(r != NULL){
 		return -1;
 	}
@@ -665,7 +667,7 @@ test_lookup_null(void)
  * it tests some more basic ring operations
  */
 static int
-test_ring_basic_ex(void)
+test_ring_basic_ex(unsigned int flags)
 {
 	int ret = -1;
 	unsigned i;
@@ -679,7 +681,7 @@ test_ring_basic_ex(void)
 	}
 
 	rp = rte_ring_create("test_ring_basic_ex", RING_SIZE, SOCKET_ID_ANY,
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (rp == NULL) {
 		printf("test_ring_basic_ex fail to create ring\n");
 		goto fail_test;
@@ -737,7 +739,7 @@ test_ring_basic_ex(void)
 }
 
 static int
-test_ring_with_exact_size(void)
+test_ring_with_exact_size(unsigned int flags)
 {
 	struct rte_ring *std_ring = NULL, *exact_sz_ring = NULL;
 	void *ptr_array[16];
@@ -746,13 +748,13 @@ test_ring_with_exact_size(void)
 	int ret = -1;
 
 	std_ring = rte_ring_create("std", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (std_ring == NULL) {
 		printf("%s: error, can't create std ring\n", __func__);
 		goto end;
 	}
 	exact_sz_ring = rte_ring_create("exact sz", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ);
+		RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ | flags);
 	if (exact_sz_ring == NULL) {
 		printf("%s: error, can't create exact size ring\n", __func__);
 		goto end;
@@ -808,17 +810,17 @@ test_ring_with_exact_size(void)
 }
 
 static int
-test_ring(void)
+__test_ring(unsigned int flags)
 {
 	struct rte_ring *r = NULL;
 
 	/* some more basic operations */
-	if (test_ring_basic_ex() < 0)
+	if (test_ring_basic_ex(flags) < 0)
 		goto test_fail;
 
 	rte_atomic32_init(&synchro);
 
-	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (r == NULL)
 		goto test_fail;
 
@@ -837,27 +839,27 @@ test_ring(void)
 		goto test_fail;
 
 	/* basic operations */
-	if ( test_create_count_odd() < 0){
+	if (test_create_count_odd(flags) < 0) {
 		printf("Test failed to detect odd count\n");
 		goto test_fail;
 	} else
 		printf("Test detected odd count\n");
 
-	if ( test_lookup_null() < 0){
+	if (test_lookup_null() < 0) {
 		printf("Test failed to detect NULL ring lookup\n");
 		goto test_fail;
 	} else
 		printf("Test detected NULL ring lookup\n");
 
 	/* test of creating ring with wrong size */
-	if (test_ring_creation_with_wrong_size() < 0)
+	if (test_ring_creation_with_wrong_size(flags) < 0)
 		goto test_fail;
 
 	/* test of creation ring with an used name */
-	if (test_ring_creation_with_an_used_name() < 0)
+	if (test_ring_creation_with_an_used_name(flags) < 0)
 		goto test_fail;
 
-	if (test_ring_with_exact_size() < 0)
+	if (test_ring_with_exact_size(flags) < 0)
 		goto test_fail;
 
 	/* dump the ring status */
@@ -873,4 +875,17 @@ test_ring(void)
 	return -1;
 }
 
+static int
+test_ring(void)
+{
+	return __test_ring(0);
+}
+
+static int
+test_nb_ring(void)
+{
+	return __test_ring(RING_F_NB);
+}
+
 REGISTER_TEST_COMMAND(ring_autotest, test_ring);
+REGISTER_TEST_COMMAND(ring_nb_autotest, test_nb_ring);
-- 
2.13.6

* [dpdk-dev] [PATCH 4/6] test_ring_perf: add non-blocking ring perf test
  2019-01-10 21:01 [dpdk-dev] [PATCH 0/6] Add non-blocking ring Gage Eads
                   ` (2 preceding siblings ...)
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 3/6] test_ring: add non-blocking ring autotest Gage Eads
@ 2019-01-10 21:01 ` Gage Eads
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 5/6] mempool/ring: add non-blocking ring handlers Gage Eads
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-10 21:01 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

ring_nb_perf_autotest re-uses the ring_perf_autotest code by wrapping its
top-level function with one that takes a 'flags' argument.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring_perf.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/test/test/test_ring_perf.c b/test/test/test_ring_perf.c
index ebb3939f5..380c4b4a1 100644
--- a/test/test/test_ring_perf.c
+++ b/test/test/test_ring_perf.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 
@@ -363,12 +363,12 @@ test_bulk_enqueue_dequeue(struct rte_ring *r)
 }
 
 static int
-test_ring_perf(void)
+__test_ring_perf(unsigned int flags)
 {
 	struct lcore_pair cores;
 	struct rte_ring *r = NULL;
 
-	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), 0);
+	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), flags);
 	if (r == NULL)
 		return -1;
 
@@ -398,4 +398,17 @@ test_ring_perf(void)
 	return 0;
 }
 
+static int
+test_ring_perf(void)
+{
+	return __test_ring_perf(0);
+}
+
+static int
+test_nb_ring_perf(void)
+{
+	return __test_ring_perf(RING_F_NB);
+}
+
 REGISTER_TEST_COMMAND(ring_perf_autotest, test_ring_perf);
+REGISTER_TEST_COMMAND(ring_nb_perf_autotest, test_nb_ring_perf);
-- 
2.13.6

* [dpdk-dev] [PATCH 5/6] mempool/ring: add non-blocking ring handlers
  2019-01-10 21:01 [dpdk-dev] [PATCH 0/6] Add non-blocking ring Gage Eads
                   ` (3 preceding siblings ...)
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 4/6] test_ring_perf: add non-blocking ring perf test Gage Eads
@ 2019-01-10 21:01 ` Gage Eads
  2019-01-13 13:43   ` Andrew Rybchenko
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 6/6] doc: add NB ring comment to EAL "known issues" Gage Eads
  2019-01-15 23:52 ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Gage Eads
  6 siblings, 1 reply; 123+ messages in thread
From: Gage Eads @ 2019-01-10 21:01 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

These handlers allow an application to create a mempool based on the
non-blocking ring, with any combination of single/multi producer/consumer.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 drivers/mempool/ring/rte_mempool_ring.c | 58 +++++++++++++++++++++++++++++++--
 1 file changed, 55 insertions(+), 3 deletions(-)
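
Reader's note, not part of the patch: the handler layout below can be pictured
as a table of named ops whose alloc callbacks differ only in the flags they
pass to a shared helper. The following is a self-contained model with
hypothetical demo_* names and a made-up flag value; it does not use the DPDK
API. In an application, the handler would presumably be selected by name, for
example with rte_mempool_set_ops_byname(mp, "ring_mp_mc_nb", NULL).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_RING_F_NB 0x8u   /* hypothetical value, for illustration only */

struct demo_pool {
	unsigned int ring_flags;  /* flags the backing ring was created with */
};

struct demo_ops {
	const char *name;
	int (*alloc)(struct demo_pool *p);
};

/* Shared helper: every handler funnels into this with its own flags. */
static int demo_alloc_common(struct demo_pool *p, unsigned int flags)
{
	p->ring_flags = flags;
	return 0;
}

static int demo_alloc(struct demo_pool *p)
{
	return demo_alloc_common(p, 0);
}

static int demo_alloc_nb(struct demo_pool *p)
{
	return demo_alloc_common(p, DEMO_RING_F_NB);
}

static const struct demo_ops demo_ops_table[] = {
	{ "ring_mp_mc",    demo_alloc    },
	{ "ring_mp_mc_nb", demo_alloc_nb },
};

/* Look a handler up by name; returns NULL when the name is unknown. */
static const struct demo_ops *demo_find_ops(const char *name)
{
	size_t n = sizeof(demo_ops_table) / sizeof(demo_ops_table[0]);

	for (size_t i = 0; i < n; i++)
		if (strcmp(demo_ops_table[i].name, name) == 0)
			return &demo_ops_table[i];
	return NULL;
}
```

Keeping the enqueue and dequeue callbacks shared between the regular and the
non-blocking tables works in the patch because the ring functions themselves
branch on the ring's flags.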

diff --git a/drivers/mempool/ring/rte_mempool_ring.c b/drivers/mempool/ring/rte_mempool_ring.c
index bc123fc52..013dac3bc 100644
--- a/drivers/mempool/ring/rte_mempool_ring.c
+++ b/drivers/mempool/ring/rte_mempool_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2016 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <stdio.h>
@@ -47,11 +47,11 @@ common_ring_get_count(const struct rte_mempool *mp)
 
 
 static int
-common_ring_alloc(struct rte_mempool *mp)
+__common_ring_alloc(struct rte_mempool *mp, int rg_flags)
 {
-	int rg_flags = 0, ret;
 	char rg_name[RTE_RING_NAMESIZE];
 	struct rte_ring *r;
+	int ret;
 
 	ret = snprintf(rg_name, sizeof(rg_name),
 		RTE_MEMPOOL_MZ_FORMAT, mp->name);
@@ -82,6 +82,18 @@ common_ring_alloc(struct rte_mempool *mp)
 	return 0;
 }
 
+static int
+common_ring_alloc(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, 0);
+}
+
+static int
+common_ring_alloc_nb(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, RING_F_NB);
+}
+
 static void
 common_ring_free(struct rte_mempool *mp)
 {
@@ -130,7 +142,47 @@ static const struct rte_mempool_ops ops_sp_mc = {
 	.get_count = common_ring_get_count,
 };
 
+static const struct rte_mempool_ops ops_mp_mc_nb = {
+	.name = "ring_mp_mc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_sc_nb = {
+	.name = "ring_sp_sc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_mp_sc_nb = {
+	.name = "ring_mp_sc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_mc_nb = {
+	.name = "ring_sp_mc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
 MEMPOOL_REGISTER_OPS(ops_mp_mc);
 MEMPOOL_REGISTER_OPS(ops_sp_sc);
 MEMPOOL_REGISTER_OPS(ops_mp_sc);
 MEMPOOL_REGISTER_OPS(ops_sp_mc);
+MEMPOOL_REGISTER_OPS(ops_mp_mc_nb);
+MEMPOOL_REGISTER_OPS(ops_sp_sc_nb);
+MEMPOOL_REGISTER_OPS(ops_mp_sc_nb);
+MEMPOOL_REGISTER_OPS(ops_sp_mc_nb);
-- 
2.13.6

* [dpdk-dev] [PATCH 6/6] doc: add NB ring comment to EAL "known issues"
  2019-01-10 21:01 [dpdk-dev] [PATCH 0/6] Add non-blocking ring Gage Eads
                   ` (4 preceding siblings ...)
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 5/6] mempool/ring: add non-blocking ring handlers Gage Eads
@ 2019-01-10 21:01 ` Gage Eads
  2019-01-11  2:51   ` Varghese, Vipin
  2019-01-15 23:52 ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Gage Eads
  6 siblings, 1 reply; 123+ messages in thread
From: Gage Eads @ 2019-01-10 21:01 UTC (permalink / raw)
  To: dev; +Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

This comment makes users aware of the non-blocking ring option and its
caveats.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 doc/guides/prog_guide/env_abstraction_layer.rst | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 9497b879c..b6ac236d6 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,7 +541,7 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
-  Alternatively, x86_64 applications can use the non-blocking stack mempool handler. When considering this handler, note that:
+  Alternatively, x86_64 applications can use the non-blocking ring or stack mempool handlers. When considering one of them, note that:
 
   - it is limited to the x86_64 platform, because it uses an instruction (16-byte compare-and-swap) that is not available on other platforms.
   - it has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of handler operations.
-- 
2.13.6

* Re: [dpdk-dev] [PATCH 6/6] doc: add NB ring comment to EAL "known issues"
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 6/6] doc: add NB ring comment to EAL "known issues" Gage Eads
@ 2019-01-11  2:51   ` Varghese, Vipin
  2019-01-11 19:30     ` Eads, Gage
  0 siblings, 1 reply; 123+ messages in thread
From: Varghese, Vipin @ 2019-01-11  2:51 UTC (permalink / raw)
  To: Eads, Gage, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin

Hi Gage,

Humble suggestion from my end: as of DPDK 19.02-rc1, documentation and code changes have to be in the same patch. Could you please take a look at this?

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> Sent: Friday, January 11, 2019 2:31 AM
> To: dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: [dpdk-dev] [PATCH 6/6] doc: add NB ring comment to EAL "known
> issues"
> 
> This comment makes users aware of the non-blocking ring option and its
> caveats.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  doc/guides/prog_guide/env_abstraction_layer.rst | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst
> b/doc/guides/prog_guide/env_abstraction_layer.rst
> index 9497b879c..b6ac236d6 100644
> --- a/doc/guides/prog_guide/env_abstraction_layer.rst
> +++ b/doc/guides/prog_guide/env_abstraction_layer.rst
> @@ -541,7 +541,7 @@ Known Issues
> 
>    5. It MUST not be used by multi-producer/consumer pthreads, whose
> scheduling policies are SCHED_FIFO or SCHED_RR.
> 
> -  Alternatively, x86_64 applications can use the non-blocking stack mempool
> handler. When considering this handler, note that:
> +  Alternatively, x86_64 applications can use the non-blocking ring or stack
> mempool handlers. When considering one of them, note that:
> 
>    - it is limited to the x86_64 platform, because it uses an instruction (16-byte
> compare-and-swap) that is not available on other platforms.
>    - it has worse average-case performance than the non-preemptive rte_ring,
> but software caching (e.g. the mempool cache) can mitigate this by reducing
> the number of handler operations.
> --
> 2.13.6

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size Gage Eads
@ 2019-01-11  4:38   ` Stephen Hemminger
  2019-01-11 19:07     ` Eads, Gage
  2019-01-11 10:25   ` Burakov, Anatoly
  2019-01-11 10:40   ` Burakov, Anatoly
  2 siblings, 1 reply; 123+ messages in thread
From: Stephen Hemminger @ 2019-01-11  4:38 UTC (permalink / raw)
  To: Gage Eads
  Cc: dev, olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

On Thu, 10 Jan 2019 15:01:17 -0600
Gage Eads <gage.eads@intel.com> wrote:

> For 64-bit architectures, doubling the head and tail index widths greatly
> increases the time it takes for them to wrap-around (with current CPU
> speeds, it won't happen within the author's lifetime). This is important in
> avoiding the ABA problem -- in which a thread mistakes reading the same
> tail index in two accesses to mean that the ring was not modified in the
> intervening time -- in the upcoming non-blocking ring implementation. Using
> a 64-bit index makes the possibility of this occurring effectively zero.
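
A rough check of the wrap-around claim quoted above, as a sketch rather than a
measurement: assuming one index update per nanosecond (an assumed rate picked
for illustration, not a figure from this thread), a 32-bit index wraps in about
4.3 seconds while a 64-bit index takes roughly 585 years. The demo_* names are
hypothetical.

```c
#include <assert.h>

/* Seconds in a Julian year (365.25 days). */
#define DEMO_SECONDS_PER_YEAR (365.25 * 24.0 * 60.0 * 60.0)

/* Time, in seconds, for an index of the given width to wrap around when it
 * is incremented updates_per_sec times per second.
 */
static double demo_wrap_seconds(unsigned int index_bits, double updates_per_sec)
{
	double span = 1.0;

	/* span = 2^index_bits, computed in floating point to avoid overflow */
	for (unsigned int i = 0; i < index_bits; i++)
		span *= 2.0;

	return span / updates_per_sec;
}
```

For example, demo_wrap_seconds(64, 1e9) / DEMO_SECONDS_PER_YEAR gives the
64-bit figure in years under that assumed rate.
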
> 
> I tested this commit's performance impact with an x86_64 build on a
> dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change made
> no significant difference -- the few differences appear to be system noise.
> (The test ran on isolcpus cores using a tickless scheduler, but some
> variation was still observed.) Each test was run three times and the results
> were averaged:
> 
>                                   | 64b head/tail cycle cost minus
>              Test                 |     32b head/tail cycle cost
> ------------------------------------------------------------------
> SP/SC single enq/dequeue          | 0.33
> MP/MC single enq/dequeue          | 0.00
> SP/SC burst enq/dequeue (size 8)  | 0.00
> MP/MC burst enq/dequeue (size 8)  | 1.00
> SP/SC burst enq/dequeue (size 32) | 0.00
> MP/MC burst enq/dequeue (size 32) | -1.00
> SC empty dequeue                  | 0.01
> MC empty dequeue                  | 0.00
> 
> Single lcore:
> SP/SC bulk enq/dequeue (size 8)   | -0.36
> MP/MC bulk enq/dequeue (size 8)   | 0.99
> SP/SC bulk enq/dequeue (size 32)  | -0.40
> MP/MC bulk enq/dequeue (size 32)  | -0.57
> 
> Two physical cores:
> SP/SC bulk enq/dequeue (size 8)   | -0.49
> MP/MC bulk enq/dequeue (size 8)   | 0.19
> SP/SC bulk enq/dequeue (size 32)  | -0.28
> MP/MC bulk enq/dequeue (size 32)  | -0.62
> 
> Two NUMA nodes:
> SP/SC bulk enq/dequeue (size 8)   | 3.25
> MP/MC bulk enq/dequeue (size 8)   | 1.87
> SP/SC bulk enq/dequeue (size 32)  | -0.44
> MP/MC bulk enq/dequeue (size 32)  | -1.10
> 
> An earlier version of this patch changed the head and tail indexes to
> uint64_t, but that caused a performance drop on 32-bit builds. With
> uintptr_t, no performance difference is observed on an i686 build.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  lib/librte_eventdev/rte_event_ring.h |  6 +++---
>  lib/librte_ring/rte_ring.c           | 10 +++++-----
>  lib/librte_ring/rte_ring.h           | 20 ++++++++++----------
>  lib/librte_ring/rte_ring_generic.h   | 16 +++++++++-------
>  4 files changed, 27 insertions(+), 25 deletions(-)
> 
> diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h
> index 827a3209e..eae70f904 100644
> --- a/lib/librte_eventdev/rte_event_ring.h
> +++ b/lib/librte_eventdev/rte_event_ring.h
> @@ -1,5 +1,5 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
> - * Copyright(c) 2016-2017 Intel Corporation
> + * Copyright(c) 2016-2019 Intel Corporation
>   */
>  
>  /**
> @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
>  		const struct rte_event *events,
>  		unsigned int n, uint16_t *free_space)
>  {
> -	uint32_t prod_head, prod_next;
> +	uintptr_t prod_head, prod_next;
>  	uint32_t free_entries;
>  
>  	n = __rte_ring_move_prod_head(&r->r, r->r.prod.single, n,
> @@ -129,7 +129,7 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
>  		struct rte_event *events,
>  		unsigned int n, uint16_t *available)
>  {
> -	uint32_t cons_head, cons_next;
> +	uintptr_t cons_head, cons_next;
>  	uint32_t entries;
>  
>  	n = __rte_ring_move_cons_head(&r->r, r->r.cons.single, n,
> diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> index d215acecc..b15ee0eb3 100644
> --- a/lib/librte_ring/rte_ring.c
> +++ b/lib/librte_ring/rte_ring.c
> @@ -1,6 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   *
> - * Copyright (c) 2010-2015 Intel Corporation
> + * Copyright (c) 2010-2019 Intel Corporation
>   * Copyright (c) 2007,2008 Kip Macy kmacy@freebsd.org
>   * All rights reserved.
>   * Derived from FreeBSD's bufring.h
> @@ -227,10 +227,10 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
>  	fprintf(f, "  flags=%x\n", r->flags);
>  	fprintf(f, "  size=%"PRIu32"\n", r->size);
>  	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
> -	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
> -	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
> -	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
> -	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
> +	fprintf(f, "  ct=%"PRIuPTR"\n", r->cons.tail);
> +	fprintf(f, "  ch=%"PRIuPTR"\n", r->cons.head);
> +	fprintf(f, "  pt=%"PRIuPTR"\n", r->prod.tail);
> +	fprintf(f, "  ph=%"PRIuPTR"\n", r->prod.head);
>  	fprintf(f, "  used=%u\n", rte_ring_count(r));
>  	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
>  }
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index af5444a9f..12af64e13 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -1,6 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   *
> - * Copyright (c) 2010-2017 Intel Corporation
> + * Copyright (c) 2010-2019 Intel Corporation
>   * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
>   * All rights reserved.
>   * Derived from FreeBSD's bufring.h
> @@ -65,8 +65,8 @@ struct rte_memzone; /* forward declaration, so as not to require memzone.h */
>  
>  /* structure to hold a pair of head/tail values and other metadata */
>  struct rte_ring_headtail {
> -	volatile uint32_t head;  /**< Prod/consumer head. */
> -	volatile uint32_t tail;  /**< Prod/consumer tail. */
> +	volatile uintptr_t head;  /**< Prod/consumer head. */
> +	volatile uintptr_t tail;  /**< Prod/consumer tail. */
>  	uint32_t single;         /**< True if single prod/cons */
>  };

Isn't this a major ABI change which will break existing applications?
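
To make the concern concrete, here is a minimal sketch (not part of the patch) that compares the two layouts. The names headtail_old, headtail_new and headtail_layout_changed() are hypothetical stand-ins for the before/after versions of struct rte_ring_headtail shown in the diff above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical copy of the layout before the patch. */
struct headtail_old {
	volatile uint32_t head;
	volatile uint32_t tail;
	uint32_t single;
};

/* Hypothetical copy of the layout after the patch. */
struct headtail_new {
	volatile uintptr_t head;
	volatile uintptr_t tail;
	uint32_t single;
};

/* Returns 1 if any field offset or the total size differs, else 0. */
static int
headtail_layout_changed(void)
{
	return offsetof(struct headtail_old, tail) !=
		       offsetof(struct headtail_new, tail) ||
	       offsetof(struct headtail_old, single) !=
		       offsetof(struct headtail_new, single) ||
	       sizeof(struct headtail_old) != sizeof(struct headtail_new);
}
```

On a typical 64-bit build the helper returns 1 (tail moves from offset 4 to offset 8 and the struct grows), so a binary compiled against the old header would read the wrong fields. On a 32-bit build where uintptr_t is 4 bytes, the two layouts should coincide.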

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size Gage Eads
  2019-01-11  4:38   ` Stephen Hemminger
@ 2019-01-11 10:25   ` Burakov, Anatoly
  2019-01-11 19:12     ` Eads, Gage
  2019-01-11 10:40   ` Burakov, Anatoly
  2 siblings, 1 reply; 123+ messages in thread
From: Burakov, Anatoly @ 2019-01-11 10:25 UTC (permalink / raw)
  To: Gage Eads, dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

On 10-Jan-19 9:01 PM, Gage Eads wrote:
> For 64-bit architectures, doubling the head and tail index widths greatly
> increases the time it takes for them to wrap-around (with current CPU
> speeds, it won't happen within the author's lifetime). This is important in
> avoiding the ABA problem -- in which a thread mistakes reading the same
> tail index in two accesses to mean that the ring was not modified in the
> intervening time -- in the upcoming non-blocking ring implementation. Using
> a 64-bit index makes the possibility of this occurring effectively zero.
> 
> I tested this commit's performance impact with an x86_64 build on a
> dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change made
> no significant difference -- the few differences appear to be system noise.
> (The test ran on isolcpus cores using a tickless scheduler, but some
> variation was still observed.) Each test was run three times and the results
> were averaged:
> 
>                                    | 64b head/tail cycle cost minus
>               Test                 |     32b head/tail cycle cost
> ------------------------------------------------------------------
> SP/SC single enq/dequeue          | 0.33
> MP/MC single enq/dequeue          | 0.00
> SP/SC burst enq/dequeue (size 8)  | 0.00
> MP/MC burst enq/dequeue (size 8)  | 1.00
> SP/SC burst enq/dequeue (size 32) | 0.00
> MP/MC burst enq/dequeue (size 32) | -1.00
> SC empty dequeue                  | 0.01
> MC empty dequeue                  | 0.00
> 
> Single lcore:
> SP/SC bulk enq/dequeue (size 8)   | -0.36
> MP/MC bulk enq/dequeue (size 8)   | 0.99
> SP/SC bulk enq/dequeue (size 32)  | -0.40
> MP/MC bulk enq/dequeue (size 32)  | -0.57
> 
> Two physical cores:
> SP/SC bulk enq/dequeue (size 8)   | -0.49
> MP/MC bulk enq/dequeue (size 8)   | 0.19
> SP/SC bulk enq/dequeue (size 32)  | -0.28
> MP/MC bulk enq/dequeue (size 32)  | -0.62
> 
> Two NUMA nodes:
> SP/SC bulk enq/dequeue (size 8)   | 3.25
> MP/MC bulk enq/dequeue (size 8)   | 1.87
> SP/SC bulk enq/dequeue (size 32)  | -0.44
> MP/MC bulk enq/dequeue (size 32)  | -1.10
> 
> An earlier version of this patch changed the head and tail indexes to
> uint64_t, but that caused a performance drop on 32-bit builds. With
> uintptr_t, no performance difference is observed on an i686 build.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---

You're breaking the ABI - version bump for affected libraries is needed.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size Gage Eads
  2019-01-11  4:38   ` Stephen Hemminger
  2019-01-11 10:25   ` Burakov, Anatoly
@ 2019-01-11 10:40   ` Burakov, Anatoly
  2019-01-11 10:58     ` Bruce Richardson
  2 siblings, 1 reply; 123+ messages in thread
From: Burakov, Anatoly @ 2019-01-11 10:40 UTC (permalink / raw)
  To: Gage Eads, dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

<...>

> + * Copyright(c) 2016-2019 Intel Corporation
>    */
>   
>   /**
> @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
>   		const struct rte_event *events,
>   		unsigned int n, uint16_t *free_space)
>   {
> -	uint32_t prod_head, prod_next;
> +	uintptr_t prod_head, prod_next;

I would also question the use of uintptr_t. I think semantically, size_t
is more appropriate.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-11 10:40   ` Burakov, Anatoly
@ 2019-01-11 10:58     ` Bruce Richardson
  2019-01-11 11:30       ` Burakov, Anatoly
  0 siblings, 1 reply; 123+ messages in thread
From: Bruce Richardson @ 2019-01-11 10:58 UTC (permalink / raw)
  To: Burakov, Anatoly
  Cc: Gage Eads, dev, olivier.matz, arybchenko, konstantin.ananyev

On Fri, Jan 11, 2019 at 10:40:19AM +0000, Burakov, Anatoly wrote:
> <...>
> 
> > + * Copyright(c) 2016-2019 Intel Corporation
> >    */
> >   /**
> > @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
> >   		const struct rte_event *events,
> >   		unsigned int n, uint16_t *free_space)
> >   {
> > -	uint32_t prod_head, prod_next;
> > +	uintptr_t prod_head, prod_next;
> 
> I would also question the use of uintptr_t. I think semantically, size_t is
> more appropriate.
> 
Yes, it would, but I believe in this case they want to use the largest size
of (unsigned)int where there exists an atomic for manipulating 2 of them
simultaneously. [The largest size is to minimize any chance of an ABA issue
occurring]. Therefore we need 32-bit values on 32-bit and 64-bit on 64, and
I suspect the best way to guarantee this is to use pointer-sized values. If
size_t is guaranteed across all OS's to have the same size as uintptr_t it
could also be used, though.
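
As a rough illustration of "an atomic for manipulating 2 of them simultaneously", the sketch below packs two 32-bit values into one 64-bit word and updates both with a single compare-and-swap. This is a generic example using C11 atomics, not the code from this patchset; the names idx_cnt_pair, pair_pack() and pair_update() are made up for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical pair: a 32-bit index and a 32-bit modification counter,
 * packed into one 64-bit word so that a single CAS covers both. */
typedef _Atomic uint64_t idx_cnt_pair;

static uint64_t
pair_pack(uint32_t idx, uint32_t cnt)
{
	return ((uint64_t)cnt << 32) | idx;
}

/* Move the index from old_idx to new_idx and bump the counter, but only
 * if both halves still hold the values the caller observed earlier. */
static bool
pair_update(idx_cnt_pair *p, uint32_t old_idx, uint32_t old_cnt,
	    uint32_t new_idx)
{
	uint64_t expected = pair_pack(old_idx, old_cnt);
	uint64_t desired = pair_pack(new_idx, old_cnt + 1);

	return atomic_compare_exchange_strong(p, &expected, desired);
}
```

If the index goes 5 -> 7 -> 5 while a thread is stalled, that thread's stale (5, 0) snapshot no longer matches the stored (5, 2), so its update fails instead of silently succeeding. The counter half can still wrap, which is why the widest value the platform can pair up atomically is preferred.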

/Bruce

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-11 10:58     ` Bruce Richardson
@ 2019-01-11 11:30       ` Burakov, Anatoly
       [not found]         ` <20190111115851.GC3336@bricha3-MOBL.ger.corp.intel.com>
  0 siblings, 1 reply; 123+ messages in thread
From: Burakov, Anatoly @ 2019-01-11 11:30 UTC (permalink / raw)
  To: Bruce Richardson
  Cc: Gage Eads, dev, olivier.matz, arybchenko, konstantin.ananyev

On 11-Jan-19 10:58 AM, Bruce Richardson wrote:
> On Fri, Jan 11, 2019 at 10:40:19AM +0000, Burakov, Anatoly wrote:
>> <...>
>>
>>> + * Copyright(c) 2016-2019 Intel Corporation
>>>     */
>>>    /**
>>> @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
>>>    		const struct rte_event *events,
>>>    		unsigned int n, uint16_t *free_space)
>>>    {
>>> -	uint32_t prod_head, prod_next;
>>> +	uintptr_t prod_head, prod_next;
>>
>> I would also question the use of uintptr_t. I think semantically, size_t is
>> more appropriate.
>>
> Yes, it would, but I believe in this case they want to use the largest size
> of (unsigned)int where there exists an atomic for manipulating 2 of them
> simultaneously. [The largest size is to minimize any chance of an ABA issue
> occurring]. Therefore we need 32-bit values on 32-bit and 64-bit on 64, and
> I suspect the best way to guarantee this is to use pointer-sized values. If
> size_t is guaranteed across all OS's to have the same size as uintptr_t it
> could also be used, though.
> 
> /Bruce
> 

Technically, size_t and uintptr_t are not guaranteed to match. In 
practice, they won't match only on architectures that DPDK doesn't 
intend to run on (such as 16-bit segmented archs, where size_t would be 
16-bit but uintptr_t would be 32-bit).

In all the rest of DPDK code, we use size_t for this kind of thing.

-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-11  4:38   ` Stephen Hemminger
@ 2019-01-11 19:07     ` Eads, Gage
  0 siblings, 0 replies; 123+ messages in thread
From: Eads, Gage @ 2019-01-11 19:07 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: dev, olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin



> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Thursday, January 10, 2019 10:39 PM
> To: Eads, Gage <gage.eads@intel.com>
> Cc: dev@dpdk.org; olivier.matz@6wind.com; arybchenko@solarflare.com;
> Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> size
> 
> On Thu, 10 Jan 2019 15:01:17 -0600
> Gage Eads <gage.eads@intel.com> wrote:
> 
> > For 64-bit architectures, doubling the head and tail index widths
> > greatly increases the time it takes for them to wrap-around (with
> > current CPU speeds, it won't happen within the author's lifetime).
> > This is important in avoiding the ABA problem -- in which a thread
> > mistakes reading the same tail index in two accesses to mean that the
> > ring was not modified in the intervening time -- in the upcoming
> > non-blocking ring implementation. Using a 64-bit index makes the
> > possibility of this occurring effectively zero.
> >
> > I tested this commit's performance impact with an x86_64 build on a
> > dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change
> > made no significant difference -- the few differences appear to be
> > system noise.
> > (The test ran on isolcpus cores using a tickless scheduler, but some
> > variation was still observed.) Each test was run three times and the
> > results were averaged:
> >
> >                                   | 64b head/tail cycle cost minus
> >              Test                 |     32b head/tail cycle cost
> > ------------------------------------------------------------------
> > SP/SC single enq/dequeue          | 0.33
> > MP/MC single enq/dequeue          | 0.00
> > SP/SC burst enq/dequeue (size 8)  | 0.00
> > MP/MC burst enq/dequeue (size 8)  | 1.00
> > SP/SC burst enq/dequeue (size 32) | 0.00
> > MP/MC burst enq/dequeue (size 32) | -1.00
> > SC empty dequeue                  | 0.01
> > MC empty dequeue                  | 0.00
> >
> > Single lcore:
> > SP/SC bulk enq/dequeue (size 8)   | -0.36
> > MP/MC bulk enq/dequeue (size 8)   | 0.99
> > SP/SC bulk enq/dequeue (size 32)  | -0.40
> > MP/MC bulk enq/dequeue (size 32)  | -0.57
> >
> > Two physical cores:
> > SP/SC bulk enq/dequeue (size 8)   | -0.49
> > MP/MC bulk enq/dequeue (size 8)   | 0.19
> > SP/SC bulk enq/dequeue (size 32)  | -0.28
> > MP/MC bulk enq/dequeue (size 32)  | -0.62
> >
> > Two NUMA nodes:
> > SP/SC bulk enq/dequeue (size 8)   | 3.25
> > MP/MC bulk enq/dequeue (size 8)   | 1.87
> > SP/SC bulk enq/dequeue (size 32)  | -0.44
> > MP/MC bulk enq/dequeue (size 32)  | -1.10
> >
> > An earlier version of this patch changed the head and tail indexes to
> > uint64_t, but that caused a performance drop on 32-bit builds. With
> > uintptr_t, no performance difference is observed on an i686 build.
> >
> > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > ---
> >  lib/librte_eventdev/rte_event_ring.h |  6 +++---
> >  lib/librte_ring/rte_ring.c           | 10 +++++-----
> >  lib/librte_ring/rte_ring.h           | 20 ++++++++++----------
> >  lib/librte_ring/rte_ring_generic.h   | 16 +++++++++-------
> >  4 files changed, 27 insertions(+), 25 deletions(-)
> >
> > diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h
> > index 827a3209e..eae70f904 100644
> > --- a/lib/librte_eventdev/rte_event_ring.h
> > +++ b/lib/librte_eventdev/rte_event_ring.h
> > @@ -1,5 +1,5 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> > - * Copyright(c) 2016-2017 Intel Corporation
> > + * Copyright(c) 2016-2019 Intel Corporation
> >   */
> >
> >  /**
> > @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
> >  		const struct rte_event *events,
> >  		unsigned int n, uint16_t *free_space)
> >  {
> > -	uint32_t prod_head, prod_next;
> > +	uintptr_t prod_head, prod_next;
> >  	uint32_t free_entries;
> >
> >  	n = __rte_ring_move_prod_head(&r->r, r->r.prod.single, n,
> > @@ -129,7 +129,7 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
> >  		struct rte_event *events,
> >  		unsigned int n, uint16_t *available)
> >  {
> > -	uint32_t cons_head, cons_next;
> > +	uintptr_t cons_head, cons_next;
> >  	uint32_t entries;
> >
> >  	n = __rte_ring_move_cons_head(&r->r, r->r.cons.single, n,
> > diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> > index d215acecc..b15ee0eb3 100644
> > --- a/lib/librte_ring/rte_ring.c
> > +++ b/lib/librte_ring/rte_ring.c
> > @@ -1,6 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   *
> > - * Copyright (c) 2010-2015 Intel Corporation
> > + * Copyright (c) 2010-2019 Intel Corporation
> >   * Copyright (c) 2007,2008 Kip Macy kmacy@freebsd.org
> >   * All rights reserved.
> >   * Derived from FreeBSD's bufring.h
> > @@ -227,10 +227,10 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
> >  	fprintf(f, "  flags=%x\n", r->flags);
> >  	fprintf(f, "  size=%"PRIu32"\n", r->size);
> >  	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
> > -	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
> > -	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
> > -	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
> > -	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
> > +	fprintf(f, "  ct=%"PRIuPTR"\n", r->cons.tail);
> > +	fprintf(f, "  ch=%"PRIuPTR"\n", r->cons.head);
> > +	fprintf(f, "  pt=%"PRIuPTR"\n", r->prod.tail);
> > +	fprintf(f, "  ph=%"PRIuPTR"\n", r->prod.head);
> >  	fprintf(f, "  used=%u\n", rte_ring_count(r));
> >  	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
> >  }
> > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > index af5444a9f..12af64e13 100644
> > --- a/lib/librte_ring/rte_ring.h
> > +++ b/lib/librte_ring/rte_ring.h
> > @@ -1,6 +1,6 @@
> >  /* SPDX-License-Identifier: BSD-3-Clause
> >   *
> > - * Copyright (c) 2010-2017 Intel Corporation
> > + * Copyright (c) 2010-2019 Intel Corporation
> >   * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
> >   * All rights reserved.
> >   * Derived from FreeBSD's bufring.h
> > @@ -65,8 +65,8 @@ struct rte_memzone; /* forward declaration, so as not to require memzone.h */
> >
> >  /* structure to hold a pair of head/tail values and other metadata */
> >  struct rte_ring_headtail {
> > -	volatile uint32_t head;  /**< Prod/consumer head. */
> > -	volatile uint32_t tail;  /**< Prod/consumer tail. */
> > +	volatile uintptr_t head;  /**< Prod/consumer head. */
> > +	volatile uintptr_t tail;  /**< Prod/consumer tail. */
> >  	uint32_t single;         /**< True if single prod/cons */
> >  };
> 
> Isn't this a major ABI change which will break existing applications?

Correct, and this patch needs to be reworked with the RTE_NEXT_ABI ifdef, as described in the versioning guidelines. I had misunderstood the ABI change procedure, but I'll fix this in v2.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-11 10:25   ` Burakov, Anatoly
@ 2019-01-11 19:12     ` Eads, Gage
  2019-01-11 19:55       ` Stephen Hemminger
  0 siblings, 1 reply; 123+ messages in thread
From: Eads, Gage @ 2019-01-11 19:12 UTC (permalink / raw)
  To: Burakov, Anatoly, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin



> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Friday, January 11, 2019 4:25 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> size
> 
> On 10-Jan-19 9:01 PM, Gage Eads wrote:
> > For 64-bit architectures, doubling the head and tail index widths
> > greatly increases the time it takes for them to wrap-around (with
> > current CPU speeds, it won't happen within the author's lifetime).
> > This is important in avoiding the ABA problem -- in which a thread
> > mistakes reading the same tail index in two accesses to mean that the
> > ring was not modified in the intervening time -- in the upcoming
> > non-blocking ring implementation. Using a 64-bit index makes the
> > possibility of this occurring effectively zero.
> >
> > I tested this commit's performance impact with an x86_64 build on a
> > dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change
> > made no significant difference -- the few differences appear to be
> > system noise.
> > (The test ran on isolcpus cores using a tickless scheduler, but some
> > variation was still observed.) Each test was run three times and the
> > results were averaged:
> >
> >                                   | 64b head/tail cycle cost minus
> >              Test                 |     32b head/tail cycle cost
> > ------------------------------------------------------------------
> > SP/SC single enq/dequeue          | 0.33
> > MP/MC single enq/dequeue          | 0.00
> > SP/SC burst enq/dequeue (size 8)  | 0.00
> > MP/MC burst enq/dequeue (size 8)  | 1.00
> > SP/SC burst enq/dequeue (size 32) | 0.00
> > MP/MC burst enq/dequeue (size 32) | -1.00
> > SC empty dequeue                  | 0.01
> > MC empty dequeue                  | 0.00
> >
> > Single lcore:
> > SP/SC bulk enq/dequeue (size 8)   | -0.36
> > MP/MC bulk enq/dequeue (size 8)   | 0.99
> > SP/SC bulk enq/dequeue (size 32)  | -0.40
> > MP/MC bulk enq/dequeue (size 32)  | -0.57
> >
> > Two physical cores:
> > SP/SC bulk enq/dequeue (size 8)   | -0.49
> > MP/MC bulk enq/dequeue (size 8)   | 0.19
> > SP/SC bulk enq/dequeue (size 32)  | -0.28
> > MP/MC bulk enq/dequeue (size 32)  | -0.62
> >
> > Two NUMA nodes:
> > SP/SC bulk enq/dequeue (size 8)   | 3.25
> > MP/MC bulk enq/dequeue (size 8)   | 1.87
> > SP/SC bulk enq/dequeue (size 32)  | -0.44
> > MP/MC bulk enq/dequeue (size 32)  | -1.10
> >
> > An earlier version of this patch changed the head and tail indexes to
> > uint64_t, but that caused a performance drop on 32-bit builds. With
> > uintptr_t, no performance difference is observed on an i686 build.
> >
> > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > ---
> 
> You're breaking the ABI - version bump for affected libraries is needed.
> 
> --
> Thanks,
> Anatoly

If I'm reading the versioning guidelines correctly, I'll need to gate the changes with the RTE_NEXT_ABI macro and provide a deprecation notice, then after a full deprecation cycle we can revert that and bump the library version. Not to mention the 3 ML ACKs.
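
For illustration, the gating might look roughly like the sketch below. This is only a guess at the shape of v2, not the actual change: ring_idx_t and headtail_sketch are hypothetical names, and RTE_NEXT_ABI is assumed to be supplied by the build configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Select the index width based on the (build-provided) RTE_NEXT_ABI
 * macro. ring_idx_t is a hypothetical name used only in this sketch. */
#ifdef RTE_NEXT_ABI
typedef uintptr_t ring_idx_t;   /* new, pointer-width indexes */
#else
typedef uint32_t ring_idx_t;    /* existing ABI: 32-bit indexes */
#endif

/* Hypothetical stand-in for struct rte_ring_headtail. */
struct headtail_sketch {
	volatile ring_idx_t head;
	volatile ring_idx_t tail;
	uint32_t single;
};
```

With the macro undefined, the struct keeps today's layout, so existing binaries are unaffected until the new ABI is explicitly enabled.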

I'll address this in v2.

Thanks,
Gage

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
       [not found]         ` <20190111115851.GC3336@bricha3-MOBL.ger.corp.intel.com>
@ 2019-01-11 19:27           ` Eads, Gage
  2019-01-21 14:14             ` Burakov, Anatoly
  0 siblings, 1 reply; 123+ messages in thread
From: Eads, Gage @ 2019-01-11 19:27 UTC (permalink / raw)
  To: Richardson, Bruce, Burakov, Anatoly
  Cc: dev, olivier.matz, arybchenko, Ananyev, Konstantin



> -----Original Message-----
> From: Richardson, Bruce
> Sent: Friday, January 11, 2019 5:59 AM
> To: Burakov, Anatoly <anatoly.burakov@intel.com>
> Cc: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org;
> olivier.matz@6wind.com; arybchenko@solarflare.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> size
> 
> On Fri, Jan 11, 2019 at 11:30:24AM +0000, Burakov, Anatoly wrote:
> > On 11-Jan-19 10:58 AM, Bruce Richardson wrote:
> > > On Fri, Jan 11, 2019 at 10:40:19AM +0000, Burakov, Anatoly wrote:
> > > > <...>
> > > >
> > > > > + * Copyright(c) 2016-2019 Intel Corporation
> > > > >     */
> > > > >    /**
> > > > > @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct
> rte_event_ring *r,
> > > > >    		const struct rte_event *events,
> > > > >    		unsigned int n, uint16_t *free_space)
> > > > >    {
> > > > > -	uint32_t prod_head, prod_next;
> > > > > +	uintptr_t prod_head, prod_next;
> > > >
> > > > > I would also question the use of uintptr_t. I think semantically,
> > > > > size_t is more appropriate.
> > > >
> > > Yes, it would, but I believe in this case they want to use the
> > > largest size of (unsigned)int where there exists an atomic for
> > > manipulating 2 of them simultaneously. [The largest size is to
> > > > minimize any chance of an ABA issue occurring]. Therefore we need
> > > > 32-bit values on 32-bit and 64-bit on 64, and I suspect the best way
> > > > to guarantee this is to use pointer-sized values. If size_t is
> > > > guaranteed across all OS's to have the same size as uintptr_t it could
> > > > also be used, though.
> > >
> > > /Bruce
> > >
> >
> > Technically, size_t and uintptr_t are not guaranteed to match. In
> > practice, they won't match only on architectures that DPDK doesn't
> > intend to run on (such as 16-bit segmented archs, where size_t would
> > > be 16-bit but uintptr_t would be 32-bit).
> >
> > In all the rest of DPDK code, we use size_t for this kind of thing.
> >
> 
> Ok.
> If we do use size_t, I think we also need to add a compile-time check into the
> build too, to error out if sizeof(size_t) != sizeof(uintptr_t).

Ok, I wasn't aware of the precedent of using size_t for this purpose. I'll change it and look into adding a static assert.
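
A minimal sketch of such a check, using the C11 static_assert from <assert.h> so that it stands alone; in the DPDK tree something like RTE_BUILD_BUG_ON would probably be the more natural choice. The typedef name ring_index_t is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fail the build if size_t cannot stand in for a pointer-width index. */
static_assert(sizeof(size_t) == sizeof(uintptr_t),
	      "size_t and uintptr_t differ in width on this target");

/* Hypothetical index type for this sketch; not a DPDK typedef. */
typedef size_t ring_index_t;
```

On a target where the two types differ, compilation stops at the assertion instead of producing a ring whose indexes are narrower than intended.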

Thanks,
Gage

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 6/6] doc: add NB ring comment to EAL "known issues"
  2019-01-11  2:51   ` Varghese, Vipin
@ 2019-01-11 19:30     ` Eads, Gage
  2019-01-14  0:07       ` Varghese, Vipin
  0 siblings, 1 reply; 123+ messages in thread
From: Eads, Gage @ 2019-01-11 19:30 UTC (permalink / raw)
  To: Varghese, Vipin, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin

> Hi Gage,
> 
> Humble suggestion from my end, as per DPDK 19.02-rc1 the documentation and
> code change have to be in same patch. Can you please take a look into it.
> 

Certainly, and I'll fix this in my other docs change patch (http://mails.dpdk.org/archives/dev/2019-January/122926.html). Thanks for the heads up.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-11 19:12     ` Eads, Gage
@ 2019-01-11 19:55       ` Stephen Hemminger
  2019-01-15 15:48         ` Eads, Gage
  0 siblings, 1 reply; 123+ messages in thread
From: Stephen Hemminger @ 2019-01-11 19:55 UTC (permalink / raw)
  To: Eads, Gage
  Cc: Burakov, Anatoly, dev, olivier.matz, arybchenko, Richardson,
	Bruce, Ananyev, Konstantin

On Fri, 11 Jan 2019 19:12:40 +0000
"Eads, Gage" <gage.eads@intel.com> wrote:

> > -----Original Message-----
> > From: Burakov, Anatoly
> > Sent: Friday, January 11, 2019 4:25 AM
> > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> > size
> > 
> > On 10-Jan-19 9:01 PM, Gage Eads wrote:  
> > > For 64-bit architectures, doubling the head and tail index widths
> > > greatly increases the time it takes for them to wrap-around (with
> > > current CPU speeds, it won't happen within the author's lifetime).
> > > This is important in avoiding the ABA problem -- in which a thread
> > > mistakes reading the same tail index in two accesses to mean that the
> > > ring was not modified in the intervening time -- in the upcoming
> > > > non-blocking ring implementation. Using a 64-bit index makes the
> > > > possibility of this occurring effectively zero.
> > >
> > > I tested this commit's performance impact with an x86_64 build on a
> > > dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change
> > > > made no significant difference -- the few differences appear to be
> > > > system noise.
> > > (The test ran on isolcpus cores using a tickless scheduler, but some
> > > > variation was still observed.) Each test was run three times and the
> > > results were averaged:
> > >
> > >                                    | 64b head/tail cycle cost minus
> > >               Test                 |     32b head/tail cycle cost
> > > ------------------------------------------------------------------
> > > SP/SC single enq/dequeue          | 0.33
> > > MP/MC single enq/dequeue          | 0.00
> > > SP/SC burst enq/dequeue (size 8)  | 0.00 MP/MC burst enq/dequeue (size
> > > 8)  | 1.00 SP/SC burst enq/dequeue (size 32) | 0.00 MP/MC burst
> > > enq/dequeue (size 32) | -1.00
> > > SC empty dequeue                  | 0.01
> > > MC empty dequeue                  | 0.00
> > >
> > > Single lcore:
> > > SP/SC bulk enq/dequeue (size 8)   | -0.36
> > > MP/MC bulk enq/dequeue (size 8)   | 0.99
> > > SP/SC bulk enq/dequeue (size 32)  | -0.40 MP/MC bulk enq/dequeue (size
> > > 32)  | -0.57
> > >
> > > Two physical cores:
> > > SP/SC bulk enq/dequeue (size 8)   | -0.49
> > > MP/MC bulk enq/dequeue (size 8)   | 0.19
> > > SP/SC bulk enq/dequeue (size 32)  | -0.28 MP/MC bulk enq/dequeue (size
> > > 32)  | -0.62
> > >
> > > Two NUMA nodes:
> > > SP/SC bulk enq/dequeue (size 8)   | 3.25
> > > MP/MC bulk enq/dequeue (size 8)   | 1.87
> > > SP/SC bulk enq/dequeue (size 32)  | -0.44 MP/MC bulk enq/dequeue (size
> > > 32)  | -1.10
> > >
> > > An earlier version of this patch changed the head and tail indexes to
> > > uint64_t, but that caused a performance drop on 32-bit builds. With
> > > uintptr_t, no performance difference is observed on an i686 build.
> > >
> > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > ---  
> > 
> > You're breaking the ABI - version bump for affected libraries is needed.
> > 
> > --
> > Thanks,
> > Anatoly  
> 
> If I'm reading the versioning guidelines correctly, I'll need to gate the changes with the RTE_NEXT_ABI macro and provide a deprecation notice, then after a full deprecation cycle we can revert that and bump the library version. Not to mention the 3 ML ACKs.
> 
> I'll address this in v2.

My understanding is that the RTE_NEXT_ABI method is not used any more; it has been
replaced by rte_experimental. But this kind of change is more of a flag day event,
which means it needs to be pushed off to a release that is planned as an ABI break
(usually once a year), which would mean 19.11.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 5/6] mempool/ring: add non-blocking ring handlers
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 5/6] mempool/ring: add non-blocking ring handlers Gage Eads
@ 2019-01-13 13:43   ` Andrew Rybchenko
  0 siblings, 0 replies; 123+ messages in thread
From: Andrew Rybchenko @ 2019-01-13 13:43 UTC (permalink / raw)
  To: Gage Eads, dev; +Cc: olivier.matz, bruce.richardson, konstantin.ananyev

On 1/11/19 12:01 AM, Gage Eads wrote:
> These handlers allow an application to create a mempool based on the
> non-blocking ring, with any combination of single/multi producer/consumer.
>
> Signed-off-by: Gage Eads <gage.eads@intel.com>

Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>

Of course, it should eventually be mentioned in the release notes.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 6/6] doc: add NB ring comment to EAL "known issues"
  2019-01-11 19:30     ` Eads, Gage
@ 2019-01-14  0:07       ` Varghese, Vipin
  0 siblings, 0 replies; 123+ messages in thread
From: Varghese, Vipin @ 2019-01-14  0:07 UTC (permalink / raw)
  To: Eads, Gage, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin

Thanks for understanding

> -----Original Message-----
> From: Eads, Gage
> Sent: Saturday, January 12, 2019 1:01 AM
> To: Varghese, Vipin <vipin.varghese@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: RE: [dpdk-dev] [PATCH 6/6] doc: add NB ring comment to EAL "known
> issues"
> 
> > Hi Gage,
> >
> > Humble suggestion from my end, as per DPDK 19.02-rc1 the documentation
> > and code change have to be in same patch. Can you please take a look into
> it.
> >
> 
> Certainly, and I'll fix this in my other docs change patch
> (http://mails.dpdk.org/archives/dev/2019-January/122926.html). Thanks for
> the heads up.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-11 19:55       ` Stephen Hemminger
@ 2019-01-15 15:48         ` Eads, Gage
  0 siblings, 0 replies; 123+ messages in thread
From: Eads, Gage @ 2019-01-15 15:48 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Burakov, Anatoly, dev, olivier.matz, arybchenko, Richardson,
	Bruce, Ananyev, Konstantin



> -----Original Message-----
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Friday, January 11, 2019 1:55 PM
> To: Eads, Gage <gage.eads@intel.com>
> Cc: Burakov, Anatoly <anatoly.burakov@intel.com>; dev@dpdk.org;
> olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
> size
> 
> On Fri, 11 Jan 2019 19:12:40 +0000
> "Eads, Gage" <gage.eads@intel.com> wrote:
> 
> > > -----Original Message-----
> > > From: Burakov, Anatoly
> > > Sent: Friday, January 11, 2019 4:25 AM
> > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> > > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to
> > > pointer-width size
> > >
> > > On 10-Jan-19 9:01 PM, Gage Eads wrote:
> > > > For 64-bit architectures, doubling the head and tail index widths
> > > > greatly increases the time it takes for them to wrap-around (with
> > > > current CPU speeds, it won't happen within the author's lifetime).
> > > > This is important in avoiding the ABA problem -- in which a thread
> > > > mistakes reading the same tail index in two accesses to mean that
> > > > the ring was not modified in the intervening time -- in the
> > > > upcoming non-blocking ring implementation. Using a 64-bit index
> > > > > makes the possibility of this occurring effectively zero.
> > > >
> > > > I tested this commit's performance impact with an x86_64 build on
> > > > a dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the
> > > > change made no significant difference -- the few differences
> > > > > appear to be system noise.
> > > > > (The test ran on isolcpus cores using a tickless scheduler, but
> > > > > some variation was still observed.) Each test was run three times
> > > > > and the results were averaged:
> > > >
> > > >                                    | 64b head/tail cycle cost minus
> > > >               Test                 |     32b head/tail cycle cost
> > > > ------------------------------------------------------------------
> > > > SP/SC single enq/dequeue          | 0.33
> > > > MP/MC single enq/dequeue          | 0.00
> > > > SP/SC burst enq/dequeue (size 8)  | 0.00 MP/MC burst enq/dequeue
> > > > (size
> > > > 8)  | 1.00 SP/SC burst enq/dequeue (size 32) | 0.00 MP/MC burst
> > > > enq/dequeue (size 32) | -1.00
> > > > SC empty dequeue                  | 0.01
> > > > MC empty dequeue                  | 0.00
> > > >
> > > > Single lcore:
> > > > SP/SC bulk enq/dequeue (size 8)   | -0.36
> > > > MP/MC bulk enq/dequeue (size 8)   | 0.99
> > > > SP/SC bulk enq/dequeue (size 32)  | -0.40 MP/MC bulk enq/dequeue
> > > > (size
> > > > 32)  | -0.57
> > > >
> > > > Two physical cores:
> > > > SP/SC bulk enq/dequeue (size 8)   | -0.49
> > > > MP/MC bulk enq/dequeue (size 8)   | 0.19
> > > > SP/SC bulk enq/dequeue (size 32)  | -0.28 MP/MC bulk enq/dequeue
> > > > (size
> > > > 32)  | -0.62
> > > >
> > > > Two NUMA nodes:
> > > > SP/SC bulk enq/dequeue (size 8)   | 3.25
> > > > MP/MC bulk enq/dequeue (size 8)   | 1.87
> > > > SP/SC bulk enq/dequeue (size 32)  | -0.44 MP/MC bulk enq/dequeue
> > > > (size
> > > > 32)  | -1.10
> > > >
> > > > An earlier version of this patch changed the head and tail indexes
> > > > to uint64_t, but that caused a performance drop on 32-bit builds.
> > > > With uintptr_t, no performance difference is observed on an i686 build.
> > > >
> > > > Signed-off-by: Gage Eads <gage.eads@intel.com>
> > > > ---
> > >
> > > You're breaking the ABI - version bump for affected libraries is needed.
> > >
> > > --
> > > Thanks,
> > > Anatoly
> >
> > If I'm reading the versioning guidelines correctly, I'll need to gate the
> > changes with the RTE_NEXT_ABI macro and provide a deprecation notice, then
> > after a full deprecation cycle we can revert that and bump the library
> > version. Not to mention the 3 ML ACKs.
> >
> > I'll address this in v2.
> 
> My understanding is that the RTE_NEXT_ABI method is not used any more; it was
> replaced by rte_experimental.
> But this kind of change is more of a flag day event, which means it needs to be
> pushed off to a release that is planned as an ABI break (usually once a year),
> which would mean 19.11.

In recent release notes, I see ABI changes can happen more frequently than once per year; 18.11, 18.05, 17.11, and 17.08 have ABI changes -- and soon 19.02 will too.

At any rate, I'll create a separate deprecation notice patch and update this patchset accordingly.

Thanks,
Gage

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring
  2019-01-10 21:01 [dpdk-dev] [PATCH 0/6] Add non-blocking ring Gage Eads
                   ` (5 preceding siblings ...)
  2019-01-10 21:01 ` [dpdk-dev] [PATCH 6/6] doc: add NB ring comment to EAL "known issues" Gage Eads
@ 2019-01-15 23:52 ` Gage Eads
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 1/5] ring: change head and tail to pointer-width size Gage Eads
                     ` (6 more replies)
  6 siblings, 7 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-15 23:52 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking ring, on top of which a mempool can run.
Crucially, the non-blocking algorithm relies on a 128-bit compare-and-swap, so
it is limited to x86_64 machines.

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations and a
dequeue of n pointers uses 2. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the non-blocking ring's.
- Software caching can mitigate the average-case performance penalty for
  ring-based algorithms; for example, a non-blocking ring based mempool (a
  likely use case for this ring) with per-thread caching.

The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-of-use,
existing ring enqueue/dequeue functions work with both "regular" and
non-blocking rings.

This patchset also adds non-blocking versions of ring_autotest and
ring_perf_autotest, and a non-blocking ring based mempool.

This patchset makes ABI and API changes; a deprecation notice will be
posted in a separate commit.

This patchset depends on the non-blocking stack patchset[1].

[1] http://mails.dpdk.org/archives/dev/2019-January/123470.html

v2:
 - Merge separate docs commit into patch #5
 - Convert uintptr_t to size_t
 - Add a compile-time check for the size of size_t
 - Fix a space-after-typecast issue
 - Fix an unnecessary-parentheses checkpatch warning
 - Bump librte_ring's library version

Gage Eads (5):
  ring: change head and tail to pointer-width size
  ring: add a non-blocking implementation
  test_ring: add non-blocking ring autotest
  test_ring_perf: add non-blocking ring perf test
  mempool/ring: add non-blocking ring handlers

 doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
 drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
 lib/librte_eventdev/rte_event_ring.h            |   6 +-
 lib/librte_ring/Makefile                        |   2 +-
 lib/librte_ring/meson.build                     |   2 +-
 lib/librte_ring/rte_ring.c                      |  53 ++-
 lib/librte_ring/rte_ring.h                      | 564 ++++++++++++++++++++++--
 lib/librte_ring/rte_ring_generic.h              |  16 +-
 lib/librte_ring/rte_ring_version.map            |   7 +
 test/test/test_ring.c                           |  57 ++-
 test/test/test_ring_perf.c                      |  19 +-
 11 files changed, 699 insertions(+), 87 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v2 1/5] ring: change head and tail to pointer-width size
  2019-01-15 23:52 ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Gage Eads
@ 2019-01-15 23:52   ` Gage Eads
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 2/5] ring: add a non-blocking implementation Gage Eads
                     ` (5 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-15 23:52 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

For 64-bit architectures, doubling the head and tail index widths greatly
increases the time it takes for them to wrap-around (with current CPU
speeds, it won't happen within the author's lifetime). This is important in
avoiding the ABA problem -- in which a thread mistakes reading the same
tail index in two accesses to mean that the ring was not modified in the
intervening time -- in the upcoming non-blocking ring implementation. Using
a 64-bit index makes the possibility of this occurring effectively zero.
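
As a rough worked example of the wrap-around claim, the sketch below computes
how long an index takes to wrap. The rate of one billion index updates per
second is an assumption chosen for illustration, not a figure measured for
this patch, and seconds_to_wrap() and print_wrap_times() are hypothetical
helper names.

```c
#include <stdio.h>

/* Hypothetical helper: seconds until an index of the given bit width wraps,
 * assuming it is incremented 'updates_per_sec' times per second.
 */
static double
seconds_to_wrap(unsigned int bits, double updates_per_sec)
{
	double states = 1.0;
	unsigned int i;

	/* 2^bits distinct index values before the counter wraps */
	for (i = 0; i < bits; i++)
		states *= 2.0;

	return states / updates_per_sec;
}

/* Print the wrap times for 32-bit and 64-bit indexes at the assumed rate. */
static void
print_wrap_times(void)
{
	const double rate = 1e9; /* assumed: 1 billion updates per second */
	const double secs_per_year = 365.25 * 24 * 60 * 60;

	printf("32-bit: %.2f seconds\n", seconds_to_wrap(32, rate));
	printf("64-bit: %.1f years\n",
	       seconds_to_wrap(64, rate) / secs_per_year);
}
```

Under that assumed rate, a 32-bit index wraps after roughly 4.3 seconds,
while a 64-bit index takes on the order of 584 years.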

I tested this commit's performance impact with an x86_64 build on a
dual-socket Xeon E5-2699 v4 using ring_perf_autotest, and the change made
no significant difference -- the few differences appear to be system noise.
(The test ran on isolcpus cores using a tickless scheduler, but some
variation was still observed.) Each test was run three times and the results
were averaged:

                                  | 64b head/tail cycle cost minus
             Test                 |     32b head/tail cycle cost
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | 0.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 1.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | -1.00
SC empty dequeue                  | 0.01
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | -0.36
MP/MC bulk enq/dequeue (size 8)   | 0.99
SP/SC bulk enq/dequeue (size 32)  | -0.40
MP/MC bulk enq/dequeue (size 32)  | -0.57

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | -0.49
MP/MC bulk enq/dequeue (size 8)   | 0.19
SP/SC bulk enq/dequeue (size 32)  | -0.28
MP/MC bulk enq/dequeue (size 32)  | -0.62

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | 3.25
MP/MC bulk enq/dequeue (size 8)   | 1.87
SP/SC bulk enq/dequeue (size 32)  | -0.44
MP/MC bulk enq/dequeue (size 32)  | -1.10

An earlier version of this patch changed the head and tail indexes to
uint64_t, but that caused a performance drop on 32-bit builds. With
size_t, no performance difference is observed on an i686 build.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_eventdev/rte_event_ring.h |  6 +++---
 lib/librte_ring/Makefile             |  2 +-
 lib/librte_ring/meson.build          |  2 +-
 lib/librte_ring/rte_ring.c           | 10 +++++-----
 lib/librte_ring/rte_ring.h           | 29 ++++++++++++++++++-----------
 lib/librte_ring/rte_ring_generic.h   | 16 +++++++++-------
 6 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h
index 827a3209e..da8886c08 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2016-2017 Intel Corporation
+ * Copyright(c) 2016-2019 Intel Corporation
  */
 
 /**
@@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
 		const struct rte_event *events,
 		unsigned int n, uint16_t *free_space)
 {
-	uint32_t prod_head, prod_next;
+	size_t prod_head, prod_next;
 	uint32_t free_entries;
 
 	n = __rte_ring_move_prod_head(&r->r, r->r.prod.single, n,
@@ -129,7 +129,7 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
 		struct rte_event *events,
 		unsigned int n, uint16_t *available)
 {
-	uint32_t cons_head, cons_next;
+	size_t cons_head, cons_next;
 	uint32_t entries;
 
 	n = __rte_ring_move_cons_head(&r->r, r->r.cons.single, n,
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 21a36770d..c106f9908 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -11,7 +11,7 @@ LDLIBS += -lrte_eal
 
 EXPORT_MAP := rte_ring_version.map
 
-LIBABIVER := 2
+LIBABIVER := 3
 
 # all source are stored in SRCS-y
 SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index ab8b0b469..3603fdb7a 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -1,7 +1,7 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
-version = 2
+version = 3
 sources = files('rte_ring.c')
 headers = files('rte_ring.h',
 		'rte_ring_c11_mem.h',
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d215acecc..b15ee0eb3 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2015 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007,2008 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -227,10 +227,10 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
 	fprintf(f, "  flags=%x\n", r->flags);
 	fprintf(f, "  size=%"PRIu32"\n", r->size);
 	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
-	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
-	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
-	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
-	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	fprintf(f, "  ct=%zu\n", r->cons.tail);
+	fprintf(f, "  ch=%zu\n", r->cons.head);
+	fprintf(f, "  pt=%zu\n", r->prod.tail);
+	fprintf(f, "  ph=%zu\n", r->prod.head);
 	fprintf(f, "  used=%u\n", rte_ring_count(r));
 	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
 }
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index af5444a9f..213c50708 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -63,11 +63,18 @@ enum rte_ring_queue_behavior {
 
 struct rte_memzone; /* forward declaration, so as not to require memzone.h */
 
+/* Verify that size_t is the same as uintptr_t, which rte_ring (among other
+ * components) assumes.
+ */
+#if UINTPTR_MAX != SIZE_MAX
+#error "DPDK requires sizeof(size_t) == sizeof(uintptr_t)"
+#endif
+
 /* structure to hold a pair of head/tail values and other metadata */
 struct rte_ring_headtail {
-	volatile uint32_t head;  /**< Prod/consumer head. */
-	volatile uint32_t tail;  /**< Prod/consumer tail. */
-	uint32_t single;         /**< True if single prod/cons */
+	volatile size_t head;  /**< Prod/consumer head. */
+	volatile size_t tail;  /**< Prod/consumer tail. */
+	uint32_t single;       /**< True if single prod/cons */
 };
 
 /**
@@ -242,7 +249,7 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #define ENQUEUE_PTRS(r, ring_start, prod_head, obj_table, n, obj_type) do { \
 	unsigned int i; \
 	const uint32_t size = (r)->size; \
-	uint32_t idx = prod_head & (r)->mask; \
+	size_t idx = prod_head & (r)->mask; \
 	obj_type *ring = (obj_type *)ring_start; \
 	if (likely(idx + n < size)) { \
 		for (i = 0; i < (n & ((~(unsigned)0x3))); i+=4, idx+=4) { \
@@ -272,7 +279,7 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
  * single and multi consumer dequeue functions */
 #define DEQUEUE_PTRS(r, ring_start, cons_head, obj_table, n, obj_type) do { \
 	unsigned int i; \
-	uint32_t idx = cons_head & (r)->mask; \
+	size_t idx = cons_head & (r)->mask; \
 	const uint32_t size = (r)->size; \
 	obj_type *ring = (obj_type *)ring_start; \
 	if (likely(idx + n < size)) { \
@@ -338,7 +345,7 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
 		 unsigned int n, enum rte_ring_queue_behavior behavior,
 		 unsigned int is_sp, unsigned int *free_space)
 {
-	uint32_t prod_head, prod_next;
+	size_t prod_head, prod_next;
 	uint32_t free_entries;
 
 	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
@@ -380,7 +387,7 @@ __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
 		 unsigned int n, enum rte_ring_queue_behavior behavior,
 		 unsigned int is_sc, unsigned int *available)
 {
-	uint32_t cons_head, cons_next;
+	size_t cons_head, cons_next;
 	uint32_t entries;
 
 	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
@@ -681,9 +688,9 @@ rte_ring_dequeue(struct rte_ring *r, void **obj_p)
 static inline unsigned
 rte_ring_count(const struct rte_ring *r)
 {
-	uint32_t prod_tail = r->prod.tail;
-	uint32_t cons_tail = r->cons.tail;
-	uint32_t count = (prod_tail - cons_tail) & r->mask;
+	size_t prod_tail = r->prod.tail;
+	size_t cons_tail = r->cons.tail;
+	size_t count = (prod_tail - cons_tail) & r->mask;
 	return (count > r->capacity) ? r->capacity : count;
 }
 
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..31e38adcc 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -11,7 +11,7 @@
 #define _RTE_RING_GENERIC_H_
 
 static __rte_always_inline void
-update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
+update_tail(struct rte_ring_headtail *ht, size_t old_val, size_t new_val,
 		uint32_t single, uint32_t enqueue)
 {
 	if (enqueue)
@@ -55,7 +55,7 @@ update_tail(struct rte_ring_headtail *ht, uint32_t old_val, uint32_t new_val,
 static __rte_always_inline unsigned int
 __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		unsigned int n, enum rte_ring_queue_behavior behavior,
-		uint32_t *old_head, uint32_t *new_head,
+		size_t *old_head, size_t *new_head,
 		uint32_t *free_entries)
 {
 	const uint32_t capacity = r->capacity;
@@ -93,7 +93,8 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 		if (is_sp)
 			r->prod.head = *new_head, success = 1;
 		else
-			success = rte_atomic32_cmpset(&r->prod.head,
+			/* Built-in used to handle variable-sized head index. */
+			success = __sync_bool_compare_and_swap(&r->prod.head,
 					*old_head, *new_head);
 	} while (unlikely(success == 0));
 	return n;
@@ -125,7 +126,7 @@ __rte_ring_move_prod_head(struct rte_ring *r, unsigned int is_sp,
 static __rte_always_inline unsigned int
 __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 		unsigned int n, enum rte_ring_queue_behavior behavior,
-		uint32_t *old_head, uint32_t *new_head,
+		size_t *old_head, size_t *new_head,
 		uint32_t *entries)
 {
 	unsigned int max = n;
@@ -161,8 +162,9 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 		if (is_sc)
 			r->cons.head = *new_head, success = 1;
 		else
-			success = rte_atomic32_cmpset(&r->cons.head, *old_head,
-					*new_head);
+			/* Built-in used to handle variable-sized head index. */
+			success = __sync_bool_compare_and_swap(&r->cons.head,
+					*old_head, *new_head);
 	} while (unlikely(success == 0));
 	return n;
 }
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v2 2/5] ring: add a non-blocking implementation
  2019-01-15 23:52 ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Gage Eads
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 1/5] ring: change head and tail to pointer-width size Gage Eads
@ 2019-01-15 23:52   ` Gage Eads
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 3/5] test_ring: add non-blocking ring autotest Gage Eads
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-15 23:52 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

This commit adds support for non-blocking circular ring enqueue and dequeue
functions. The ring uses a 128-bit compare-and-swap instruction, and thus
is limited to x86_64.

The algorithm is based on the original rte ring (derived from FreeBSD's
bufring.h) and inspired by Michael and Scott's non-blocking concurrent
queue. Importantly, it adds a modification counter to each ring entry to
ensure only one thread can write to an unused entry.

-----
Algorithm:

Multi-producer non-blocking enqueue:
1. Move the producer head index 'n' locations forward, effectively
   reserving 'n' locations.
2. For each pointer:
 a. Read the producer tail index, then ring[tail]. If ring[tail]'s
    modification counter isn't 'tail', retry.
 b. Construct the new entry: {pointer, tail + ring size}
 c. Compare-and-swap the old entry with the new. If unsuccessful, the
    next loop iteration will try to enqueue this pointer again.
 d. Compare-and-swap the tail index with 'tail + 1', whether or not step 2c
    succeeded. This guarantees threads can make forward progress.

Multi-consumer non-blocking dequeue:
1. Move the consumer head index 'n' locations forward, effectively
   reserving 'n' pointers to be dequeued.
2. Copy 'n' pointers into the caller's object table (ignoring the
   modification counter), starting from ring[tail], then compare-and-swap
   the tail index with 'tail + n'.  If unsuccessful, repeat step 2.
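
The steps above can be sketched in a scaled-down form. This is a minimal
illustration, not the code in this patch: it assumes a 32-bit payload and a
32-bit modification counter packed into one 64-bit word, so that a plain C11
64-bit compare-and-swap stands in for the 128-bit one, and it moves a single
value per call (n = 1). All nb_* names are hypothetical. With a 32-bit
counter the ABA window is far larger than with the 64-bit counter described
here.

```c
#include <stdatomic.h>
#include <stdint.h>

#define NB_SIZE 8u               /* ring size, must be a power of two */
#define NB_MASK (NB_SIZE - 1u)

/* Scaled-down entry: high 32 bits = modification counter, low 32 = payload.
 * The patch itself uses a {pointer, 64-bit counter} pair and a 128-bit CAS.
 */
static _Atomic uint64_t nb_ring[NB_SIZE];
static _Atomic uint32_t nb_prod_head, nb_prod_tail;
static _Atomic uint32_t nb_cons_head, nb_cons_tail;

static uint64_t
nb_pack(uint32_t val, uint32_t cnt)
{
	return ((uint64_t)cnt << 32) | val;
}

/* Entry i starts with counter i, so the first enqueue at tail == i matches. */
static void
nb_init(void)
{
	uint32_t i;

	for (i = 0; i < NB_SIZE; i++)
		atomic_store(&nb_ring[i], nb_pack(0, i));
	atomic_store(&nb_prod_head, 0);
	atomic_store(&nb_prod_tail, 0);
	atomic_store(&nb_cons_head, 0);
	atomic_store(&nb_cons_tail, 0);
}

/* Enqueue one value; returns 0 on success, -1 if the ring is full. */
static int
nb_enqueue(uint32_t val)
{
	uint32_t head = atomic_load(&nb_prod_head);
	uint32_t tail;
	uint64_t old, fresh;
	int done;

	/* Step 1: reserve one location by moving the producer head. */
	do {
		if (head - atomic_load(&nb_cons_tail) >= NB_SIZE)
			return -1;
	} while (!atomic_compare_exchange_weak(&nb_prod_head, &head, head + 1));

	for (;;) {
		/* Step 2a: read the tail, then the entry it refers to. */
		tail = atomic_load(&nb_prod_tail);
		old = atomic_load(&nb_ring[tail & NB_MASK]);
		if ((uint32_t)(old >> 32) != tail)
			continue; /* counter mismatch: retry with a new tail */

		/* Step 2b: the new entry carries 'tail + ring size'. */
		fresh = nb_pack(val, tail + NB_SIZE);

		/* Step 2c: only one thread can swap in its entry. */
		done = atomic_compare_exchange_strong(&nb_ring[tail & NB_MASK],
						      &old, fresh);

		/* Step 2d: advance the tail whether or not 2c succeeded. */
		atomic_compare_exchange_strong(&nb_prod_tail, &tail, tail + 1);

		if (done)
			return 0;
	}
}

/* Dequeue one value; returns 0 on success, -1 if the ring is empty. */
static int
nb_dequeue(uint32_t *val)
{
	uint32_t head = atomic_load(&nb_cons_head);
	uint32_t tail;

	/* Step 1: reserve one entry by moving the consumer head. */
	do {
		if (atomic_load(&nb_prod_tail) == head)
			return -1;
	} while (!atomic_compare_exchange_weak(&nb_cons_head, &head, head + 1));

	/* Step 2: copy the payload (ignoring the counter), then move the
	 * consumer tail; on failure 'tail' is reloaded and the copy repeats.
	 */
	tail = atomic_load(&nb_cons_tail);
	do {
		*val = (uint32_t)atomic_load(&nb_ring[tail & NB_MASK]);
	} while (!atomic_compare_exchange_weak(&nb_cons_tail, &tail, tail + 1));

	return 0;
}
```

In this model, once an entry has been written at index 'tail' its counter
becomes 'tail + NB_SIZE', which is exactly the tail value at which that slot
is next due to be written, so a stale enqueue attempt fails its entry CAS
instead of overwriting valid data.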

-----
Discussion:

There are two cases where the ABA problem is mitigated:
1. Enqueueing a pointer to the ring: without a modification counter
   tied to the tail index, the index could become stale by the time the
   enqueue happens, causing it to overwrite valid data. Tying the
   counter to the tail index gives us an expected value (as opposed to,
   say, a monotonically incrementing counter).

   Since the counter will eventually wrap, there is potential for the ABA
   problem. However, using a 64-bit counter makes this likelihood
   effectively zero.

2. Updating a tail index: the ABA problem can occur if the thread is
   preempted and the tail index wraps around. However, using 64-bit indexes
   makes this likelihood effectively zero.

With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations
and a dequeue of n pointers uses 2. This algorithm has worse average-case
performance than the regular rte ring (particularly a highly-contended ring
with large bulk accesses), however:
- For applications with preemptible pthreads, the regular rte ring's
  worst-case performance (i.e. one thread being preempted in the
  update_tail() critical section) is much worse than the non-blocking
  ring's.
- Software caching can mitigate the average-case performance penalty for
  ring-based algorithms; for example, a non-blocking ring based mempool (a
  likely use case for this ring) with per-thread caching.
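
To make the uncontended CAS counts concrete, here is a small sketch that
follows the algorithm description above (one head move, plus an entry CAS and
a tail CAS per enqueued pointer). The function names are hypothetical and the
counts apply only to the no-contention case.

```c
/* Uncontended CAS count for a non-blocking enqueue of n pointers:
 * 1 head move + (1 entry CAS + 1 tail CAS) per pointer.
 */
static unsigned int
nb_enqueue_cas_count(unsigned int n)
{
	return 1 + 2 * n;
}

/* Uncontended CAS count for a non-blocking dequeue of n pointers:
 * 1 head move + 1 tail move, independent of n.
 */
static unsigned int
nb_dequeue_cas_count(unsigned int n)
{
	(void)n;
	return 2;
}
```

For a bulk size of 32, this gives 65 CAS operations for the enqueue and 2 for
the dequeue, which is why large bulk accesses are where the gap to the
regular ring is widest.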

The non-blocking ring is enabled via a new flag, RING_F_NB. Because the
ring's memsize is now a function of its flags (the non-blocking ring
requires 128b for each entry), this commit adds a new argument ('flags') to
rte_ring_get_memsize().
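
As an illustration of why the memsize depends on the flags, the sketch below
models only the per-entry storage. It deliberately leaves out the ring header
and the cache-line alignment that rte_ring_get_memsize() also accounts for;
MODEL_RING_F_NB and model_ring_entries_size() are hypothetical names used
only for this example.

```c
#include <stddef.h>

#define MODEL_RING_F_NB 0x0008u /* same value as RING_F_NB in this patch */

/* Hypothetical model of the entry storage: a regular ring stores one
 * pointer per entry; a non-blocking ring stores a pointer plus a
 * pointer-width modification counter.
 */
static size_t
model_ring_entries_size(unsigned int count, unsigned int flags)
{
	size_t elt_sz = (flags & MODEL_RING_F_NB) ?
			2 * sizeof(void *) : sizeof(void *);

	return (size_t)count * elt_sz;
}
```

On a platform with 8-byte pointers, a 1024-entry ring therefore needs 8192
bytes of entry storage as a regular ring and 16384 bytes as a non-blocking
ring.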

For ease-of-use, existing ring enqueue and dequeue functions work on both
regular and non-blocking rings. This introduces an additional branch in
the datapath, but this should be a highly predictable branch.
ring_perf_autotest shows a negligible performance impact; it's hard to
distinguish a real difference versus system noise.

                                  | ring_perf_autotest cycles with branch -
             Test                 |   ring_perf_autotest cycles without
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | -4.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 0.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | 0.00
SC empty dequeue                  | 1.00
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | 0.49
MP/MC bulk enq/dequeue (size 8)   | 0.08
SP/SC bulk enq/dequeue (size 32)  | 0.07
MP/MC bulk enq/dequeue (size 32)  | 0.09

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | 0.19
MP/MC bulk enq/dequeue (size 8)   | -0.37
SP/SC bulk enq/dequeue (size 32)  | 0.09
MP/MC bulk enq/dequeue (size 32)  | -0.05

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | -1.96
MP/MC bulk enq/dequeue (size 8)   | 0.88
SP/SC bulk enq/dequeue (size 32)  | 0.10
MP/MC bulk enq/dequeue (size 32)  | 0.46

Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. Each test was run three
times and the results were averaged.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.c           |  43 ++-
 lib/librte_ring/rte_ring.h           | 535 +++++++++++++++++++++++++++++++++--
 lib/librte_ring/rte_ring_version.map |   7 +
 3 files changed, 554 insertions(+), 31 deletions(-)

diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index b15ee0eb3..783c96568 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -45,9 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags)
 {
-	ssize_t sz;
+	ssize_t sz, elt_sz;
 
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
@@ -57,10 +57,23 @@ rte_ring_get_memsize(unsigned count)
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	elt_sz = (flags & RING_F_NB) ? 2 * sizeof(void *) : sizeof(void *);
+
+	sz = sizeof(struct rte_ring) + count * elt_sz;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
+BIND_DEFAULT_SYMBOL(rte_ring_get_memsize, _v1905, 19.05);
+MAP_STATIC_SYMBOL(ssize_t rte_ring_get_memsize(unsigned int count,
+					       unsigned int flags),
+		  rte_ring_get_memsize_v1905);
+
+ssize_t
+rte_ring_get_memsize_v20(unsigned int count)
+{
+	return rte_ring_get_memsize_v1905(count, 0);
+}
+VERSION_SYMBOL(rte_ring_get_memsize, _v20, 2.0);
 
 int
 rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
@@ -103,6 +116,20 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	r->prod.head = r->cons.head = 0;
 	r->prod.tail = r->cons.tail = 0;
 
+	if (flags & RING_F_NB) {
+		uint64_t i;
+
+		for (i = 0; i < r->size; i++) {
+			struct nb_ring_entry *ring_ptr, *base;
+
+			base = ((struct nb_ring_entry *)&r[1]);
+
+			ring_ptr = &base[i & r->mask];
+
+			ring_ptr->cnt = i;
+		}
+	}
+
 	return 0;
 }
 
@@ -123,11 +150,19 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 
 	ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
 
+#if !defined(RTE_ARCH_X86_64)
+	if (flags & RING_F_NB) {
+		printf("RING_F_NB is only supported on x86-64 platforms\n");
+		rte_errno = EINVAL;
+		return NULL;
+	}
+#endif
+
 	/* for an exact size ring, round up from count to a power of two */
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize(count, flags);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 213c50708..0648b09fb 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -124,6 +124,18 @@ struct rte_ring {
  */
 #define RING_F_EXACT_SZ 0x0004
 #define RTE_RING_SZ_MASK  (0x7fffffffU) /**< Ring size mask */
+/**
+ * The ring uses non-blocking enqueue and dequeue functions. These functions
+ * do not have the "non-preemptive" constraint of a regular rte ring, and thus
+ * are suited for applications using preemptible pthreads. However, the
+ * non-blocking functions have worse average-case performance than their
+ * regular rte ring counterparts. When used as the handler for a mempool,
+ * per-thread caching can mitigate the performance difference by reducing the
+ * number (and contention) of ring accesses.
+ *
+ * This flag is only supported on x86_64 platforms.
+ */
+#define RING_F_NB 0x0008
 
 /* @internal defines for passing to the enqueue dequeue worker functions */
 #define __IS_SP 1
@@ -141,11 +153,15 @@ struct rte_ring {
  *
  * @param count
  *   The number of elements in the ring (must be a power of 2).
+ * @param flags
+ *   The flags the ring will be created with.
  * @return
  *   - The memory size needed for the ring on success.
  *   - -EINVAL if count is not a power of 2.
  */
-ssize_t rte_ring_get_memsize(unsigned count);
+ssize_t rte_ring_get_memsize(unsigned int count, unsigned int flags);
+ssize_t rte_ring_get_memsize_v20(unsigned int count);
+ssize_t rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags);
 
 /**
  * Initialize a ring structure.
@@ -178,6 +194,10 @@ ssize_t rte_ring_get_memsize(unsigned count);
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
+ *      non-blocking variants of the dequeue and enqueue functions.
  * @return
  *   0 on success, or a negative value on error.
  */
@@ -213,12 +233,17 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
+ *      non-blocking variants of the dequeue and enqueue functions.
  * @return
  *   On success, the pointer to the new allocated ring. NULL on error with
  *    rte_errno set appropriately. Possible errno values include:
  *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
  *    - E_RTE_SECONDARY - function was called from a secondary process instance
- *    - EINVAL - count provided is not a power of 2
+ *    - EINVAL - count provided is not a power of 2, or RING_F_NB is used on an
+ *      unsupported platform
  *    - ENOSPC - the maximum number of memzones has already been allocated
  *    - EEXIST - a memzone with the same name already exists
  *    - ENOMEM - no appropriate memory area found in which to create memzone
@@ -274,6 +299,50 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual enqueue of pointers on the ring.
+ * Used only by the single-producer non-blocking enqueue function, but
+ * out-lined here for code readability.
+ */
+#define ENQUEUE_PTRS_NB(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	size_t idx = prod_head & (r)->mask; \
+	size_t new_cnt = prod_head + size; \
+	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
+			ring[idx].ptr = obj_table[i]; \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx + 1].ptr = obj_table[i + 1]; \
+			ring[idx + 1].cnt = new_cnt + i + 1;  \
+			ring[idx + 2].ptr = obj_table[i + 2]; \
+			ring[idx + 2].cnt = new_cnt + i + 2;  \
+			ring[idx + 3].ptr = obj_table[i + 3]; \
+			ring[idx + 3].cnt = new_cnt + i + 3;  \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) { \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+		for (idx = 0; i < n; i++, idx++) {    \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+	} \
+} while (0)
+
 /* the actual copy of pointers on the ring to obj_table.
  * Placed here since identical code needed in both
  * single and multi consumer dequeue functions */
@@ -305,6 +374,39 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual copy of pointers on the ring to obj_table.
+ * Placed here since identical code needed in both
+ * single and multi consumer non-blocking dequeue functions.
+ */
+#define DEQUEUE_PTRS_NB(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	size_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
+			obj_table[i] = ring[idx].ptr; \
+			obj_table[i + 1] = ring[idx + 1].ptr; \
+			obj_table[i + 2] = ring[idx + 2].ptr; \
+			obj_table[i + 3] = ring[idx + 3].ptr; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 2: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 1: \
+			obj_table[i++] = ring[idx++].ptr; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+	} \
+} while (0)
+
+
 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
  * There are 2 choices for the users
@@ -320,6 +422,314 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #include "rte_ring_generic.h"
 #endif
 
+/* @internal 128-bit structure used by the non-blocking ring */
+struct nb_ring_entry {
+	void *ptr; /**< Data pointer */
+	uint64_t cnt; /**< Modification counter */
+};
+
+/* The non-blocking ring algorithm is based on the original rte ring (derived
+ * from FreeBSD's bufring.h) and inspired by Michael and Scott's non-blocking
+ * concurrent queue.
+ */
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of free space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	size_t head, next;
+
+	n = __rte_ring_move_prod_head(r, 1, n, behavior,
+				      &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	r->prod.tail += n;
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of free space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#ifdef RTE_ARCH_X86_64
+	size_t head, next, tail;
+	uint32_t free_entries;
+	unsigned int i;
+
+	n = __rte_ring_move_prod_head(r, 0, n, behavior,
+				      &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
+		struct nb_ring_entry old_value, new_value;
+		struct nb_ring_entry *ring_ptr;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		tail = r->prod.tail;
+
+		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r->mask];
+
+		old_value = *ring_ptr;
+
+		/* If the tail entry's modification counter doesn't match the
+		 * producer tail index, it's already been updated.
+		 */
+		if (old_value.cnt != tail)
+			continue;
+
+		/* Prepare the new entry. The cnt field mitigates the ABA
+		 * problem on the ring write.
+		 */
+		new_value.ptr = obj_table[i];
+		new_value.cnt = tail + r->size;
+
+		if (rte_atomic128_cmpset((volatile void *)ring_ptr,
+					 (uint64_t *)&old_value,
+					 (uint64_t *)&new_value))
+			i++;
+
+		/* Every thread attempts the cmpset, so they don't have to wait
+		 * for the thread that successfully enqueued to the ring.
+		 * Using a 64-bit tail mitigates the ABA problem here.
+		 *
+		 * Built-in used to handle variable-sized tail index.
+		 */
+		__sync_bool_compare_and_swap(&r->prod.tail, tail, tail + 1);
+	}
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+#else
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+	return 0;
+#endif
+}
+
+/**
+ * @internal Enqueue several objects on the non-blocking ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of free space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue(struct rte_ring *r, void * const *obj_table,
+			 unsigned int n, enum rte_ring_queue_behavior behavior,
+			 unsigned int is_sp, unsigned int *free_space)
+{
+	if (is_sp)
+		return __rte_ring_do_nb_enqueue_sp(r, obj_table, n,
+						   behavior, free_space);
+	else
+		return __rte_ring_do_nb_enqueue_mp(r, obj_table, n,
+						   behavior, free_space);
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	size_t head, next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, 1, n, behavior,
+				      &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	r->cons.tail += n;
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	size_t head, next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, 0, n, behavior,
+				      &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	while (1) {
+		size_t tail = r->cons.tail;
+
+		/* Dequeue from the cons tail onwards. If multiple threads read
+		 * the same pointers, the thread that successfully performs the
+		 * CAS will keep them and the other(s) will retry.
+		 */
+		DEQUEUE_PTRS_NB(r, &r[1], tail, obj_table, n);
+
+		next = tail + n;
+
+		/* Built-in used to handle variable-sized tail index. */
+		if (__sync_bool_compare_and_swap(&r->cons.tail, tail, next)) {
+			/* There is potential for the ABA problem here, but
+			 * that is mitigated by the large (64-bit) tail.
+			 */
+			break;
+		}
+	}
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * @internal Dequeue several objects from the non-blocking ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue(struct rte_ring *r, void **obj_table,
+		 unsigned int n, enum rte_ring_queue_behavior behavior,
+		 unsigned int is_sc, unsigned int *available)
+{
+	if (is_sc)
+		return __rte_ring_do_nb_dequeue_sc(r, obj_table, n,
+						   behavior, available);
+	else
+		return __rte_ring_do_nb_dequeue_mc(r, obj_table, n,
+						   behavior, available);
+}
+
 /**
  * @internal Enqueue several objects on the ring
  *
@@ -427,8 +837,14 @@ static __rte_always_inline unsigned int
 rte_ring_mp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MP,
+					     free_space);
 }
 
 /**
@@ -450,8 +866,14 @@ static __rte_always_inline unsigned int
 rte_ring_sp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SP,
+					     free_space);
 }
 
 /**
@@ -477,8 +899,14 @@ static __rte_always_inline unsigned int
 rte_ring_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->prod.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -561,8 +989,14 @@ static __rte_always_inline unsigned int
 rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MC,
+					     available);
 }
 
 /**
@@ -585,8 +1019,14 @@ static __rte_always_inline unsigned int
 rte_ring_sc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SC,
+					     available);
 }
 
 /**
@@ -612,8 +1052,14 @@ static __rte_always_inline unsigned int
 rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
 		unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-				r->cons.single, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->cons.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->cons.single, available);
 }
 
 /**
@@ -810,8 +1256,14 @@ static __rte_always_inline unsigned
 rte_ring_mp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MP, free_space);
 }
 
 /**
@@ -833,8 +1285,14 @@ static __rte_always_inline unsigned
 rte_ring_sp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SP, free_space);
 }
 
 /**
@@ -860,8 +1318,14 @@ static __rte_always_inline unsigned
 rte_ring_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_VARIABLE,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->prod.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -888,8 +1352,14 @@ static __rte_always_inline unsigned
 rte_ring_mc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MC, available);
 }
 
 /**
@@ -913,8 +1383,14 @@ static __rte_always_inline unsigned
 rte_ring_sc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SC, available);
 }
 
 /**
@@ -940,9 +1416,14 @@ static __rte_always_inline unsigned
 rte_ring_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-				RTE_RING_QUEUE_VARIABLE,
-				r->cons.single, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->cons.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->cons.single, available);
 }
 
 #ifdef __cplusplus
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index d935efd0d..8969467af 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -17,3 +17,10 @@ DPDK_2.2 {
 	rte_ring_free;
 
 } DPDK_2.0;
+
+DPDK_19.05 {
+	global:
+
+	rte_ring_get_memsize;
+
+} DPDK_2.2;
-- 
2.13.6
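
For readers following the multi-producer enqueue in the patch above, the
per-slot modification counter can be illustrated with a minimal,
single-threaded sketch. This is not the patch's code: toy_ring, entry_cas and
toy_enqueue_at are hypothetical names, the compare-and-swap is a plain
non-atomic stand-in for rte_atomic128_cmpset, and the initial state (slot i
starts with cnt == i) is an assumption inferred from the
"old_value.cnt != tail" check.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SIZE 4u                 /* must be a power of two */
#define RING_MASK (RING_SIZE - 1)

/* Same shape as the patch's nb_ring_entry: pointer plus counter. */
struct entry {
	void *ptr;
	uint64_t cnt;
};

struct toy_ring {
	struct entry slots[RING_SIZE];
	uint64_t prod_tail;
};

/* Single-threaded stand-in for the 128-bit compare-and-swap (NOT atomic). */
static int
entry_cas(struct entry *dst, const struct entry *exp, const struct entry *src)
{
	if (dst->ptr != exp->ptr || dst->cnt != exp->cnt)
		return 0;
	*dst = *src;
	return 1;
}

/* Assumed initial state: slot i carries cnt == i, so the first lap matches. */
static void
toy_ring_init(struct toy_ring *r)
{
	for (uint64_t i = 0; i < RING_SIZE; i++) {
		r->slots[i].ptr = NULL;
		r->slots[i].cnt = i;
	}
	r->prod_tail = 0;
}

/* One enqueue attempt at a previously sampled tail; returns 1 on success. */
static int
toy_enqueue_at(struct toy_ring *r, uint64_t tail, void *obj)
{
	struct entry *slot = &r->slots[tail & RING_MASK];
	struct entry old = *slot;
	struct entry new_entry = { obj, tail + RING_SIZE };
	int ok;

	/* A counter that differs from the tail means the slot was already
	 * written for this tail value, so the caller must re-read the tail.
	 */
	if (old.cnt != tail)
		return 0;

	ok = entry_cas(slot, &old, &new_entry);

	/* Models the CAS on prod.tail, attempted whether or not the slot
	 * write succeeded.
	 */
	if (r->prod_tail == tail)
		r->prod_tail = tail + 1;

	return ok;
}
```

A producer holding a stale tail finds that cnt has already advanced by one
ring size and backs off without touching the slot, which is how the counter
guards the slot write against ABA.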

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v2 3/5] test_ring: add non-blocking ring autotest
  2019-01-15 23:52 ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Gage Eads
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 1/5] ring: change head and tail to pointer-width size Gage Eads
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 2/5] ring: add a non-blocking implementation Gage Eads
@ 2019-01-15 23:52   ` Gage Eads
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 4/5] test_ring_perf: add non-blocking ring perf test Gage Eads
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-15 23:52 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

ring_nb_autotest re-uses the ring_autotest code by wrapping its top-level
function with one that takes a 'flags' argument.
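
The wrapping pattern described above can be sketched in isolation as follows.
This is only an illustration: DEMO_F_NB, demo_test_body, demo_cmds and
demo_run are hypothetical stand-ins for RING_F_NB, __test_ring and the
REGISTER_TEST_COMMAND machinery, none of which is reproduced here.

```c
#include <string.h>

#define DEMO_F_NB 0x1u /* hypothetical stand-in for RING_F_NB */

static unsigned int demo_last_flags; /* records what the body was run with */

/* Shared test body: every ring it would create gets 'flags' OR-ed in. */
static int
demo_test_body(unsigned int flags)
{
	demo_last_flags = flags;
	return 0; /* 0 on success, negative on failure */
}

/* Zero-argument entry points, one per registered command. */
static int demo_test_ring(void) { return demo_test_body(0); }
static int demo_test_nb_ring(void) { return demo_test_body(DEMO_F_NB); }

/* Hypothetical command table standing in for REGISTER_TEST_COMMAND. */
struct demo_cmd {
	const char *name;
	int (*fn)(void);
};

static const struct demo_cmd demo_cmds[] = {
	{ "ring_autotest", demo_test_ring },
	{ "ring_nb_autotest", demo_test_nb_ring },
};

/* Run a command by name; returns its result, or -1 if it is unknown. */
static int
demo_run(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(demo_cmds) / sizeof(demo_cmds[0]); i++)
		if (strcmp(demo_cmds[i].name, name) == 0)
			return demo_cmds[i].fn();
	return -1;
}
```

Keeping the command callbacks argument-free means the existing test body is
shared unchanged, and only the ring creation flags differ between the two
commands.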

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring.c | 57 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/test/test/test_ring.c b/test/test/test_ring.c
index aaf1e70ad..ff410d978 100644
--- a/test/test/test_ring.c
+++ b/test/test/test_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <string.h>
@@ -601,18 +601,20 @@ test_ring_burst_basic(struct rte_ring *r)
  * it will always fail to create ring with a wrong ring size number in this function
  */
 static int
-test_ring_creation_with_wrong_size(void)
+test_ring_creation_with_wrong_size(unsigned int flags)
 {
 	struct rte_ring * rp = NULL;
 
 	/* Test if ring size is not power of 2 */
-	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1,
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
 
 	/* Test if ring size is exceeding the limit */
-	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1), SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1),
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
@@ -623,11 +625,11 @@ test_ring_creation_with_wrong_size(void)
  * it tests if it would always fail to create ring with an used ring name
  */
 static int
-test_ring_creation_with_an_used_name(void)
+test_ring_creation_with_an_used_name(unsigned int flags)
 {
 	struct rte_ring * rp;
 
-	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (NULL != rp)
 		return -1;
 
@@ -639,10 +641,10 @@ test_ring_creation_with_an_used_name(void)
  * function to fail correctly
  */
 static int
-test_create_count_odd(void)
+test_create_count_odd(unsigned int flags)
 {
 	struct rte_ring *r = rte_ring_create("test_ring_count",
-			4097, SOCKET_ID_ANY, 0 );
+			4097, SOCKET_ID_ANY, flags);
 	if(r != NULL){
 		return -1;
 	}
@@ -665,7 +667,7 @@ test_lookup_null(void)
  * it tests some more basic ring operations
  */
 static int
-test_ring_basic_ex(void)
+test_ring_basic_ex(unsigned int flags)
 {
 	int ret = -1;
 	unsigned i;
@@ -679,7 +681,7 @@ test_ring_basic_ex(void)
 	}
 
 	rp = rte_ring_create("test_ring_basic_ex", RING_SIZE, SOCKET_ID_ANY,
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (rp == NULL) {
 		printf("test_ring_basic_ex fail to create ring\n");
 		goto fail_test;
@@ -737,7 +739,7 @@ test_ring_basic_ex(void)
 }
 
 static int
-test_ring_with_exact_size(void)
+test_ring_with_exact_size(unsigned int flags)
 {
 	struct rte_ring *std_ring = NULL, *exact_sz_ring = NULL;
 	void *ptr_array[16];
@@ -746,13 +748,13 @@ test_ring_with_exact_size(void)
 	int ret = -1;
 
 	std_ring = rte_ring_create("std", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (std_ring == NULL) {
 		printf("%s: error, can't create std ring\n", __func__);
 		goto end;
 	}
 	exact_sz_ring = rte_ring_create("exact sz", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ);
+		RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ | flags);
 	if (exact_sz_ring == NULL) {
 		printf("%s: error, can't create exact size ring\n", __func__);
 		goto end;
@@ -808,17 +810,17 @@ test_ring_with_exact_size(void)
 }
 
 static int
-test_ring(void)
+__test_ring(unsigned int flags)
 {
 	struct rte_ring *r = NULL;
 
 	/* some more basic operations */
-	if (test_ring_basic_ex() < 0)
+	if (test_ring_basic_ex(flags) < 0)
 		goto test_fail;
 
 	rte_atomic32_init(&synchro);
 
-	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (r == NULL)
 		goto test_fail;
 
@@ -837,27 +839,27 @@ test_ring(void)
 		goto test_fail;
 
 	/* basic operations */
-	if ( test_create_count_odd() < 0){
+	if (test_create_count_odd(flags) < 0) {
 		printf("Test failed to detect odd count\n");
 		goto test_fail;
 	} else
 		printf("Test detected odd count\n");
 
-	if ( test_lookup_null() < 0){
+	if (test_lookup_null() < 0) {
 		printf("Test failed to detect NULL ring lookup\n");
 		goto test_fail;
 	} else
 		printf("Test detected NULL ring lookup\n");
 
 	/* test of creating ring with wrong size */
-	if (test_ring_creation_with_wrong_size() < 0)
+	if (test_ring_creation_with_wrong_size(flags) < 0)
 		goto test_fail;
 
 	/* test of creation ring with an used name */
-	if (test_ring_creation_with_an_used_name() < 0)
+	if (test_ring_creation_with_an_used_name(flags) < 0)
 		goto test_fail;
 
-	if (test_ring_with_exact_size() < 0)
+	if (test_ring_with_exact_size(flags) < 0)
 		goto test_fail;
 
 	/* dump the ring status */
@@ -873,4 +875,17 @@ test_ring(void)
 	return -1;
 }
 
+static int
+test_ring(void)
+{
+	return __test_ring(0);
+}
+
+static int
+test_nb_ring(void)
+{
+	return __test_ring(RING_F_NB);
+}
+
 REGISTER_TEST_COMMAND(ring_autotest, test_ring);
+REGISTER_TEST_COMMAND(ring_nb_autotest, test_nb_ring);
-- 
2.13.6

* [dpdk-dev] [PATCH v2 4/5] test_ring_perf: add non-blocking ring perf test
  2019-01-15 23:52 ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Gage Eads
                     ` (2 preceding siblings ...)
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 3/5] test_ring: add non-blocking ring autotest Gage Eads
@ 2019-01-15 23:52   ` Gage Eads
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 5/5] mempool/ring: add non-blocking ring handlers Gage Eads
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-15 23:52 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

ring_nb_perf_autotest re-uses the ring_perf_autotest code by wrapping its
top-level function with one that takes a 'flags' argument.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring_perf.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/test/test/test_ring_perf.c b/test/test/test_ring_perf.c
index ebb3939f5..380c4b4a1 100644
--- a/test/test/test_ring_perf.c
+++ b/test/test/test_ring_perf.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 
@@ -363,12 +363,12 @@ test_bulk_enqueue_dequeue(struct rte_ring *r)
 }
 
 static int
-test_ring_perf(void)
+__test_ring_perf(unsigned int flags)
 {
 	struct lcore_pair cores;
 	struct rte_ring *r = NULL;
 
-	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), 0);
+	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), flags);
 	if (r == NULL)
 		return -1;
 
@@ -398,4 +398,17 @@ test_ring_perf(void)
 	return 0;
 }
 
+static int
+test_ring_perf(void)
+{
+	return __test_ring_perf(0);
+}
+
+static int
+test_nb_ring_perf(void)
+{
+	return __test_ring_perf(RING_F_NB);
+}
+
 REGISTER_TEST_COMMAND(ring_perf_autotest, test_ring_perf);
+REGISTER_TEST_COMMAND(ring_nb_perf_autotest, test_nb_ring_perf);
-- 
2.13.6

* [dpdk-dev] [PATCH v2 5/5] mempool/ring: add non-blocking ring handlers
  2019-01-15 23:52 ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Gage Eads
                     ` (3 preceding siblings ...)
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 4/5] test_ring_perf: add non-blocking ring perf test Gage Eads
@ 2019-01-15 23:52   ` Gage Eads
  2019-01-16  0:26   ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Stephen Hemminger
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-15 23:52 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

These handlers allow an application to create a mempool based on the
non-blocking ring, with any combination of single/multi producer/consumer.
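
As a rough illustration of how an application might pick one of these
handlers, the sketch below maps a producer/consumer combination to the
handler names registered by this patch. The helper nb_ring_ops_name and its
lookup table are hypothetical and are not part of the patch; only the four
name strings come from it.

```c
/* Handler names registered by this patch, indexed by
 * [single producer?][single consumer?].
 */
static const char * const nb_ring_ops_names[2][2] = {
	{ "ring_mp_mc_nb", "ring_mp_sc_nb" },
	{ "ring_sp_mc_nb", "ring_sp_sc_nb" },
};

/* Hypothetical helper: any non-zero argument means "single". */
static const char *
nb_ring_ops_name(int single_prod, int single_cons)
{
	return nb_ring_ops_names[!!single_prod][!!single_cons];
}
```

An application would then typically pass the returned name to
rte_mempool_set_ops_byname() on a mempool obtained from
rte_mempool_create_empty(), before populating it. That call sequence is the
usual way to select a handler by name and is not shown in the patch itself.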

Also, add a note to the programmer's guide's "known issues" section.

Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
---
 doc/guides/prog_guide/env_abstraction_layer.rst |  2 +-
 drivers/mempool/ring/rte_mempool_ring.c         | 58 +++++++++++++++++++++++--
 2 files changed, 56 insertions(+), 4 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 9497b879c..b6ac236d6 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,7 +541,7 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
-  Alternatively, x86_64 applications can use the non-blocking stack mempool handler. When considering this handler, note that:
+  Alternatively, x86_64 applications can use the non-blocking ring or stack mempool handlers. When considering one of them, note that:
 
   - it is limited to the x86_64 platform, because it uses an instruction (16-byte compare-and-swap) that is not available on other platforms.
   - it has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of handler operations.
diff --git a/drivers/mempool/ring/rte_mempool_ring.c b/drivers/mempool/ring/rte_mempool_ring.c
index bc123fc52..013dac3bc 100644
--- a/drivers/mempool/ring/rte_mempool_ring.c
+++ b/drivers/mempool/ring/rte_mempool_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2016 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <stdio.h>
@@ -47,11 +47,11 @@ common_ring_get_count(const struct rte_mempool *mp)
 
 
 static int
-common_ring_alloc(struct rte_mempool *mp)
+__common_ring_alloc(struct rte_mempool *mp, int rg_flags)
 {
-	int rg_flags = 0, ret;
 	char rg_name[RTE_RING_NAMESIZE];
 	struct rte_ring *r;
+	int ret;
 
 	ret = snprintf(rg_name, sizeof(rg_name),
 		RTE_MEMPOOL_MZ_FORMAT, mp->name);
@@ -82,6 +82,18 @@ common_ring_alloc(struct rte_mempool *mp)
 	return 0;
 }
 
+static int
+common_ring_alloc(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, 0);
+}
+
+static int
+common_ring_alloc_nb(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, RING_F_NB);
+}
+
 static void
 common_ring_free(struct rte_mempool *mp)
 {
@@ -130,7 +142,47 @@ static const struct rte_mempool_ops ops_sp_mc = {
 	.get_count = common_ring_get_count,
 };
 
+static const struct rte_mempool_ops ops_mp_mc_nb = {
+	.name = "ring_mp_mc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_sc_nb = {
+	.name = "ring_sp_sc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_mp_sc_nb = {
+	.name = "ring_mp_sc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_mc_nb = {
+	.name = "ring_sp_mc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
 MEMPOOL_REGISTER_OPS(ops_mp_mc);
 MEMPOOL_REGISTER_OPS(ops_sp_sc);
 MEMPOOL_REGISTER_OPS(ops_mp_sc);
 MEMPOOL_REGISTER_OPS(ops_sp_mc);
+MEMPOOL_REGISTER_OPS(ops_mp_mc_nb);
+MEMPOOL_REGISTER_OPS(ops_sp_sc_nb);
+MEMPOOL_REGISTER_OPS(ops_mp_sc_nb);
+MEMPOOL_REGISTER_OPS(ops_sp_mc_nb);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring
  2019-01-15 23:52 ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Gage Eads
                     ` (4 preceding siblings ...)
  2019-01-15 23:52   ` [dpdk-dev] [PATCH v2 5/5] mempool/ring: add non-blocking ring handlers Gage Eads
@ 2019-01-16  0:26   ` Stephen Hemminger
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
  6 siblings, 0 replies; 123+ messages in thread
From: Stephen Hemminger @ 2019-01-16  0:26 UTC (permalink / raw)
  To: Gage Eads
  Cc: dev, olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev

On Tue, 15 Jan 2019 17:52:22 -0600
Gage Eads <gage.eads@intel.com> wrote:

> For some users, the rte ring's "non-preemptive" constraint is not acceptable;
> for example, if the application uses a mixture of pinned high-priority threads
> and multiplexed low-priority threads that share a mempool.
> 
> This patchset introduces a non-blocking ring, on top of which a mempool can run.
> Crucially, the non-blocking algorithm relies on a 128-bit compare-and-swap, so
> it is limited to x86_64 machines.
> 
> The ring uses more compare-and-swap atomic operations than the regular rte ring:
> With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations and a
> dequeue of n pointers uses 2. This algorithm has worse average-case performance
> than the regular rte ring (particularly a highly-contended ring with large bulk
> accesses), however:
> - For applications with preemptible pthreads, the regular rte ring's worst-case
>   performance (i.e. one thread being preempted in the update_tail() critical
>   section) is much worse than the non-blocking ring's.
> - Software caching can mitigate the average case performance for ring-based
>   algorithms. For example, a non-blocking ring based mempool (a likely use case
>   for this ring) with per-thread caching.
> 
> The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-of-use,
> existing ring enqueue/dequeue functions work with both "regular" and
> non-blocking rings.
> 
> This patchset also adds non-blocking versions of ring_autotest and
> ring_perf_autotest, and a non-blocking ring based mempool.
> 
> This patchset makes ABI and API changes; a deprecation notice will be
> posted in a separate commit.
> 
> This patchset depends on the non-blocking stack patchset[1].
> 
> [1] http://mails.dpdk.org/archives/dev/2019-January/123470.html
> 
> v2:
>  - Merge separate docs commit into patch #5
>  - Convert uintptr_t to size_t
>  - Add a compile-time check for the size of size_t
>  - Fix a space-after-typecast issue
>  - Fix an unnecessary-parentheses checkpatch warning
>  - Bump librte_ring's library version
> 
> Gage Eads (5):
>   ring: change head and tail to pointer-width size
>   ring: add a non-blocking implementation
>   test_ring: add non-blocking ring autotest
>   test_ring_perf: add non-blocking ring perf test
>   mempool/ring: add non-blocking ring handlers
> 
>  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
>  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
>  lib/librte_eventdev/rte_event_ring.h            |   6 +-
>  lib/librte_ring/Makefile                        |   2 +-
>  lib/librte_ring/meson.build                     |   2 +-
>  lib/librte_ring/rte_ring.c                      |  53 ++-
>  lib/librte_ring/rte_ring.h                      | 564 ++++++++++++++++++++++--
>  lib/librte_ring/rte_ring_generic.h              |  16 +-
>  lib/librte_ring/rte_ring_version.map            |   7 +
>  test/test/test_ring.c                           |  57 ++-
>  test/test/test_ring_perf.c                      |  19 +-
>  11 files changed, 699 insertions(+), 87 deletions(-)
> 

Just bumping the version number is not enough.
This looks like an ABI breakage for existing users.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-15 23:52 ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Gage Eads
                     ` (5 preceding siblings ...)
  2019-01-16  0:26   ` [dpdk-dev] [PATCH v2 0/5] Add non-blocking ring Stephen Hemminger
@ 2019-01-18 15:23   ` Gage Eads
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 1/5] ring: add 64-bit headtail structure Gage Eads
                       ` (7 more replies)
  6 siblings, 8 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-18 15:23 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking ring, on top of which a mempool can run.
Crucially, the non-blocking algorithm relies on a 128-bit compare-and-swap, so
it is currently limited to x86_64 machines. This is also an experimental API,
so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations and a
dequeue of n pointers uses 2. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the non-blocking ring's.
- Software caching can mitigate the average-case performance penalty of
  ring-based algorithms. For example, a non-blocking ring based mempool (a
  likely use case for this ring) can use per-thread caching.

The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-of-use,
existing ring enqueue/dequeue functions work with both "regular" and
non-blocking rings.

This patchset also adds non-blocking versions of ring_autotest and
ring_perf_autotest, and a non-blocking ring based mempool.

This patchset makes one API change; a deprecation notice will be posted in a
separate commit.

This patchset depends on the non-blocking stack patchset[1].

[1] http://mails.dpdk.org/archives/dev/2019-January/123653.html

v3:
 - Avoid the ABI break by putting 64-bit head and tail values in the same
   cacheline as struct rte_ring's prod and cons members.
 - Don't attempt to compile rte_atomic128_cmpset without
   ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
 - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone tries
   to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
 - Update the ring mempool to use experimental APIs
 - Clarify that RING_F_NB is currently limited to x86_64; ARMv8.1-A builds
   can eventually support it with the CASP instruction.

v2:
 - Merge separate docs commit into patch #5
 - Convert uintptr_t to size_t
 - Add a compile-time check for the size of size_t
 - Fix a space-after-typecast issue
 - Fix an unnecessary-parentheses checkpatch warning
 - Bump librte_ring's library version

Gage Eads (5):
  ring: add 64-bit headtail structure
  ring: add a non-blocking implementation
  test_ring: add non-blocking ring autotest
  test_ring_perf: add non-blocking ring perf test
  mempool/ring: add non-blocking ring handlers

 doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
 drivers/mempool/ring/Makefile                   |   1 +
 drivers/mempool/ring/meson.build                |   2 +
 drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
 lib/librte_eventdev/rte_event_ring.h            |   2 +-
 lib/librte_ring/Makefile                        |   3 +-
 lib/librte_ring/rte_ring.c                      |  72 ++-
 lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
 lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
 lib/librte_ring/rte_ring_version.map            |   7 +
 test/test/test_ring.c                           |  57 ++-
 test/test/test_ring_perf.c                      |  19 +-
 12 files changed, 874 insertions(+), 75 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic_64.h

-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v3 1/5] ring: add 64-bit headtail structure
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
@ 2019-01-18 15:23     ` Gage Eads
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation Gage Eads
                       ` (6 subsequent siblings)
  7 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-18 15:23 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

64-bit head and tail index widths greatly increase the time it takes for
them to wrap-around (with current CPU speeds, it won't happen within the
author's lifetime). This is important in avoiding the ABA problem -- in
which a thread mistakes reading the same tail index in two accesses to mean
that the ring was not modified in the intervening time -- in the upcoming
non-blocking ring implementation. Using a 64-bit index makes the
possibility of this occurring effectively zero.
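
As a rough illustration of the wrap-around claim above, the following
sketch estimates how long an index takes to wrap. The function names and
the increment rate are assumptions made for this example; they are not
part of the patch and the rate is not a measurement.

```c
/* Seconds until an index that is index_bits wide wraps around, assuming it
 * is incremented incr_per_sec times per second. The rate is supplied by the
 * caller and is purely hypothetical.
 */
static double
model_seconds_until_wrap(unsigned int index_bits, double incr_per_sec)
{
	double span = 1.0;
	unsigned int i;

	/* span = 2^index_bits, computed in floating point to avoid overflow */
	for (i = 0; i < index_bits; i++)
		span *= 2.0;

	return span / incr_per_sec;
}

/* Convert seconds to years, using 365.25 days per year. */
static double
model_seconds_to_years(double seconds)
{
	return seconds / (365.25 * 24.0 * 3600.0);
}
```

At an assumed one billion increments per second, a 32-bit index wraps in
roughly 4.3 seconds, while a 64-bit index takes roughly 584 years.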

This commit places the new producer and consumer structures in the same
location in struct rte_ring as their 32-bit counterparts. Since the 32-bit
versions are padded out to a cache line, there is space for the new
structure without affecting the layout of struct rte_ring. Thus, the ABI is
preserved.
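
The layout argument can be checked with a small standalone model. This is
only a sketch: the layout_* names are hypothetical, the 64-byte cache line
is an assumption, and the structures are reduced to the members relevant
here rather than being the real struct rte_ring. It relies on the GCC/Clang
aligned attribute and C11 anonymous unions.

```c
#include <stddef.h>
#include <stdint.h>

#define LAYOUT_CACHE_LINE 64 /* assumed cache line size */
#define LAYOUT_ALIGNED __attribute__((aligned(LAYOUT_CACHE_LINE)))

struct layout_headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
	uint32_t single;
};

struct layout_headtail_64 {
	volatile uint64_t head;
	volatile uint64_t tail;
	uint32_t single;
};

/* Reduced model of the ring before the change. */
struct layout_ring_before {
	char pad0 LAYOUT_ALIGNED;
	struct layout_headtail prod LAYOUT_ALIGNED;
	char pad1 LAYOUT_ALIGNED;
	struct layout_headtail cons LAYOUT_ALIGNED;
	char pad2 LAYOUT_ALIGNED;
};

/* Reduced model of the ring after the change: each headtail shares its
 * cache line with a 64-bit counterpart through an anonymous union.
 */
struct layout_ring_after {
	char pad0 LAYOUT_ALIGNED;
	union {
		struct layout_headtail prod LAYOUT_ALIGNED;
		struct layout_headtail_64 prod_64 LAYOUT_ALIGNED;
	};
	char pad1 LAYOUT_ALIGNED;
	union {
		struct layout_headtail cons LAYOUT_ALIGNED;
		struct layout_headtail_64 cons_64 LAYOUT_ALIGNED;
	};
	char pad2 LAYOUT_ALIGNED;
};

/* The 64-bit structure must fit in the padding of one cache line. */
_Static_assert(sizeof(struct layout_headtail_64) <= LAYOUT_CACHE_LINE,
	       "64-bit headtail does not fit in one cache line");

/* Returns 1 if the size and every member offset are unchanged. */
static int
layout_is_preserved(void)
{
	return sizeof(struct layout_ring_before) ==
			sizeof(struct layout_ring_after) &&
		offsetof(struct layout_ring_before, prod) ==
			offsetof(struct layout_ring_after, prod) &&
		offsetof(struct layout_ring_before, pad1) ==
			offsetof(struct layout_ring_after, pad1) &&
		offsetof(struct layout_ring_before, cons) ==
			offsetof(struct layout_ring_after, cons) &&
		offsetof(struct layout_ring_before, pad2) ==
			offsetof(struct layout_ring_after, pad2);
}
```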

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_eventdev/rte_event_ring.h  |   2 +-
 lib/librte_ring/Makefile              |   3 +-
 lib/librte_ring/rte_ring.h            |  24 +++++-
 lib/librte_ring/rte_ring_generic_64.h | 152 ++++++++++++++++++++++++++++++++++
 4 files changed, 176 insertions(+), 5 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_generic_64.h

diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h
index 827a3209e..5fcb2d5f7 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2016-2017 Intel Corporation
+ * Copyright(c) 2016-2019 Intel Corporation
  */
 
 /**
diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 21a36770d..18c48fbc8 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -19,6 +19,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
 					rte_ring_generic.h \
-					rte_ring_c11_mem.h
+					rte_ring_c11_mem.h \
+					rte_ring_generic_64.h
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index af5444a9f..b270a4746 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -70,6 +70,15 @@ struct rte_ring_headtail {
 	uint32_t single;         /**< True if single prod/cons */
 };
 
+/* 64-bit version of rte_ring_headtail, for use by rings that need to avoid
+ * head/tail wrap-around.
+ */
+struct rte_ring_headtail_64 {
+	volatile uint64_t head;  /**< Prod/consumer head. */
+	volatile uint64_t tail;  /**< Prod/consumer tail. */
+	uint32_t single;       /**< True if single prod/cons */
+};
+
 /**
  * An RTE ring structure.
  *
@@ -97,11 +106,19 @@ struct rte_ring {
 	char pad0 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring producer status. */
-	struct rte_ring_headtail prod __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail prod __rte_cache_aligned;
+		struct rte_ring_headtail_64 prod_64 __rte_cache_aligned;
+	};
 	char pad1 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring consumer status. */
-	struct rte_ring_headtail cons __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail cons __rte_cache_aligned;
+		struct rte_ring_headtail_64 cons_64 __rte_cache_aligned;
+	};
 	char pad2 __rte_cache_aligned; /**< empty cache line */
 };
 
@@ -312,6 +329,7 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #else
 #include "rte_ring_generic.h"
 #endif
+#include "rte_ring_generic_64.h"
 
 /**
  * @internal Enqueue several objects on the ring
diff --git a/lib/librte_ring/rte_ring_generic_64.h b/lib/librte_ring/rte_ring_generic_64.h
new file mode 100644
index 000000000..58de144c6
--- /dev/null
+++ b/lib/librte_ring/rte_ring_generic_64.h
@@ -0,0 +1,152 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 2010-2019 Intel Corporation
+ * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * All rights reserved.
+ * Derived from FreeBSD's bufring.h
+ * Used as BSD-3 Licensed with permission from Kip Macy.
+ */
+
+#ifndef _RTE_RING_GENERIC_64_H_
+#define _RTE_RING_GENERIC_64_H_
+
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     64-bit head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_64(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uint64_t *old_head, uint64_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		*old_head = r->prod_64.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		/*
+		 *  The subtraction is done between two unsigned 64bits value
+		 * (the result is always modulo 64 bits even if we have
+		 * *old_head > cons_tail). So 'free_entries' is always between 0
+		 * and capacity (which is < size).
+		 */
+		*free_entries = (capacity + r->cons_64.tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_64.head = *new_head, success = 1;
+		else
+			success = rte_atomic64_cmpset(&r->prod_64.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     64-bit head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_64(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uint64_t *old_head, uint64_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		*old_head = r->cons_64.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		/* The subtraction is done between two unsigned 64bits value
+		 * (the result is always modulo 64 bits even if we have
+		 * cons_head > prod_tail). So 'entries' is always between 0
+		 * and size(ring)-1.
+		 */
+		*entries = (r->prod_64.tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_64.head = *new_head, success = 1;
+		else
+			success = rte_atomic64_cmpset(&r->cons_64.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+#endif /* _RTE_RING_GENERIC_64_H_ */
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 1/5] ring: add 64-bit headtail structure Gage Eads
@ 2019-01-18 15:23     ` Gage Eads
  2019-01-22 10:12       ` Ola Liljedahl
  2019-01-22 14:49       ` Ola Liljedahl
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 3/5] test_ring: add non-blocking ring autotest Gage Eads
                       ` (5 subsequent siblings)
  7 siblings, 2 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-18 15:23 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

This commit adds support for non-blocking circular ring enqueue and dequeue
functions. The ring uses a 128-bit compare-and-swap instruction, and thus
is currently limited to x86_64.

The algorithm is based on the original rte ring (derived from FreeBSD's
bufring.h) and inspired by Michael and Scott's non-blocking concurrent
queue. Importantly, it adds a modification counter to each ring entry to
ensure only one thread can write to an unused entry.

-----
Algorithm:

Multi-producer non-blocking enqueue:
1. Move the producer head index 'n' locations forward, effectively
   reserving 'n' locations.
2. For each pointer:
 a. Read the producer tail index, then ring[tail]. If ring[tail]'s
    modification counter isn't 'tail', retry.
 b. Construct the new entry: {pointer, tail + ring size}
 c. Compare-and-swap the old entry with the new. If unsuccessful, the
    next loop iteration will try to enqueue this pointer again.
 d. Compare-and-swap the tail index with 'tail + 1', whether or not step 2c
    succeeded. This guarantees threads can make forward progress.

Multi-consumer non-blocking dequeue:
1. Move the consumer head index 'n' locations forward, effectively
   reserving 'n' pointers to be dequeued.
2. Copy 'n' pointers into the caller's object table (ignoring the
   modification counter), starting from ring[tail], then compare-and-swap
   the tail index with 'tail + n'.  If unsuccessful, repeat step 2.
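
The role of the modification counter in the steps above can be sketched
with a single-threaded model. This is an illustration only, not the code in
this patch: the model_* names are hypothetical, the compare-and-swap helpers
are ordinary non-atomic functions standing in for the 128-bit and 64-bit
atomic operations, and the head reservation (step 1) is left to the caller.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_RING_SIZE 4u /* must be a power of two */
#define MODEL_RING_MASK (MODEL_RING_SIZE - 1u)

struct model_entry {
	void *ptr;
	uint64_t cnt; /* modification counter */
};

struct model_nb_ring {
	uint64_t prod_tail;
	uint64_t cons_tail;
	struct model_entry ring[MODEL_RING_SIZE];
};

/* Each entry's counter starts at its own index. */
static void
model_init(struct model_nb_ring *r)
{
	uint64_t i;

	r->prod_tail = r->cons_tail = 0;
	for (i = 0; i < MODEL_RING_SIZE; i++) {
		r->ring[i].ptr = NULL;
		r->ring[i].cnt = i;
	}
}

/* NOT atomic: stand-in for a 128-bit compare-and-swap. */
static int
model_cas_entry(struct model_entry *dst, struct model_entry exp,
		struct model_entry val)
{
	if (dst->ptr != exp.ptr || dst->cnt != exp.cnt)
		return 0;
	*dst = val;
	return 1;
}

/* NOT atomic: stand-in for a 64-bit compare-and-swap. */
static int
model_cas_u64(uint64_t *dst, uint64_t exp, uint64_t val)
{
	if (*dst != exp)
		return 0;
	*dst = val;
	return 1;
}

/* One pass over enqueue steps 2a-2d for a single pointer. Returns 1 if the
 * pointer was written, 0 if the caller should retry.
 */
static int
model_enqueue_attempt(struct model_nb_ring *r, void *obj)
{
	uint64_t tail = r->prod_tail;				/* 2a */
	struct model_entry *slot = &r->ring[tail & MODEL_RING_MASK];
	struct model_entry old = *slot;
	struct model_entry val;
	int written;

	if (old.cnt != tail)
		return 0; /* entry already updated for this tail value */

	val.ptr = obj;						/* 2b */
	val.cnt = tail + MODEL_RING_SIZE;
	written = model_cas_entry(slot, old, val);		/* 2c */
	model_cas_u64(&r->prod_tail, tail, tail + 1);		/* 2d */
	return written;
}

/* Dequeue step 2: copy n pointers, then advance the consumer tail. The
 * caller must have established that n entries are available.
 */
static unsigned int
model_dequeue(struct model_nb_ring *r, void **objs, unsigned int n)
{
	uint64_t tail;
	unsigned int i;

	do {
		tail = r->cons_tail;
		for (i = 0; i < n; i++)
			objs[i] = r->ring[(tail + i) & MODEL_RING_MASK].ptr;
	} while (!model_cas_u64(&r->cons_tail, tail, tail + n));

	return n;
}
```

In this model an entry written at index 'tail' carries the counter
'tail + ring size', which is exactly the tail value at which that slot is
next reused, so a producer holding a stale tail value fails the check in
step 2a.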

-----
Discussion:

There are two cases where the ABA problem is mitigated:
1. Enqueueing a pointer to the ring: without a modification counter
   tied to the tail index, the index could become stale by the time the
   enqueue happens, causing it to overwrite valid data. Tying the
   counter to the tail index gives us an expected value (as opposed to,
   say, a monotonically incrementing counter).

   Since the counter will eventually wrap, there is potential for the ABA
   problem. However, using a 64-bit counter makes this likelihood
   effectively zero.

2. Updating a tail index: the ABA problem can occur if the thread is
   preempted and the tail index wraps around. However, using 64-bit indexes
   makes this likelihood effectively zero.

With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations
and a dequeue of n pointers uses 2. This algorithm has worse average-case
performance than the regular rte ring (particularly a highly-contended ring
with large bulk accesses), however:
- For applications with preemptible pthreads, the regular rte ring's
  worst-case performance (i.e. one thread being preempted in the
  update_tail() critical section) is much worse than the non-blocking
  ring's.
- Software caching can mitigate the average-case performance penalty of
  ring-based algorithms. For example, a non-blocking ring based mempool (a
  likely use case for this ring) can use per-thread caching.

The non-blocking ring is enabled via a new flag, RING_F_NB. Because the
ring's memsize is now a function of its flags (the non-blocking ring
requires 128 bits for each entry), this commit adds a new argument ('flags')
to rte_ring_get_memsize(). An API deprecation notice will be sent in a
separate commit.
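
The flag-dependent size calculation can be sketched as follows. This is a
standalone model rather than the library function: the memsize_* names, the
explicit header_size parameter, the 64-byte cache line and the -1 error
return are assumptions made for the example.

```c
#include <stddef.h>

#define MEMSIZE_RING_F_NB 0x0008u	/* mirrors the RING_F_NB value */
#define MEMSIZE_CACHE_LINE 64u		/* assumed cache line size */

/* Bytes needed for a ring header of header_size bytes followed by count
 * entries, rounded up to a cache line. Returns -1 if count is not a power
 * of two (the library function returns -EINVAL instead).
 */
static long
memsize_model(size_t header_size, unsigned int count, unsigned int flags)
{
	size_t elt_sz, sz;

	if (count == 0 || (count & (count - 1)) != 0)
		return -1;

	/* A non-blocking entry holds a pointer plus a pointer-sized counter */
	elt_sz = (flags & MEMSIZE_RING_F_NB) ?
			2 * sizeof(void *) : sizeof(void *);

	sz = header_size + (size_t)count * elt_sz;
	sz = (sz + MEMSIZE_CACHE_LINE - 1) & ~(size_t)(MEMSIZE_CACHE_LINE - 1);
	return (long)sz;
}
```

With a zero-sized header and 1024 entries, the non-blocking variant needs
exactly twice the memory of the regular one in this model.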

For ease-of-use, existing ring enqueue and dequeue functions work on both
regular and non-blocking rings. This introduces an additional branch in
the datapath, but this should be a highly predictable branch.
ring_perf_autotest shows a negligible performance impact; it's hard to
distinguish a real difference versus system noise.

                                  | ring_perf_autotest cycles with branch -
             Test                 |   ring_perf_autotest cycles without
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | -4.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 0.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | 0.00
SC empty dequeue                  | 1.00
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | 0.49
MP/MC bulk enq/dequeue (size 8)   | 0.08
SP/SC bulk enq/dequeue (size 32)  | 0.07
MP/MC bulk enq/dequeue (size 32)  | 0.09

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | 0.19
MP/MC bulk enq/dequeue (size 8)   | -0.37
SP/SC bulk enq/dequeue (size 32)  | 0.09
MP/MC bulk enq/dequeue (size 32)  | -0.05

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | -1.96
MP/MC bulk enq/dequeue (size 8)   | 0.88
SP/SC bulk enq/dequeue (size 32)  | 0.10
MP/MC bulk enq/dequeue (size 32)  | 0.46

Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. Each test was run three
times and the results were averaged.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.c           |  72 ++++-
 lib/librte_ring/rte_ring.h           | 550 +++++++++++++++++++++++++++++++++--
 lib/librte_ring/rte_ring_version.map |   7 +
 3 files changed, 587 insertions(+), 42 deletions(-)

diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d215acecc..f3378dccd 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -45,9 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags)
 {
-	ssize_t sz;
+	ssize_t sz, elt_sz;
 
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
@@ -57,10 +57,23 @@ rte_ring_get_memsize(unsigned count)
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	elt_sz = (flags & RING_F_NB) ? 2 * sizeof(void *) : sizeof(void *);
+
+	sz = sizeof(struct rte_ring) + count * elt_sz;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
+BIND_DEFAULT_SYMBOL(rte_ring_get_memsize, _v1905, 19.05);
+MAP_STATIC_SYMBOL(ssize_t rte_ring_get_memsize(unsigned int count,
+					       unsigned int flags),
+		  rte_ring_get_memsize_v1905);
+
+ssize_t
+rte_ring_get_memsize_v20(unsigned int count)
+{
+	return rte_ring_get_memsize_v1905(count, 0);
+}
+VERSION_SYMBOL(rte_ring_get_memsize, _v20, 2.0);
 
 int
 rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
@@ -82,8 +95,6 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	if (ret < 0 || ret >= (int)sizeof(r->name))
 		return -ENAMETOOLONG;
 	r->flags = flags;
-	r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
-	r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
 
 	if (flags & RING_F_EXACT_SZ) {
 		r->size = rte_align32pow2(count + 1);
@@ -100,8 +111,30 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 		r->mask = count - 1;
 		r->capacity = r->mask;
 	}
-	r->prod.head = r->cons.head = 0;
-	r->prod.tail = r->cons.tail = 0;
+
+	if (flags & RING_F_NB) {
+		uint64_t i;
+
+		r->prod_64.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons_64.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod_64.head = r->cons_64.head = 0;
+		r->prod_64.tail = r->cons_64.tail = 0;
+
+		for (i = 0; i < r->size; i++) {
+			struct nb_ring_entry *ring_ptr, *base;
+
+			base = ((struct nb_ring_entry *)&r[1]);
+
+			ring_ptr = &base[i & r->mask];
+
+			ring_ptr->cnt = i;
+		}
+	} else {
+		r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod.head = r->cons.head = 0;
+		r->prod.tail = r->cons.tail = 0;
+	}
 
 	return 0;
 }
@@ -123,11 +156,19 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 
 	ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
 
+#if !defined(RTE_ARCH_X86_64)
+	if (flags & RING_F_NB) {
+		printf("RING_F_NB is only supported on x86-64 platforms\n");
+		rte_errno = EINVAL;
+		return NULL;
+	}
+#endif
+
 	/* for an exact size ring, round up from count to a power of two */
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize(count, flags);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -227,10 +268,17 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
 	fprintf(f, "  flags=%x\n", r->flags);
 	fprintf(f, "  size=%"PRIu32"\n", r->size);
 	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
-	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
-	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
-	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
-	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	if (r->flags & RING_F_NB) {
+		fprintf(f, "  ct=%"PRIu64"\n", r->cons_64.tail);
+		fprintf(f, "  ch=%"PRIu64"\n", r->cons_64.head);
+		fprintf(f, "  pt=%"PRIu64"\n", r->prod_64.tail);
+		fprintf(f, "  ph=%"PRIu64"\n", r->prod_64.head);
+	} else {
+		fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
+		fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
+		fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
+		fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	}
 	fprintf(f, "  used=%u\n", rte_ring_count(r));
 	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
 }
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index b270a4746..08c9de6a6 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -134,6 +134,18 @@ struct rte_ring {
  */
 #define RING_F_EXACT_SZ 0x0004
 #define RTE_RING_SZ_MASK  (0x7fffffffU) /**< Ring size mask */
+/**
+ * The ring uses non-blocking enqueue and dequeue functions. These functions
+ * do not have the "non-preemptive" constraint of a regular rte ring, and thus
+ * are suited for applications using preemptible pthreads. However, the
+ * non-blocking functions have worse average-case performance than their
+ * regular rte ring counterparts. When used as the handler for a mempool,
+ * per-thread caching can mitigate the performance difference by reducing the
+ * number (and contention) of ring accesses.
+ *
+ * This flag is only supported on x86_64 platforms.
+ */
+#define RING_F_NB 0x0008
 
 /* @internal defines for passing to the enqueue dequeue worker functions */
 #define __IS_SP 1
@@ -151,11 +163,15 @@ struct rte_ring {
  *
  * @param count
  *   The number of elements in the ring (must be a power of 2).
+ * @param flags
+ *   The flags the ring will be created with.
  * @return
  *   - The memory size needed for the ring on success.
  *   - -EINVAL if count is not a power of 2.
  */
-ssize_t rte_ring_get_memsize(unsigned count);
+ssize_t rte_ring_get_memsize(unsigned int count, unsigned int flags);
+ssize_t rte_ring_get_memsize_v20(unsigned int count);
+ssize_t rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags);
 
 /**
  * Initialize a ring structure.
@@ -188,6 +204,10 @@ ssize_t rte_ring_get_memsize(unsigned count);
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
+ *      non-blocking variants of the dequeue and enqueue functions.
  * @return
  *   0 on success, or a negative value on error.
  */
@@ -223,12 +243,17 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
+ *      non-blocking variants of the dequeue and enqueue functions.
  * @return
  *   On success, the pointer to the new allocated ring. NULL on error with
  *    rte_errno set appropriately. Possible errno values include:
  *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
  *    - E_RTE_SECONDARY - function was called from a secondary process instance
- *    - EINVAL - count provided is not a power of 2
+ *    - EINVAL - count provided is not a power of 2, or RING_F_NB is used on an
+ *      unsupported platform
  *    - ENOSPC - the maximum number of memzones has already been allocated
  *    - EEXIST - a memzone with the same name already exists
  *    - ENOMEM - no appropriate memory area found in which to create memzone
@@ -284,6 +309,50 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual enqueue of pointers on the ring.
+ * Used only by the single-producer non-blocking enqueue function, but
+ * out-lined here for code readability.
+ */
+#define ENQUEUE_PTRS_NB(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	size_t idx = prod_head & (r)->mask; \
+	size_t new_cnt = prod_head + size; \
+	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
+			ring[idx].ptr = obj_table[i]; \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx + 1].ptr = obj_table[i + 1]; \
+			ring[idx + 1].cnt = new_cnt + i + 1;  \
+			ring[idx + 2].ptr = obj_table[i + 2]; \
+			ring[idx + 2].cnt = new_cnt + i + 2;  \
+			ring[idx + 3].ptr = obj_table[i + 3]; \
+			ring[idx + 3].cnt = new_cnt + i + 3;  \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) { \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+		for (idx = 0; i < n; i++, idx++) {    \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+	} \
+} while (0)
+
 /* the actual copy of pointers on the ring to obj_table.
  * Placed here since identical code needed in both
  * single and multi consumer dequeue functions */
@@ -315,6 +384,39 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual copy of pointers on the ring to obj_table.
+ * Placed here since identical code needed in both
+ * single and multi consumer non-blocking dequeue functions.
+ */
+#define DEQUEUE_PTRS_NB(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	size_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
+			obj_table[i] = ring[idx].ptr; \
+			obj_table[i + 1] = ring[idx + 1].ptr; \
+			obj_table[i + 2] = ring[idx + 2].ptr; \
+			obj_table[i + 3] = ring[idx + 3].ptr; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 2: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 1: \
+			obj_table[i++] = ring[idx++].ptr; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+	} \
+} while (0)
+
+
 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
  * There are 2 choices for the users
@@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #endif
 #include "rte_ring_generic_64.h"
 
+/* @internal 128-bit structure used by the non-blocking ring */
+struct nb_ring_entry {
+	void *ptr; /**< Data pointer */
+	uint64_t cnt; /**< Modification counter */
+};
+
+/* The non-blocking ring algorithm is based on the original rte ring (derived
+ * from FreeBSD's bufring.h) and inspired by Michael and Scott's non-blocking
+ * concurrent queue.
+ */
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   If non-NULL, returns the free space left after the enqueue has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	size_t head, next;
+
+	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
+					 &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	r->prod_64.tail += n;
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   If non-NULL, returns the free space left after the enqueue has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+#ifndef ALLOW_EXPERIMENTAL_API
+	printf("[%s()] RING_F_NB requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
+	       , __func__);
+#endif
+	return 0;
+#endif
+#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
+	size_t head, next, tail;
+	uint32_t free_entries;
+	unsigned int i;
+
+	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
+					 &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
+		struct nb_ring_entry old_value, new_value;
+		struct nb_ring_entry *ring_ptr;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		tail = r->prod_64.tail;
+
+		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r->mask];
+
+		old_value = *ring_ptr;
+
+		/* If the tail entry's modification counter doesn't match the
+		 * producer tail index, it's already been updated.
+		 */
+		if (old_value.cnt != tail)
+			continue;
+
+		/* Prepare the new entry. The cnt field mitigates the ABA
+		 * problem on the ring write.
+		 */
+		new_value.ptr = obj_table[i];
+		new_value.cnt = tail + r->size;
+
+		if (rte_atomic128_cmpset((volatile rte_int128_t *)ring_ptr,
+					 (rte_int128_t *)&old_value,
+					 (rte_int128_t *)&new_value))
+			i++;
+
+		/* Every thread attempts the cmpset, so they don't have to wait
+		 * for the thread that successfully enqueued to the ring.
+		 * Using a 64-bit tail mitigates the ABA problem here.
+		 *
+		 * Built-in used to handle variable-sized tail index.
+		 */
+		__sync_bool_compare_and_swap(&r->prod_64.tail, tail, tail + 1);
+	}
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+#endif
+}
+
+/**
+ * @internal Enqueue several objects on the non-blocking ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   If non-NULL, returns the free space left after the enqueue has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue(struct rte_ring *r, void * const *obj_table,
+			 unsigned int n, enum rte_ring_queue_behavior behavior,
+			 unsigned int is_sp, unsigned int *free_space)
+{
+	if (is_sp)
+		return __rte_ring_do_nb_enqueue_sp(r, obj_table, n,
+						   behavior, free_space);
+	else
+		return __rte_ring_do_nb_enqueue_mp(r, obj_table, n,
+						   behavior, free_space);
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   If non-NULL, returns the entries remaining after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	size_t head, next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head_64(r, 1, n, behavior,
+					 &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	r->cons_64.tail += n;
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   If non-NULL, returns the entries remaining after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	size_t head, next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head_64(r, 0, n, behavior,
+					 &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	while (1) {
+		size_t tail = r->cons_64.tail;
+
+		/* Dequeue from the cons tail onwards. If multiple threads read
+		 * the same pointers, the thread that successfully performs the
+		 * CAS will keep them and the other(s) will retry.
+		 */
+		DEQUEUE_PTRS_NB(r, &r[1], tail, obj_table, n);
+
+		next = tail + n;
+
+		/* Built-in used to handle variable-sized tail index. */
+		if (__sync_bool_compare_and_swap(&r->cons_64.tail, tail, next))
+			/* There is potential for the ABA problem here, but
+			 * that is mitigated by the large (64-bit) tail.
+			 */
+			break;
+	}
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * @internal Dequeue several objects from the non-blocking ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   If non-NULL, returns the entries remaining after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue(struct rte_ring *r, void **obj_table,
+		 unsigned int n, enum rte_ring_queue_behavior behavior,
+		 unsigned int is_sc, unsigned int *available)
+{
+	if (is_sc)
+		return __rte_ring_do_nb_dequeue_sc(r, obj_table, n,
+						   behavior, available);
+	else
+		return __rte_ring_do_nb_dequeue_mc(r, obj_table, n,
+						   behavior, available);
+}
+
 /**
  * @internal Enqueue several objects on the ring
  *
@@ -438,8 +853,14 @@ static __rte_always_inline unsigned int
 rte_ring_mp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MP,
+					     free_space);
 }
 
 /**
@@ -461,8 +882,14 @@ static __rte_always_inline unsigned int
 rte_ring_sp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SP,
+					     free_space);
 }
 
 /**
@@ -488,8 +915,14 @@ static __rte_always_inline unsigned int
 rte_ring_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->prod_64.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -572,8 +1005,14 @@ static __rte_always_inline unsigned int
 rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MC,
+					     available);
 }
 
 /**
@@ -596,8 +1035,14 @@ static __rte_always_inline unsigned int
 rte_ring_sc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SC,
+					     available);
 }
 
 /**
@@ -623,8 +1068,14 @@ static __rte_always_inline unsigned int
 rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
 		unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-				r->cons.single, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->cons_64.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->cons.single, available);
 }
 
 /**
@@ -699,9 +1150,13 @@ rte_ring_dequeue(struct rte_ring *r, void **obj_p)
 static inline unsigned
 rte_ring_count(const struct rte_ring *r)
 {
-	uint32_t prod_tail = r->prod.tail;
-	uint32_t cons_tail = r->cons.tail;
-	uint32_t count = (prod_tail - cons_tail) & r->mask;
+	uint32_t count;
+
+	if (r->flags & RING_F_NB)
+		count = (r->prod_64.tail - r->cons_64.tail) & r->mask;
+	else
+		count = (r->prod.tail - r->cons.tail) & r->mask;
+
 	return (count > r->capacity) ? r->capacity : count;
 }
 
@@ -821,8 +1276,14 @@ static __rte_always_inline unsigned
 rte_ring_mp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MP, free_space);
 }
 
 /**
@@ -844,8 +1305,14 @@ static __rte_always_inline unsigned
 rte_ring_sp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SP, free_space);
 }
 
 /**
@@ -871,8 +1338,14 @@ static __rte_always_inline unsigned
 rte_ring_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_VARIABLE,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->prod_64.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -899,8 +1372,14 @@ static __rte_always_inline unsigned
 rte_ring_mc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MC, available);
 }
 
 /**
@@ -924,8 +1403,14 @@ static __rte_always_inline unsigned
 rte_ring_sc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SC, available);
 }
 
 /**
@@ -951,9 +1436,14 @@ static __rte_always_inline unsigned
 rte_ring_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-				RTE_RING_QUEUE_VARIABLE,
-				r->cons.single, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->cons_64.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->cons.single, available);
 }
 
 #ifdef __cplusplus
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index d935efd0d..8969467af 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -17,3 +17,10 @@ DPDK_2.2 {
 	rte_ring_free;
 
 } DPDK_2.0;
+
+DPDK_19.05 {
+	global:
+
+	rte_ring_get_memsize;
+
+} DPDK_2.2;
-- 
2.13.6
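
The multi-producer enqueue in the patch above pairs each ring slot with a
modification counter: a writer may only fill a slot whose counter equals the
tail index it observed, and it then advances the counter by the ring size so
that a stale writer from an earlier lap is rejected. The following is a minimal
single-threaded sketch of that counter check only. It is not the lock-free
implementation: the patch performs the slot update with rte_atomic128_cmpset()
and the tail update with __sync_bool_compare_and_swap(), whereas this model uses
plain stores. The model_* names, the ring size of 8, and the initial state
(slot i starts with counter i) are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RING_SIZE 8u	/* hypothetical size; must be a power of 2 */
#define MODEL_RING_MASK (MODEL_RING_SIZE - 1)

/* Same shape as the patch's struct nb_ring_entry: pointer plus counter. */
struct model_entry {
	void *ptr;
	uint64_t cnt;
};

struct model_ring {
	struct model_entry slots[MODEL_RING_SIZE];
	uint64_t prod_tail;
};

/* Assumed initial state: slot i carries counter i, so lap 0 writes match. */
static void
model_ring_init(struct model_ring *r)
{
	uint64_t i;

	assert((MODEL_RING_SIZE & MODEL_RING_MASK) == 0);

	for (i = 0; i < MODEL_RING_SIZE; i++) {
		r->slots[i].ptr = NULL;
		r->slots[i].cnt = i;
	}
	r->prod_tail = 0;
}

/* Try to write obj at the tail index the caller observed earlier.
 * Returns 1 on success, 0 if the slot was already filled for that index.
 */
static int
model_try_enqueue(struct model_ring *r, uint64_t observed_tail, void *obj)
{
	struct model_entry *e = &r->slots[observed_tail & MODEL_RING_MASK];

	/* Counter mismatch: the observed tail is stale, so do not write. */
	if (e->cnt != observed_tail)
		return 0;

	e->ptr = obj;
	e->cnt = observed_tail + MODEL_RING_SIZE; /* expected on the next lap */

	/* Stand-in for the tail compare-and-swap in the patch. */
	if (r->prod_tail == observed_tail)
		r->prod_tail = observed_tail + 1;

	return 1;
}
```

In this model, a second attempt that reuses an already-consumed tail index
fails the counter check and leaves the slot untouched, which is the property
the cnt field is meant to provide on the ring write.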

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v3 3/5] test_ring: add non-blocking ring autotest
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 1/5] ring: add 64-bit headtail structure Gage Eads
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation Gage Eads
@ 2019-01-18 15:23     ` Gage Eads
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 4/5] test_ring_perf: add non-blocking ring perf test Gage Eads
                       ` (4 subsequent siblings)
  7 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-18 15:23 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

ring_nb_autotest re-uses the ring_autotest code: the top-level test function
now takes a 'flags' argument, and thin wrappers call it with 0 or RING_F_NB.
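
As a rough standalone sketch of this wrapping pattern (not the test code
itself), the shared body can OR the caller's flags into whatever flags each
test case already needs, while each registered entry point only chooses the
ring flavour. All sketch_* names and the flag values below are hypothetical
and are not taken from rte_ring.h.

```c
#include <assert.h>

/* Hypothetical flag values for illustration; not taken from rte_ring.h. */
#define SKETCH_F_SP_ENQ 0x1u
#define SKETCH_F_SC_DEQ 0x2u
#define SKETCH_F_NB     0x8u

/* Stand-in for ring creation: records the flags it was asked to use. */
static unsigned int sketch_last_create_flags;

static int
sketch_ring_create(unsigned int flags)
{
	sketch_last_create_flags = flags;
	return 0;
}

/* Shared test body: ORs the caller's flags into the flags a case needs. */
static int
sketch_run_tests(unsigned int flags)
{
	/* The wrappers below only ever pass 0 or SKETCH_F_NB. */
	assert((flags & ~SKETCH_F_NB) == 0);

	return sketch_ring_create(SKETCH_F_SP_ENQ | SKETCH_F_SC_DEQ | flags);
}

/* Entry point for the regular ring. */
static int
sketch_test_ring(void)
{
	return sketch_run_tests(0);
}

/* Entry point for the non-blocking ring. */
static int
sketch_test_nb_ring(void)
{
	return sketch_run_tests(SKETCH_F_NB);
}
```

The appeal of this layout is that a new ring flavour needs only one more
wrapper and one more registration line, with no change to the test bodies.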

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring.c | 57 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/test/test/test_ring.c b/test/test/test_ring.c
index aaf1e70ad..ff410d978 100644
--- a/test/test/test_ring.c
+++ b/test/test/test_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <string.h>
@@ -601,18 +601,20 @@ test_ring_burst_basic(struct rte_ring *r)
  * it will always fail to create ring with a wrong ring size number in this function
  */
 static int
-test_ring_creation_with_wrong_size(void)
+test_ring_creation_with_wrong_size(unsigned int flags)
 {
 	struct rte_ring * rp = NULL;
 
 	/* Test if ring size is not power of 2 */
-	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1,
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
 
 	/* Test if ring size is exceeding the limit */
-	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1), SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1),
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
@@ -623,11 +625,11 @@ test_ring_creation_with_wrong_size(void)
  * it tests if it would always fail to create ring with an used ring name
  */
 static int
-test_ring_creation_with_an_used_name(void)
+test_ring_creation_with_an_used_name(unsigned int flags)
 {
 	struct rte_ring * rp;
 
-	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (NULL != rp)
 		return -1;
 
@@ -639,10 +641,10 @@ test_ring_creation_with_an_used_name(void)
  * function to fail correctly
  */
 static int
-test_create_count_odd(void)
+test_create_count_odd(unsigned int flags)
 {
 	struct rte_ring *r = rte_ring_create("test_ring_count",
-			4097, SOCKET_ID_ANY, 0 );
+			4097, SOCKET_ID_ANY, flags);
 	if(r != NULL){
 		return -1;
 	}
@@ -665,7 +667,7 @@ test_lookup_null(void)
  * it tests some more basic ring operations
  */
 static int
-test_ring_basic_ex(void)
+test_ring_basic_ex(unsigned int flags)
 {
 	int ret = -1;
 	unsigned i;
@@ -679,7 +681,7 @@ test_ring_basic_ex(void)
 	}
 
 	rp = rte_ring_create("test_ring_basic_ex", RING_SIZE, SOCKET_ID_ANY,
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (rp == NULL) {
 		printf("test_ring_basic_ex fail to create ring\n");
 		goto fail_test;
@@ -737,7 +739,7 @@ test_ring_basic_ex(void)
 }
 
 static int
-test_ring_with_exact_size(void)
+test_ring_with_exact_size(unsigned int flags)
 {
 	struct rte_ring *std_ring = NULL, *exact_sz_ring = NULL;
 	void *ptr_array[16];
@@ -746,13 +748,13 @@ test_ring_with_exact_size(void)
 	int ret = -1;
 
 	std_ring = rte_ring_create("std", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (std_ring == NULL) {
 		printf("%s: error, can't create std ring\n", __func__);
 		goto end;
 	}
 	exact_sz_ring = rte_ring_create("exact sz", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ);
+		RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ | flags);
 	if (exact_sz_ring == NULL) {
 		printf("%s: error, can't create exact size ring\n", __func__);
 		goto end;
@@ -808,17 +810,17 @@ test_ring_with_exact_size(void)
 }
 
 static int
-test_ring(void)
+__test_ring(unsigned int flags)
 {
 	struct rte_ring *r = NULL;
 
 	/* some more basic operations */
-	if (test_ring_basic_ex() < 0)
+	if (test_ring_basic_ex(flags) < 0)
 		goto test_fail;
 
 	rte_atomic32_init(&synchro);
 
-	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (r == NULL)
 		goto test_fail;
 
@@ -837,27 +839,27 @@ test_ring(void)
 		goto test_fail;
 
 	/* basic operations */
-	if ( test_create_count_odd() < 0){
+	if (test_create_count_odd(flags) < 0) {
 		printf("Test failed to detect odd count\n");
 		goto test_fail;
 	} else
 		printf("Test detected odd count\n");
 
-	if ( test_lookup_null() < 0){
+	if (test_lookup_null() < 0) {
 		printf("Test failed to detect NULL ring lookup\n");
 		goto test_fail;
 	} else
 		printf("Test detected NULL ring lookup\n");
 
 	/* test of creating ring with wrong size */
-	if (test_ring_creation_with_wrong_size() < 0)
+	if (test_ring_creation_with_wrong_size(flags) < 0)
 		goto test_fail;
 
 	/* test of creation ring with an used name */
-	if (test_ring_creation_with_an_used_name() < 0)
+	if (test_ring_creation_with_an_used_name(flags) < 0)
 		goto test_fail;
 
-	if (test_ring_with_exact_size() < 0)
+	if (test_ring_with_exact_size(flags) < 0)
 		goto test_fail;
 
 	/* dump the ring status */
@@ -873,4 +875,17 @@ test_ring(void)
 	return -1;
 }
 
+static int
+test_ring(void)
+{
+	return __test_ring(0);
+}
+
+static int
+test_nb_ring(void)
+{
+	return __test_ring(RING_F_NB);
+}
+
 REGISTER_TEST_COMMAND(ring_autotest, test_ring);
+REGISTER_TEST_COMMAND(ring_nb_autotest, test_nb_ring);
-- 
2.13.6


* [dpdk-dev] [PATCH v3 4/5] test_ring_perf: add non-blocking ring perf test
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
                       ` (2 preceding siblings ...)
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 3/5] test_ring: add non-blocking ring autotest Gage Eads
@ 2019-01-18 15:23     ` Gage Eads
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 5/5] mempool/ring: add non-blocking ring handlers Gage Eads
                       ` (3 subsequent siblings)
  7 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-18 15:23 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

ring_nb_perf_autotest re-uses the ring_perf_autotest code: the top-level test
function now takes a 'flags' argument, and thin wrappers call it with 0 or
RING_F_NB.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring_perf.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/test/test/test_ring_perf.c b/test/test/test_ring_perf.c
index ebb3939f5..380c4b4a1 100644
--- a/test/test/test_ring_perf.c
+++ b/test/test/test_ring_perf.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 
@@ -363,12 +363,12 @@ test_bulk_enqueue_dequeue(struct rte_ring *r)
 }
 
 static int
-test_ring_perf(void)
+__test_ring_perf(unsigned int flags)
 {
 	struct lcore_pair cores;
 	struct rte_ring *r = NULL;
 
-	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), 0);
+	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), flags);
 	if (r == NULL)
 		return -1;
 
@@ -398,4 +398,17 @@ test_ring_perf(void)
 	return 0;
 }
 
+static int
+test_ring_perf(void)
+{
+	return __test_ring_perf(0);
+}
+
+static int
+test_nb_ring_perf(void)
+{
+	return __test_ring_perf(RING_F_NB);
+}
+
 REGISTER_TEST_COMMAND(ring_perf_autotest, test_ring_perf);
+REGISTER_TEST_COMMAND(ring_nb_perf_autotest, test_nb_ring_perf);
-- 
2.13.6


* [dpdk-dev] [PATCH v3 5/5] mempool/ring: add non-blocking ring handlers
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
                       ` (3 preceding siblings ...)
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 4/5] test_ring_perf: add non-blocking ring perf test Gage Eads
@ 2019-01-18 15:23     ` Gage Eads
  2019-01-22  9:27     ` [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring Ola Liljedahl
                       ` (2 subsequent siblings)
  7 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-18 15:23 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev, stephen

These handlers allow an application to create a mempool based on the
non-blocking ring, with any combination of single/multi producer/consumer.
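
To illustrate how the eight handler names relate to the ring flag, here is a
simplified standalone sketch of a name-indexed handler table. It is not the
driver code: struct sketch_ops, sketch_find_ops() and SKETCH_F_NB are
hypothetical, and the real struct rte_mempool_ops carries callbacks rather than
a flags field. Only the handler name strings are taken from the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_F_NB 0x8u /* hypothetical stand-in for RING_F_NB */

/* Reduced, hypothetical ops record: a name plus the base ring flags. */
struct sketch_ops {
	const char *name;
	unsigned int ring_flags;
};

/* The four existing handlers plus the four "_nb" variants. */
static const struct sketch_ops sketch_ops_table[] = {
	{ "ring_mp_mc",    0 },
	{ "ring_sp_sc",    0 },
	{ "ring_mp_sc",    0 },
	{ "ring_sp_mc",    0 },
	{ "ring_mp_mc_nb", SKETCH_F_NB },
	{ "ring_sp_sc_nb", SKETCH_F_NB },
	{ "ring_mp_sc_nb", SKETCH_F_NB },
	{ "ring_sp_mc_nb", SKETCH_F_NB },
};

/* Look a handler up by name; returns NULL if no handler has that name. */
static const struct sketch_ops *
sketch_find_ops(const char *name)
{
	size_t i;
	const size_t count =
		sizeof(sketch_ops_table) / sizeof(sketch_ops_table[0]);

	assert(name != NULL);

	for (i = 0; i < count; i++) {
		if (strcmp(sketch_ops_table[i].name, name) == 0)
			return &sketch_ops_table[i];
	}
	return NULL;
}
```

In an application, a handler would typically be selected by passing one of
these names to rte_mempool_set_ops_byname() on an empty mempool before it is
populated.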

Also, add a note to the programmer's guide's "known issues" section.

Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
---
 doc/guides/prog_guide/env_abstraction_layer.rst |  2 +-
 drivers/mempool/ring/Makefile                   |  1 +
 drivers/mempool/ring/meson.build                |  2 +
 drivers/mempool/ring/rte_mempool_ring.c         | 58 +++++++++++++++++++++++--
 4 files changed, 59 insertions(+), 4 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 9497b879c..b6ac236d6 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,7 +541,7 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
-  Alternatively, x86_64 applications can use the non-blocking stack mempool handler. When considering this handler, note that:
+  Alternatively, x86_64 applications can use the non-blocking ring or stack mempool handlers. When considering one of them, note that:
 
   - it is limited to the x86_64 platform, because it uses an instruction (16-byte compare-and-swap) that is not available on other platforms.
   - it has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of handler operations.
diff --git a/drivers/mempool/ring/Makefile b/drivers/mempool/ring/Makefile
index ddab522fe..012ba6966 100644
--- a/drivers/mempool/ring/Makefile
+++ b/drivers/mempool/ring/Makefile
@@ -10,6 +10,7 @@ LIB = librte_mempool_ring.a
 
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal -lrte_mempool -lrte_ring
 
 EXPORT_MAP := rte_mempool_ring_version.map
diff --git a/drivers/mempool/ring/meson.build b/drivers/mempool/ring/meson.build
index a021e908c..b1cb673cc 100644
--- a/drivers/mempool/ring/meson.build
+++ b/drivers/mempool/ring/meson.build
@@ -1,4 +1,6 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
+allow_experimental_apis = true
+
 sources = files('rte_mempool_ring.c')
diff --git a/drivers/mempool/ring/rte_mempool_ring.c b/drivers/mempool/ring/rte_mempool_ring.c
index bc123fc52..013dac3bc 100644
--- a/drivers/mempool/ring/rte_mempool_ring.c
+++ b/drivers/mempool/ring/rte_mempool_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2016 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <stdio.h>
@@ -47,11 +47,11 @@ common_ring_get_count(const struct rte_mempool *mp)
 
 
 static int
-common_ring_alloc(struct rte_mempool *mp)
+__common_ring_alloc(struct rte_mempool *mp, int rg_flags)
 {
-	int rg_flags = 0, ret;
 	char rg_name[RTE_RING_NAMESIZE];
 	struct rte_ring *r;
+	int ret;
 
 	ret = snprintf(rg_name, sizeof(rg_name),
 		RTE_MEMPOOL_MZ_FORMAT, mp->name);
@@ -82,6 +82,18 @@ common_ring_alloc(struct rte_mempool *mp)
 	return 0;
 }
 
+static int
+common_ring_alloc(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, 0);
+}
+
+static int
+common_ring_alloc_nb(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, RING_F_NB);
+}
+
 static void
 common_ring_free(struct rte_mempool *mp)
 {
@@ -130,7 +142,47 @@ static const struct rte_mempool_ops ops_sp_mc = {
 	.get_count = common_ring_get_count,
 };
 
+static const struct rte_mempool_ops ops_mp_mc_nb = {
+	.name = "ring_mp_mc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_sc_nb = {
+	.name = "ring_sp_sc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_mp_sc_nb = {
+	.name = "ring_mp_sc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_mc_nb = {
+	.name = "ring_sp_mc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
 MEMPOOL_REGISTER_OPS(ops_mp_mc);
 MEMPOOL_REGISTER_OPS(ops_sp_sc);
 MEMPOOL_REGISTER_OPS(ops_mp_sc);
 MEMPOOL_REGISTER_OPS(ops_sp_mc);
+MEMPOOL_REGISTER_OPS(ops_mp_mc_nb);
+MEMPOOL_REGISTER_OPS(ops_sp_sc_nb);
+MEMPOOL_REGISTER_OPS(ops_mp_sc_nb);
+MEMPOOL_REGISTER_OPS(ops_sp_mc_nb);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-11 19:27           ` Eads, Gage
@ 2019-01-21 14:14             ` Burakov, Anatoly
  2019-01-22 18:27               ` Eads, Gage
  0 siblings, 1 reply; 123+ messages in thread
From: Burakov, Anatoly @ 2019-01-21 14:14 UTC (permalink / raw)
  To: Eads, Gage, Richardson, Bruce
  Cc: dev, olivier.matz, arybchenko, Ananyev, Konstantin

On 11-Jan-19 7:27 PM, Eads, Gage wrote:
> 
> 
>> -----Original Message-----
>> From: Richardson, Bruce
>> Sent: Friday, January 11, 2019 5:59 AM
>> To: Burakov, Anatoly <anatoly.burakov@intel.com>
>> Cc: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org;
>> olivier.matz@6wind.com; arybchenko@solarflare.com; Ananyev, Konstantin
>> <konstantin.ananyev@intel.com>
>> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width
>> size
>>
>> On Fri, Jan 11, 2019 at 11:30:24AM +0000, Burakov, Anatoly wrote:
>>> On 11-Jan-19 10:58 AM, Bruce Richardson wrote:
>>>> On Fri, Jan 11, 2019 at 10:40:19AM +0000, Burakov, Anatoly wrote:
>>>>> <...>
>>>>>
>>>>>> + * Copyright(c) 2016-2019 Intel Corporation
>>>>>>      */
>>>>>>     /**
>>>>>> @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct
>> rte_event_ring *r,
>>>>>>     		const struct rte_event *events,
>>>>>>     		unsigned int n, uint16_t *free_space)
>>>>>>     {
>>>>>> -	uint32_t prod_head, prod_next;
>>>>>> +	uintptr_t prod_head, prod_next;
>>>>>
>>>>> I would also question the use of uintptr_t. I think semantically,
>>>>> size_t is more appropriate.
>>>>>
>>>> Yes, it would, but I believe in this case they want to use the
>>>> largest size of (unsigned)int where there exists an atomic for
>>>> manipulating 2 of them simultaneously. [The largest size is to
>>>> minimize any chance of an ABA issue occurring]. Therefore we need
>>>> 32-bit values on 32-bit and 64-bit on 64, and I suspect the best way
>>>> to guarantee this is to use pointer-sized values. If size_t is
>>>> guaranteed across all OS's to have the same size as uintptr_t it could also be
>> used, though.
>>>>
>>>> /Bruce
>>>>
>>>
>>> Technically, size_t and uintptr_t are not guaranteed to match. In
>>> practice, they won't match only on architectures that DPDK doesn't
>>> intend to run on (such as 16-bit segmented archs, where size_t would
>>> be 16-bit but uintptr_t would be 32-bit).
>>>
>>> In all the rest of DPDK code, we use size_t for this kind of thing.
>>>
>>
>> Ok.
>> If we do use size_t, I think we also need to add a compile-time check into the
>> build too, to error out if sizeof(size_t) != sizeof(uintptr_t).
> 
> Ok, I wasn't aware of the precedent of using size_t for this purpose. I'll change it and look into adding a static assert.

RTE_BUILD_BUG_ON?
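
For illustration, a minimal sketch of the compile-time check being
discussed. BUILD_BUG_ON below is a hypothetical stand-in for DPDK's
RTE_BUILD_BUG_ON (so the snippet builds without DPDK headers), and
index_is_pointer_width() is an invented helper, not part of the patch:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for RTE_BUILD_BUG_ON(): the build fails when
 * 'cond' is true.
 */
#define BUILD_BUG_ON(cond) static_assert(!(cond), "build bug: " #cond)

/* The ring indexes are assumed to be pointer-width, so size_t may only
 * replace uintptr_t where the two types have the same size.
 */
BUILD_BUG_ON(sizeof(size_t) != sizeof(uintptr_t));

/* Runtime view of the same property, handy in a unit test. */
static inline int
index_is_pointer_width(void)
{
	return sizeof(size_t) == sizeof(void *);
}
```

With the real macro, the check would presumably sit inside a function
body such as rte_ring_init(), since RTE_BUILD_BUG_ON is normally used
as a statement.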

> 
> Thanks,
> Gage
> 


-- 
Thanks,
Anatoly

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
                       ` (4 preceding siblings ...)
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 5/5] mempool/ring: add non-blocking ring handlers Gage Eads
@ 2019-01-22  9:27     ` Ola Liljedahl
  2019-01-22 10:15       ` Ola Liljedahl
                         ` (2 more replies)
  2019-01-25  5:20     ` [dpdk-dev] " Honnappa Nagarahalli
  2019-01-28 18:14     ` [dpdk-dev] [PATCH v4 " Gage Eads
  7 siblings, 3 replies; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-22  9:27 UTC (permalink / raw)
  To: gage.eads, dev
  Cc: olivier.matz, stephen, bruce.richardson, arybchenko, konstantin.ananyev

On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> For some users, the rte ring's "non-preemptive" constraint is not
> acceptable;
> for example, if the application uses a mixture of pinned high-
> priority threads
> and multiplexed low-priority threads that share a mempool.
>
> This patchset introduces a non-blocking ring, on top of which a
> mempool can run.
> Crucially, the non-blocking algorithm relies on a 128-bit compare-
> and-swap, so
> it is currently limited to x86_64 machines. This is also an
> experimental API,
> so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.
>
> The ring uses more compare-and-swap atomic operations than the
> regular rte ring:
> With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> operations and a
> dequeue of n pointers uses 2. This algorithm has worse average-case
> performance
> than the regular rte ring (particularly a highly-contended ring with
> large bulk
> accesses), however:
> - For applications with preemptible pthreads, the regular rte ring's
> worst-case
>   performance (i.e. one thread being preempted in the update_tail()
> critical
>   section) is much worse than the non-blocking ring's.
> - Software caching can mitigate the average case performance for
> ring-based
>   algorithms. For example, a non-blocking ring based mempool (a
> likely use case
>   for this ring) with per-thread caching.
>
> The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-
> of-use,
> existing ring enqueue/dequeue functions work with both "regular" and
> non-blocking rings.
>
> This patchset also adds non-blocking versions of ring_autotest and
> ring_perf_autotest, and a non-blocking ring based mempool.
>
> This patchset makes one API change; a deprecation notice will be
> posted in a
> separate commit.
>
> This patchset depends on the non-blocking stack patchset[1].
>
> [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
>
> v3:
>  - Avoid the ABI break by putting 64-bit head and tail values in the
> same
>    cacheline as struct rte_ring's prod and cons members.
>  - Don't attempt to compile rte_atomic128_cmpset without
>    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> libraries.
>  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> someone tries
>    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
>  - Update the ring mempool to use experimental APIs
>  - Clarify that RING_F_NB is only limited to x86_64 currently;
> ARMv8.1-A builds
>    can eventually support it with the CASP instruction.
ARMv8.0 should be able to implement a 128-bit atomic compare exchange
operation using LDXP/STXP.

From an ARM perspective, I want all atomic operations to take memory
ordering arguments (e.g. acquire, release). Not all usages of e.g.
atomic compare exchange require sequential consistency (which I think
is what the x86 cmpxchg instruction provides). DPDK functions should
not be modelled after x86 behaviour.

Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
are available here:
https://github.com/ARM-software/progress64/blob/master/src/lockfree.h

>
> v2:
>  - Merge separate docs commit into patch #5
>  - Convert uintptr_t to size_t
>  - Add a compile-time check for the size of size_t
>  - Fix a space-after-typecast issue
>  - Fix an unnecessary-parentheses checkpatch warning
>  - Bump librte_ring's library version
>
> Gage Eads (5):
>   ring: add 64-bit headtail structure
>   ring: add a non-blocking implementation
>   test_ring: add non-blocking ring autotest
>   test_ring_perf: add non-blocking ring perf test
>   mempool/ring: add non-blocking ring handlers
>
>  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
>  drivers/mempool/ring/Makefile                   |   1 +
>  drivers/mempool/ring/meson.build                |   2 +
>  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
>  lib/librte_eventdev/rte_event_ring.h            |   2 +-
>  lib/librte_ring/Makefile                        |   3 +-
>  lib/librte_ring/rte_ring.c                      |  72 ++-
>  lib/librte_ring/rte_ring.h                      | 574
> ++++++++++++++++++++++--
>  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
>  lib/librte_ring/rte_ring_version.map            |   7 +
>  test/test/test_ring.c                           |  57 ++-
>  test/test/test_ring_perf.c                      |  19 +-
>  12 files changed, 874 insertions(+), 75 deletions(-)
>  create mode 100644 lib/librte_ring/rte_ring_generic_64.h
>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation Gage Eads
@ 2019-01-22 10:12       ` Ola Liljedahl
  2019-01-22 14:49       ` Ola Liljedahl
  1 sibling, 0 replies; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-22 10:12 UTC (permalink / raw)
  To: gage.eads, dev
  Cc: olivier.matz, stephen, bruce.richardson, arybchenko, konstantin.ananyev

On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> This commit adds support for non-blocking circular ring enqueue and
> dequeue
> functions. The ring uses a 128-bit compare-and-swap instruction, and
> thus
> is currently limited to x86_64.
>
> The algorithm is based on the original rte ring (derived from
> FreeBSD's
> bufring.h) and inspired by Michael and Scott's non-blocking
> concurrent
> queue. Importantly, it adds a modification counter to each ring entry
> to
> ensure only one thread can write to an unused entry.
> -----
> Algorithm:
>
> Multi-producer non-blocking enqueue:
> 1. Move the producer head index 'n' locations forward, effectively
>    reserving 'n' locations.
> 2. For each pointer:
>  a. Read the producer tail index, then ring[tail]. If ring[tail]'s
>     modification counter isn't 'tail', retry.
>  b. Construct the new entry: {pointer, tail + ring size}
>  c. Compare-and-swap the old entry with the new. If unsuccessful, the
>     next loop iteration will try to enqueue this pointer again.
>  d. Compare-and-swap the tail index with 'tail + 1', whether or not
> step 2c
>     succeeded. This guarantees threads can make forward progress.
>
> Multi-consumer non-blocking dequeue:
> 1. Move the consumer head index 'n' locations forward, effectively
>    reserving 'n' pointers to be dequeued.
> 2. Copy 'n' pointers into the caller's object table (ignoring the
>    modification counter), starting from ring[tail], then compare-and-
> swap
>    the tail index with 'tail + n'.  If unsuccessful, repeat step 2.
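
To make steps 2a-2c of the enqueue concrete, here is a single-threaded
model of the per-entry modification counter. It is only a sketch: the
names (model_entry, model_try_enqueue, MODEL_RING_SIZE) are invented
for illustration, and the check-and-write that the patch performs with
one 128-bit compare-and-swap is modelled as plain code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RING_SIZE 4u /* hypothetical size, must be a power of two */

struct model_entry {
	void *ptr;    /* data pointer */
	uint64_t cnt; /* tail index that is allowed to write this slot */
};

/* Same idea as the init loop in rte_ring_init(): slot i starts at cnt == i. */
static void
model_init(struct model_entry *ring)
{
	uint64_t i;

	assert(ring != NULL);
	for (i = 0; i < MODEL_RING_SIZE; i++) {
		ring[i].ptr = NULL;
		ring[i].cnt = i;
	}
}

/* Steps 2a-2c for one pointer, given a possibly stale tail snapshot.
 * Returns 1 if the slot was written, 0 if the snapshot was stale.
 * In the real algorithm the compare and the write are one atomic CAS.
 */
static int
model_try_enqueue(struct model_entry *ring, uint64_t tail, void *obj)
{
	struct model_entry *e = &ring[tail & (MODEL_RING_SIZE - 1)];

	if (e->cnt != tail)
		return 0; /* another producer already filled this slot */

	e->ptr = obj;
	e->cnt = tail + MODEL_RING_SIZE; /* expected value for the next lap */
	return 1;
}
```

A second call with the same tail value fails, which is how a producer
holding a stale tail is kept from overwriting valid data.
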
>
> -----
> Discussion:
>
> There are two cases where the ABA problem is mitigated:
> 1. Enqueueing a pointer to the ring: without a modification counter
>    tied to the tail index, the index could become stale by the time
> the
>    enqueue happens, causing it to overwrite valid data. Tying the
>    counter to the tail index gives us an expected value (as opposed
> to,
>    say, a monotonically incrementing counter).
>
>    Since the counter will eventually wrap, there is potential for the
> ABA
>    problem. However, using a 64-bit counter makes this likelihood
>    effectively zero.
>
> 2. Updating a tail index: the ABA problem can occur if the thread is
>    preempted and the tail index wraps around. However, using 64-bit
> indexes
>    makes this likelihood effectively zero.
>
> With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> operations
> and a dequeue of n pointers uses 2. This algorithm has worse average-
> case
> performance than the regular rte ring (particularly a highly-
> contended ring
> with large bulk accesses), however:
> - For applications with preemptible pthreads, the regular rte ring's
>   worst-case performance (i.e. one thread being preempted in the
>   update_tail() critical section) is much worse than the non-blocking
>   ring's.
> - Software caching can mitigate the average case performance for
> ring-based
>   algorithms. For example, a non-blocking ring based mempool (a
> likely use
>   case for this ring) with per-thread caching.
>
> The non-blocking ring is enabled via a new flag, RING_F_NB. Because
> the
> ring's memsize is now a function of its flags (the non-blocking ring
> requires 128b for each entry), this commit adds a new argument
> ('flags') to
> rte_ring_get_memsize(). An API deprecation notice will be sent in a
> separate commit.
>
> For ease-of-use, existing ring enqueue and dequeue functions work on
> both
> regular and non-blocking rings. This introduces an additional branch
> in
> the datapath, but this should be a highly predictable branch.
> ring_perf_autotest shows a negligible performance impact; it's hard
> to
> distinguish a real difference versus system noise.
>
>                                   | ring_perf_autotest cycles with
> branch -
>              Test                 |   ring_perf_autotest cycles
> without
> ------------------------------------------------------------------
> SP/SC single enq/dequeue          | 0.33
> MP/MC single enq/dequeue          | -4.00
> SP/SC burst enq/dequeue (size 8)  | 0.00
> MP/MC burst enq/dequeue (size 8)  | 0.00
> SP/SC burst enq/dequeue (size 32) | 0.00
> MP/MC burst enq/dequeue (size 32) | 0.00
> SC empty dequeue                  | 1.00
> MC empty dequeue                  | 0.00
>
> Single lcore:
> SP/SC bulk enq/dequeue (size 8)   | 0.49
> MP/MC bulk enq/dequeue (size 8)   | 0.08
> SP/SC bulk enq/dequeue (size 32)  | 0.07
> MP/MC bulk enq/dequeue (size 32)  | 0.09
>
> Two physical cores:
> SP/SC bulk enq/dequeue (size 8)   | 0.19
> MP/MC bulk enq/dequeue (size 8)   | -0.37
> SP/SC bulk enq/dequeue (size 32)  | 0.09
> MP/MC bulk enq/dequeue (size 32)  | -0.05
>
> Two NUMA nodes:
> SP/SC bulk enq/dequeue (size 8)   | -1.96
> MP/MC bulk enq/dequeue (size 8)   | 0.88
> SP/SC bulk enq/dequeue (size 32)  | 0.10
> MP/MC bulk enq/dequeue (size 32)  | 0.46
>
> Test setup: x86_64 build with default config, dual-socket Xeon E5-
> 2699 v4,
> running on isolcpus cores with a tickless scheduler. Each test run
> three
> times and the results averaged.
>
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  lib/librte_ring/rte_ring.c           |  72 ++++-
>  lib/librte_ring/rte_ring.h           | 550
> +++++++++++++++++++++++++++++++++--
>  lib/librte_ring/rte_ring_version.map |   7 +
>  3 files changed, 587 insertions(+), 42 deletions(-)
>
> diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> index d215acecc..f3378dccd 100644
> --- a/lib/librte_ring/rte_ring.c
> +++ b/lib/librte_ring/rte_ring.c
> @@ -45,9 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
>
>  /* return the size of memory occupied by a ring */
>  ssize_t
> -rte_ring_get_memsize(unsigned count)
> +rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags)
>  {
> -ssize_t sz;
> +ssize_t sz, elt_sz;
>
>  /* count must be a power of 2 */
>  if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
> @@ -57,10 +57,23 @@ rte_ring_get_memsize(unsigned count)
>  return -EINVAL;
>  }
>
> -sz = sizeof(struct rte_ring) + count * sizeof(void *);
> +elt_sz = (flags & RING_F_NB) ? 2 * sizeof(void *) :
> sizeof(void *);
> +
> +sz = sizeof(struct rte_ring) + count * elt_sz;
>  sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
>  return sz;
>  }
> +BIND_DEFAULT_SYMBOL(rte_ring_get_memsize, _v1905, 19.05);
> +MAP_STATIC_SYMBOL(ssize_t rte_ring_get_memsize(unsigned int count,
> +       unsigned int flags),
> +  rte_ring_get_memsize_v1905);
> +
> +ssize_t
> +rte_ring_get_memsize_v20(unsigned int count)
> +{
> +return rte_ring_get_memsize_v1905(count, 0);
> +}
> +VERSION_SYMBOL(rte_ring_get_memsize, _v20, 2.0);
>
>  int
>  rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> @@ -82,8 +95,6 @@ rte_ring_init(struct rte_ring *r, const char *name,
> unsigned count,
>  if (ret < 0 || ret >= (int)sizeof(r->name))
>  return -ENAMETOOLONG;
>  r->flags = flags;
> -r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP :
> __IS_MP;
> -r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC :
> __IS_MC;
>
>  if (flags & RING_F_EXACT_SZ) {
>  r->size = rte_align32pow2(count + 1);
> @@ -100,8 +111,30 @@ rte_ring_init(struct rte_ring *r, const char
> *name, unsigned count,
>  r->mask = count - 1;
>  r->capacity = r->mask;
>  }
> -r->prod.head = r->cons.head = 0;
> -r->prod.tail = r->cons.tail = 0;
> +
> +if (flags & RING_F_NB) {
> +uint64_t i;
> +
> +r->prod_64.single = (flags & RING_F_SP_ENQ) ?
> __IS_SP : __IS_MP;
> +r->cons_64.single = (flags & RING_F_SC_DEQ) ?
> __IS_SC : __IS_MC;
> +r->prod_64.head = r->cons_64.head = 0;
> +r->prod_64.tail = r->cons_64.tail = 0;
> +
> +for (i = 0; i < r->size; i++) {
> +struct nb_ring_entry *ring_ptr, *base;
> +
> +base = ((struct nb_ring_entry *)&r[1]);
> +
> +ring_ptr = &base[i & r->mask];
> +
> +ring_ptr->cnt = i;
> +}
> +} else {
> +r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP :
> __IS_MP;
> +r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC :
> __IS_MC;
> +r->prod.head = r->cons.head = 0;
> +r->prod.tail = r->cons.tail = 0;
> +}
>
>  return 0;
>  }
> @@ -123,11 +156,19 @@ rte_ring_create(const char *name, unsigned
> count, int socket_id,
>
>  ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head,
> rte_ring_list);
>
> +#if !defined(RTE_ARCH_X86_64)
> +if (flags & RING_F_NB) {
> +printf("RING_F_NB is only supported on x86-64
> platforms\n");
> +rte_errno = EINVAL;
> +return NULL;
> +}
> +#endif
> +
>  /* for an exact size ring, round up from count to a power of
> two */
>  if (flags & RING_F_EXACT_SZ)
>  count = rte_align32pow2(count + 1);
>
> -ring_size = rte_ring_get_memsize(count);
> +ring_size = rte_ring_get_memsize(count, flags);
>  if (ring_size < 0) {
>  rte_errno = ring_size;
>  return NULL;
> @@ -227,10 +268,17 @@ rte_ring_dump(FILE *f, const struct rte_ring
> *r)
>  fprintf(f, "  flags=%x\n", r->flags);
>  fprintf(f, "  size=%"PRIu32"\n", r->size);
>  fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
> -fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
> -fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
> -fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
> -fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
> +if (r->flags & RING_F_NB) {
> +fprintf(f, "  ct=%"PRIu64"\n", r->cons_64.tail);
> +fprintf(f, "  ch=%"PRIu64"\n", r->cons_64.head);
> +fprintf(f, "  pt=%"PRIu64"\n", r->prod_64.tail);
> +fprintf(f, "  ph=%"PRIu64"\n", r->prod_64.head);
> +} else {
> +fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
> +fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
> +fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
> +fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
> +}
>  fprintf(f, "  used=%u\n", rte_ring_count(r));
>  fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
>  }
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index b270a4746..08c9de6a6 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -134,6 +134,18 @@ struct rte_ring {
>   */
>  #define RING_F_EXACT_SZ 0x0004
>  #define RTE_RING_SZ_MASK  (0x7fffffffU) /**< Ring size mask */
> +/**
> + * The ring uses non-blocking enqueue and dequeue functions. These
> functions
> + * do not have the "non-preemptive" constraint of a regular rte
> ring, and thus
> + * are suited for applications using preemptible pthreads. However,
> the
> + * non-blocking functions have worse average-case performance than
> their
> + * regular rte ring counterparts. When used as the handler for a
> mempool,
> + * per-thread caching can mitigate the performance difference by
> reducing the
> + * number (and contention) of ring accesses.
> + *
> + * This flag is only supported on x86_64 platforms.
> + */
> +#define RING_F_NB 0x0008
>
>  /* @internal defines for passing to the enqueue dequeue worker
> functions */
>  #define __IS_SP 1
> @@ -151,11 +163,15 @@ struct rte_ring {
>   *
>   * @param count
>   *   The number of elements in the ring (must be a power of 2).
> + * @param flags
> + *   The flags the ring will be created with.
>   * @return
>   *   - The memory size needed for the ring on success.
>   *   - -EINVAL if count is not a power of 2.
>   */
> -ssize_t rte_ring_get_memsize(unsigned count);
> +ssize_t rte_ring_get_memsize(unsigned int count, unsigned int
> flags);
> +ssize_t rte_ring_get_memsize_v20(unsigned int count);
> +ssize_t rte_ring_get_memsize_v1905(unsigned int count, unsigned int
> flags);
>
>  /**
>   * Initialize a ring structure.
> @@ -188,6 +204,10 @@ ssize_t rte_ring_get_memsize(unsigned count);
>   *    - RING_F_SC_DEQ: If this flag is set, the default behavior
> when
>   *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
>   *      is "single-consumer". Otherwise, it is "multi-consumers".
> + *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-
> power-of-2
> + *      number, but up to half the ring space may be wasted.
> + *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
> + *      non-blocking variants of the dequeue and enqueue functions.
>   * @return
>   *   0 on success, or a negative value on error.
>   */
> @@ -223,12 +243,17 @@ int rte_ring_init(struct rte_ring *r, const
> char *name, unsigned count,
>   *    - RING_F_SC_DEQ: If this flag is set, the default behavior
> when
>   *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
>   *      is "single-consumer". Otherwise, it is "multi-consumers".
> + *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-
> power-of-2
> + *      number, but up to half the ring space may be wasted.
> + *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
> + *      non-blocking variants of the dequeue and enqueue functions.
>   * @return
>   *   On success, the pointer to the new allocated ring. NULL on
> error with
>   *    rte_errno set appropriately. Possible errno values include:
>   *    - E_RTE_NO_CONFIG - function could not get pointer to
> rte_config structure
>   *    - E_RTE_SECONDARY - function was called from a secondary
> process instance
> - *    - EINVAL - count provided is not a power of 2
> + *    - EINVAL - count provided is not a power of 2, or RING_F_NB is
> used on an
> + *      unsupported platform
>   *    - ENOSPC - the maximum number of memzones has already been
> allocated
>   *    - EEXIST - a memzone with the same name already exists
>   *    - ENOMEM - no appropriate memory area found in which to create
> memzone
> @@ -284,6 +309,50 @@ void rte_ring_dump(FILE *f, const struct
> rte_ring *r);
>  } \
>  } while (0)
>
> +/* The actual enqueue of pointers on the ring.
> + * Used only by the single-producer non-blocking enqueue function,
> but
> + * out-lined here for code readability.
> + */
> +#define ENQUEUE_PTRS_NB(r, ring_start, prod_head, obj_table, n) do {
> \
> +unsigned int i; \
> +const uint32_t size = (r)->size; \
> +size_t idx = prod_head & (r)->mask; \
> +size_t new_cnt = prod_head + size; \
> +struct nb_ring_entry *ring = (struct nb_ring_entry
> *)ring_start; \
> +if (likely(idx + n < size)) { \
> +for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4,
> idx += 4) { \
> +ring[idx].ptr = obj_table[i]; \
> +ring[idx].cnt = new_cnt + i;  \
> +ring[idx + 1].ptr = obj_table[i + 1]; \
> +ring[idx + 1].cnt = new_cnt + i + 1;  \
> +ring[idx + 2].ptr = obj_table[i + 2]; \
> +ring[idx + 2].cnt = new_cnt + i + 2;  \
> +ring[idx + 3].ptr = obj_table[i + 3]; \
> +ring[idx + 3].cnt = new_cnt + i + 3;  \
> +} \
> +switch (n & 0x3) { \
> +case 3: \
> +ring[idx].cnt = new_cnt + i; \
> +ring[idx++].ptr = obj_table[i++]; /*
> fallthrough */ \
> +case 2: \
> +ring[idx].cnt = new_cnt + i; \
> +ring[idx++].ptr = obj_table[i++]; /*
> fallthrough */ \
> +case 1: \
> +ring[idx].cnt = new_cnt + i; \
> +ring[idx++].ptr = obj_table[i++]; \
> +} \
> +} else { \
> +for (i = 0; idx < size; i++, idx++) { \
> +ring[idx].cnt = new_cnt + i;  \
> +ring[idx].ptr = obj_table[i]; \
> +} \
> +for (idx = 0; i < n; i++, idx++) {    \
> +ring[idx].cnt = new_cnt + i;  \
> +ring[idx].ptr = obj_table[i]; \
> +} \
> +} \
> +} while (0)
> +
>  /* the actual copy of pointers on the ring to obj_table.
>   * Placed here since identical code needed in both
>   * single and multi consumer dequeue functions */
> @@ -315,6 +384,39 @@ void rte_ring_dump(FILE *f, const struct
> rte_ring *r);
>  } \
>  } while (0)
>
> +/* The actual copy of pointers on the ring to obj_table.
> + * Placed here since identical code needed in both
> + * single and multi consumer non-blocking dequeue functions.
> + */
> +#define DEQUEUE_PTRS_NB(r, ring_start, cons_head, obj_table, n) do {
> \
> +unsigned int i; \
> +size_t idx = cons_head & (r)->mask; \
> +const uint32_t size = (r)->size; \
> +struct nb_ring_entry *ring = (struct nb_ring_entry
> *)ring_start; \
> +if (likely(idx + n < size)) { \
> +for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx
> += 4) {\
> +obj_table[i] = ring[idx].ptr; \
> +obj_table[i + 1] = ring[idx + 1].ptr; \
> +obj_table[i + 2] = ring[idx + 2].ptr; \
> +obj_table[i + 3] = ring[idx + 3].ptr; \
> +} \
> +switch (n & 0x3) { \
> +case 3: \
> +obj_table[i++] = ring[idx++].ptr; /*
> fallthrough */ \
> +case 2: \
> +obj_table[i++] = ring[idx++].ptr; /*
> fallthrough */ \
> +case 1: \
> +obj_table[i++] = ring[idx++].ptr; \
> +} \
> +} else { \
> +for (i = 0; idx < size; i++, idx++) \
> +obj_table[i] = ring[idx].ptr; \
> +for (idx = 0; i < n; i++, idx++) \
> +obj_table[i] = ring[idx].ptr; \
> +} \
> +} while (0)
> +
> +
>  /* Between load and load. there might be cpu reorder in weak model
>   * (powerpc/arm).
>   * There are 2 choices for the users
> @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> rte_ring *r);
>  #endif
>  #include "rte_ring_generic_64.h"
>
> +/* @internal 128-bit structure used by the non-blocking ring */
> +struct nb_ring_entry {
> +void *ptr; /**< Data pointer */
> +uint64_t cnt; /**< Modification counter */
Why not make 'cnt' uintptr_t? This way 32-bit architectures will also
be supported.
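
Something along these lines, as a sketch only (nb_ring_entry_ptrw and
nb_entry_size are made-up names). Note that on a 32-bit build the
counter would wrap after 2^32 updates, so the ABA protection is weaker
there than with a 64-bit counter:

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of nb_ring_entry with a pointer-width counter. */
struct nb_ring_entry_ptrw {
	void *ptr;     /* data pointer */
	uintptr_t cnt; /* modification counter, same width as ptr */
};

/* The entry must be exactly two pointers wide so that a double-width
 * compare-and-swap (64-bit on 32-bit targets, 128-bit on 64-bit
 * targets) covers both fields.
 */
static_assert(sizeof(struct nb_ring_entry_ptrw) == 2 * sizeof(void *),
	      "ring entry must be two pointers wide");

static inline size_t
nb_entry_size(void)
{
	return sizeof(struct nb_ring_entry_ptrw);
}
```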

> +};
> +
> +/* The non-blocking ring algorithm is based on the original rte ring
> (derived
> + * from FreeBSD's bufring.h) and inspired by Michael and Scott's
> non-blocking
> + * concurrent queue.
> + */
> +
> +/**
> + * @internal
> + *   Enqueue several objects on the non-blocking ring (single-
> producer only)
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> ring
> + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> the ring
> + * @param free_space
> + *   returns the amount of space after the enqueue operation has
> finished
> + * @return
> + *   Actual number of objects enqueued.
> + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const
> *obj_table,
> +    unsigned int n,
> +    enum rte_ring_queue_behavior behavior,
> +    unsigned int *free_space)
> +{
> +uint32_t free_entries;
> +size_t head, next;
> +
> +n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> + &head, &next,
> &free_entries);
> +if (n == 0)
> +goto end;
> +
> +ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> +
> +r->prod_64.tail += n;
Don't we need release order (or smp_wmb) between writing of the ring
pointers and the update of tail? By updating the tail pointer, we are
synchronising with a consumer.
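
A sketch of the ordering I have in mind, using C11 atomics on a
simplified single-producer/single-consumer ring. This is not the
patch's code: sketch_ring, its fields and SKETCH_RING_SIZE are invented
for illustration, and the separate head indexes are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RING_SIZE 8u /* hypothetical size, must be a power of two */

struct sketch_ring {
	void *slots[SKETCH_RING_SIZE];
	_Atomic uint64_t prod_tail; /* written by the producer only */
	_Atomic uint64_t cons_tail; /* written by the consumer only */
};

/* Single producer: write the slot, then publish it with a release store. */
static inline int
sketch_sp_enqueue(struct sketch_ring *r, void *obj)
{
	uint64_t pt, ct;

	assert(r != NULL);
	pt = atomic_load_explicit(&r->prod_tail, memory_order_relaxed);
	ct = atomic_load_explicit(&r->cons_tail, memory_order_acquire);
	if (pt - ct == SKETCH_RING_SIZE)
		return 0; /* full */

	r->slots[pt & (SKETCH_RING_SIZE - 1)] = obj;
	/* Release: the slot write above is visible before the new tail. */
	atomic_store_explicit(&r->prod_tail, pt + 1, memory_order_release);
	return 1;
}

/* Single consumer: the acquire load pairs with the release store above. */
static inline void *
sketch_sc_dequeue(struct sketch_ring *r)
{
	uint64_t ct, pt;
	void *obj;

	assert(r != NULL);
	ct = atomic_load_explicit(&r->cons_tail, memory_order_relaxed);
	pt = atomic_load_explicit(&r->prod_tail, memory_order_acquire);
	if (pt == ct)
		return NULL; /* empty */

	obj = r->slots[ct & (SKETCH_RING_SIZE - 1)];
	/* Release: the slot read completes before the slot is handed back. */
	atomic_store_explicit(&r->cons_tail, ct + 1, memory_order_release);
	return obj;
}
```

On x86 the plain store in the patch happens to be ordered after the
earlier stores, but on weakly ordered architectures the release (or an
explicit write barrier) is what makes the ring contents visible before
the tail update.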

> +
> +end:
> +if (free_space != NULL)
> +*free_space = free_entries - n;
> +return n;
> +}
> +
> +/**
> + * @internal
> + *   Enqueue several objects on the non-blocking ring (multi-
> producer safe)
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> ring
> + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> the ring
> + * @param free_space
> + *   returns the amount of space after the enqueue operation has
> finished
> + * @return
> + *   Actual number of objects enqueued.
> + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const
> *obj_table,
> +    unsigned int n,
> +    enum rte_ring_queue_behavior behavior,
> +    unsigned int *free_space)
> +{
> +#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
> +RTE_SET_USED(r);
> +RTE_SET_USED(obj_table);
> +RTE_SET_USED(n);
> +RTE_SET_USED(behavior);
> +RTE_SET_USED(free_space);
> +#ifndef ALLOW_EXPERIMENTAL_API
> +printf("[%s()] RING_F_NB requires an experimental API."
> +       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
> +       , __func__);
> +#endif
> +return 0;
> +#endif
> +#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
> +size_t head, next, tail;
> +uint32_t free_entries;
> +unsigned int i;
> +
> +n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
> + &head, &next,
> &free_entries);
> +if (n == 0)
> +goto end;
> +
> +for (i = 0; i < n; /* i incremented if enqueue succeeds */)
> {
> +struct nb_ring_entry old_value, new_value;
> +struct nb_ring_entry *ring_ptr;
> +
> +/* Enqueue to the tail entry. If another thread wins
> the race,
> + * retry with the new tail.
> + */
> +tail = r->prod_64.tail;
> +
> +ring_ptr = &((struct nb_ring_entry *)&r[1])[tail &
> r->mask];
This is a very ugly cast. Also I think it is unnecessary. What's
preventing this from being written without a cast? Perhaps the ring
array needs to be a union of "void *" and struct nb_ring_entry?

> +
> +old_value = *ring_ptr;
> +
> +/* If the tail entry's modification counter doesn't
> match the
> + * producer tail index, it's already been updated.
> + */
> +if (old_value.cnt != tail)
> +continue;
Continue restarts the loop at the condition test in the for statement;
'i' and 'n' are unchanged. Then we re-read 'prod_64.tail' and
'ring[tail]'. If some other thread never updates 'prod_64.tail', the
test here (ring[tail].cnt != tail) will still be true and we will spin
forever.
Waiting for other threads <=> blocking behaviour, so this is not a
non-blocking design.
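A common way out is "helping": a thread that finds the slot already written tries to advance the tail itself before retrying, so it never depends on the other thread being scheduled. The sketch below is a deliberately simplified model, not the patch's code. It packs a 32-bit counter and a 32-bit payload into one 64-bit word so that a plain 64-bit CAS can stand in for the 128-bit one, it omits the head reservation and any fullness check, and the memory orders are my assumption. All toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SIZE 8u
#define TOY_MASK (TOY_SIZE - 1)

/* Simplified ring: each slot is {cnt:32, val:32} packed in 64 bits. */
struct toy_ring {
	uint64_t tail;
	uint64_t slot[TOY_SIZE];
};

static inline uint64_t toy_pack(uint32_t cnt, uint32_t val)
{
	return ((uint64_t)cnt << 32) | val;
}

static inline uint32_t toy_cnt(uint64_t slot)
{
	return (uint32_t)(slot >> 32);
}

static inline uint32_t toy_val(uint64_t slot)
{
	return (uint32_t)slot;
}

/* Slot i starts with counter i, matching the ring init in the patch. */
static void toy_init(struct toy_ring *r)
{
	r->tail = 0;
	for (uint32_t i = 0; i < TOY_SIZE; i++)
		r->slot[i] = toy_pack(i, 0);
}

/* Enqueue one value without ever waiting for another thread's tail update. */
static void toy_enqueue_one(struct toy_ring *r, uint32_t val)
{
	for (;;) {
		uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
		uint64_t *slot = &r->slot[tail & TOY_MASK];
		uint64_t old = __atomic_load_n(slot, __ATOMIC_ACQUIRE);
		uint64_t expected = tail;

		if (toy_cnt(old) != (uint32_t)tail) {
			/* Slot already written for this tail value: help move
			 * the tail forward instead of spinning on it. The CAS
			 * fails harmlessly if the tail has moved meanwhile.
			 */
			__atomic_compare_exchange_n(&r->tail, &expected,
					tail + 1, false,
					__ATOMIC_RELEASE, __ATOMIC_RELAXED);
			continue;
		}

		bool won = __atomic_compare_exchange_n(slot, &old,
				toy_pack((uint32_t)tail + TOY_SIZE, val), false,
				__ATOMIC_ACQ_REL, __ATOMIC_RELAXED);

		/* Every thread tries to publish the new tail, win or lose. */
		__atomic_compare_exchange_n(&r->tail, &expected, tail + 1,
				false, __ATOMIC_RELEASE, __ATOMIC_RELAXED);
		if (won)
			return;
	}
}
```

The interesting case is a producer that wrote its slot and was preempted before the tail CAS: with the helping branch, the next producer advances the tail on its behalf and then enqueues into the following slot.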

> +
> +/* Prepare the new entry. The cnt field mitigates the ABA
> + * problem on the ring write.
> + */
> +new_value.ptr = obj_table[i];
> +new_value.cnt = tail + r->size;
> +
> +if (rte_atomic128_cmpset((volatile rte_int128_t *)ring_ptr,
> + (rte_int128_t *)&old_value,
> + (rte_int128_t *)&new_value))
> +i++;
> +
> +/* Every thread attempts the cmpset, so they don't have to wait
> + * for the thread that successfully enqueued to the ring.
> + * Using a 64-bit tail mitigates the ABA problem here.
> + *
> + * Built-in used to handle variable-sized tail index.
> + */
But prod_64.tail is 64 bits so not really variable size?

> +__sync_bool_compare_and_swap(&r->prod_64.tail, tail, tail + 1);
What memory order is required here? Why not use
__atomic_compare_exchange() with explicit memory order parameters?
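For illustration, the tail update could be written with the GCC/Clang __atomic built-in roughly as follows. This is a sketch only: toy_advance_tail() is a hypothetical helper, and release on success with relaxed on failure is my assumption about what the algorithm needs, not something the patch states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Try to move the tail from 'seen' to 'seen + 1'.
 * Returns true if this caller performed the update.
 */
static inline bool
toy_advance_tail(uint64_t *tail_p, uint64_t seen)
{
	return __atomic_compare_exchange_n(tail_p, &seen, seen + 1,
			false, /* strong CAS, no spurious failure */
			__ATOMIC_RELEASE,  /* order on success (assumed) */
			__ATOMIC_RELAXED); /* order on failure (assumed) */
}
```

Unlike __sync_bool_compare_and_swap(), which implies a full barrier, this form makes the intended ordering visible at the call site and lets weakly ordered architectures avoid unnecessary barriers.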

> +}
> +
> +end:
> +if (free_space != NULL)
> +*free_space = free_entries - n;
> +return n;
> +#endif
> +}
> +
> +/**
> + * @internal Enqueue several objects on the non-blocking ring
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> + * @param is_sp
> + *   Indicates whether to use single producer or multi-producer head update
> + * @param free_space
> + *   returns the amount of space after the enqueue operation has finished
> + * @return
> + *   Actual number of objects enqueued.
> + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_enqueue(struct rte_ring *r, void * const *obj_table,
> + unsigned int n, enum rte_ring_queue_behavior behavior,
> + unsigned int is_sp, unsigned int *free_space)
> +{
> +if (is_sp)
> +return __rte_ring_do_nb_enqueue_sp(r, obj_table, n,
> +   behavior, free_space);
> +else
> +return __rte_ring_do_nb_enqueue_mp(r, obj_table, n,
> +   behavior, free_space);
> +}
> +
> +/**
> + * @internal
> + *   Dequeue several objects from the non-blocking ring (single-consumer only)
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to pull from the ring.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
> + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
> + * @param available
> + *   returns the number of remaining ring entries after the dequeue has finished
> + * @return
> + *   - Actual number of objects dequeued.
> + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
> +    unsigned int n,
> +    enum rte_ring_queue_behavior behavior,
> +    unsigned int *available)
> +{
> +size_t head, next;
> +uint32_t entries;
> +
> +n = __rte_ring_move_cons_head_64(r, 1, n, behavior,
> + &head, &next, &entries);
> +if (n == 0)
> +goto end;
> +
> +DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> +
> +r->cons_64.tail += n;
Memory ordering? Consumer synchronises with producer.
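A sketch of what I mean, assuming the usual pairing of a store-release on the consumer tail with a load-acquire on the producer side. The toy_* names are hypothetical; the free-space formula follows the one used by the regular ring (capacity + cons_tail - prod_head).

```c
#include <assert.h>
#include <stdint.h>

struct toy_headtail {
	uint64_t head;
	uint64_t tail;
};

/* Single consumer: publish the new tail only after the entries have been
 * copied out; the release store orders those reads before the update.
 */
static inline void
toy_sc_update_tail(struct toy_headtail *cons, unsigned int n)
{
	uint64_t tail = __atomic_load_n(&cons->tail, __ATOMIC_RELAXED);

	__atomic_store_n(&cons->tail, tail + n, __ATOMIC_RELEASE);
}

/* Producer: the acquire load pairs with the release store above, so the
 * producer only reuses slots the consumer has finished reading.
 */
static inline uint64_t
toy_free_entries(struct toy_headtail *cons, uint64_t prod_head,
		 uint64_t capacity)
{
	uint64_t cons_tail = __atomic_load_n(&cons->tail, __ATOMIC_ACQUIRE);

	return capacity + cons_tail - prod_head;
}
```

On x86 both compile to plain loads and stores, but on weakly ordered architectures the plain "+=" in the patch gives no such guarantee.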

> +
> +end:
> +if (available != NULL)
> +*available = entries - n;
> +return n;
> +}
> +
> +/**
> + * @internal
> + *   Dequeue several objects from the non-blocking ring (multi-consumer safe)
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to pull from the ring.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
> + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
> + * @param available
> + *   returns the number of remaining ring entries after the dequeue has finished
> + * @return
> + *   - Actual number of objects dequeued.
> + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_dequeue_mc(struct rte_ring *r, void **obj_table,
> +    unsigned int n,
> +    enum rte_ring_queue_behavior behavior,
> +    unsigned int *available)
> +{
> +size_t head, next;
> +uint32_t entries;
> +
> +n = __rte_ring_move_cons_head_64(r, 0, n, behavior,
> + &head, &next, &entries);
> +if (n == 0)
> +goto end;
> +
> +while (1) {
> +size_t tail = r->cons_64.tail;
> +
> +/* Dequeue from the cons tail onwards. If multiple threads read
> + * the same pointers, the thread that successfully performs the
> + * CAS will keep them and the other(s) will retry.
> + */
> +DEQUEUE_PTRS_NB(r, &r[1], tail, obj_table, n);
> +
> +next = tail + n;
> +
> +/* Built-in used to handle variable-sized tail index. */
> +if (__sync_bool_compare_and_swap(&r->cons_64.tail, tail, next))
> +/* There is potential for the ABA problem here, but
> + * that is mitigated by the large (64-bit) tail.
> + */
> +break;
> +}
> +
> +end:
> +if (available != NULL)
> +*available = entries - n;
> +return n;
> +}
> +
> +/**
> + * @internal Dequeue several objects from the non-blocking ring
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to pull from the ring.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
> + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
> + * @param available
> + *   returns the number of remaining ring entries after the dequeue has finished
> + * @return
> + *   - Actual number of objects dequeued.
> + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_dequeue(struct rte_ring *r, void **obj_table,
> + unsigned int n, enum rte_ring_queue_behavior behavior,
> + unsigned int is_sc, unsigned int *available)
> +{
> +if (is_sc)
> +return __rte_ring_do_nb_dequeue_sc(r, obj_table, n,
> +   behavior, available);
> +else
> +return __rte_ring_do_nb_dequeue_mc(r, obj_table, n,
> +   behavior, available);
> +}
> +
>  /**
>   * @internal Enqueue several objects on the ring
>   *
> @@ -438,8 +853,14 @@ static __rte_always_inline unsigned int
>  rte_ring_mp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
>   unsigned int n, unsigned int *free_space)
>  {
> -return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -__IS_MP, free_space);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +RTE_RING_QUEUE_FIXED, __IS_MP,
> +free_space);
> +else
> +return __rte_ring_do_enqueue(r, obj_table, n,
> +     RTE_RING_QUEUE_FIXED, __IS_MP,
> +     free_space);
>  }
>
>  /**
> @@ -461,8 +882,14 @@ static __rte_always_inline unsigned int
>  rte_ring_sp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
>   unsigned int n, unsigned int *free_space)
>  {
> -return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -__IS_SP, free_space);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +RTE_RING_QUEUE_FIXED, __IS_SP,
> +free_space);
> +else
> +return __rte_ring_do_enqueue(r, obj_table, n,
> +     RTE_RING_QUEUE_FIXED, __IS_SP,
> +     free_space);
>  }
>
>  /**
> @@ -488,8 +915,14 @@ static __rte_always_inline unsigned int
>  rte_ring_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
>        unsigned int n, unsigned int *free_space)
>  {
> -return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -r->prod.single, free_space);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +RTE_RING_QUEUE_FIXED,
> +r->prod_64.single, free_space);
> +else
> +return __rte_ring_do_enqueue(r, obj_table, n,
> +     RTE_RING_QUEUE_FIXED,
> +     r->prod.single, free_space);
>  }
>
>  /**
> @@ -572,8 +1005,14 @@ static __rte_always_inline unsigned int
>  rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
>  unsigned int n, unsigned int *available)
>  {
> -return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -__IS_MC, available);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +RTE_RING_QUEUE_FIXED, __IS_MC,
> +available);
> +else
> +return __rte_ring_do_dequeue(r, obj_table, n,
> +     RTE_RING_QUEUE_FIXED, __IS_MC,
> +     available);
>  }
>
>  /**
> @@ -596,8 +1035,14 @@ static __rte_always_inline unsigned int
>  rte_ring_sc_dequeue_bulk(struct rte_ring *r, void **obj_table,
>  unsigned int n, unsigned int *available)
>  {
> -return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -__IS_SC, available);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +RTE_RING_QUEUE_FIXED, __IS_SC,
> +available);
> +else
> +return __rte_ring_do_dequeue(r, obj_table, n,
> +     RTE_RING_QUEUE_FIXED, __IS_SC,
> +     available);
>  }
>
>  /**
> @@ -623,8 +1068,14 @@ static __rte_always_inline unsigned int
>  rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
>  unsigned int *available)
>  {
> -return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -r->cons.single, available);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +RTE_RING_QUEUE_FIXED,
> +r->cons_64.single, available);
> +else
> +return __rte_ring_do_dequeue(r, obj_table, n,
> +     RTE_RING_QUEUE_FIXED,
> +     r->cons.single, available);
>  }
>
>  /**
> @@ -699,9 +1150,13 @@ rte_ring_dequeue(struct rte_ring *r, void **obj_p)
>  static inline unsigned
>  rte_ring_count(const struct rte_ring *r)
>  {
> -uint32_t prod_tail = r->prod.tail;
> -uint32_t cons_tail = r->cons.tail;
> -uint32_t count = (prod_tail - cons_tail) & r->mask;
> +uint32_t count;
> +
> +if (r->flags & RING_F_NB)
> +count = (r->prod_64.tail - r->cons_64.tail) & r->mask;
> +else
> +count = (r->prod.tail - r->cons.tail) & r->mask;
> +
>  return (count > r->capacity) ? r->capacity : count;
>  }
>
> @@ -821,8 +1276,14 @@ static __rte_always_inline unsigned
>  rte_ring_mp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
>   unsigned int n, unsigned int *free_space)
>  {
> -return __rte_ring_do_enqueue(r, obj_table, n,
> -RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +RTE_RING_QUEUE_VARIABLE,
> +__IS_MP, free_space);
> +else
> +return __rte_ring_do_enqueue(r, obj_table, n,
> +     RTE_RING_QUEUE_VARIABLE,
> +     __IS_MP, free_space);
>  }
>
>  /**
> @@ -844,8 +1305,14 @@ static __rte_always_inline unsigned
>  rte_ring_sp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
>   unsigned int n, unsigned int *free_space)
>  {
> -return __rte_ring_do_enqueue(r, obj_table, n,
> -RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +RTE_RING_QUEUE_VARIABLE,
> +__IS_SP, free_space);
> +else
> +return __rte_ring_do_enqueue(r, obj_table, n,
> +     RTE_RING_QUEUE_VARIABLE,
> +     __IS_SP, free_space);
>  }
>
>  /**
> @@ -871,8 +1338,14 @@ static __rte_always_inline unsigned
>  rte_ring_enqueue_burst(struct rte_ring *r, void * const *obj_table,
>        unsigned int n, unsigned int *free_space)
>  {
> -return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_VARIABLE,
> -r->prod.single, free_space);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +RTE_RING_QUEUE_VARIABLE,
> +r->prod_64.single, free_space);
> +else
> +return __rte_ring_do_enqueue(r, obj_table, n,
> +     RTE_RING_QUEUE_VARIABLE,
> +     r->prod.single, free_space);
>  }
>
>  /**
> @@ -899,8 +1372,14 @@ static __rte_always_inline unsigned
>  rte_ring_mc_dequeue_burst(struct rte_ring *r, void **obj_table,
>  unsigned int n, unsigned int *available)
>  {
> -return __rte_ring_do_dequeue(r, obj_table, n,
> -RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +RTE_RING_QUEUE_VARIABLE,
> +__IS_MC, available);
> +else
> +return __rte_ring_do_dequeue(r, obj_table, n,
> +     RTE_RING_QUEUE_VARIABLE,
> +     __IS_MC, available);
>  }
>
>  /**
> @@ -924,8 +1403,14 @@ static __rte_always_inline unsigned
>  rte_ring_sc_dequeue_burst(struct rte_ring *r, void **obj_table,
>  unsigned int n, unsigned int *available)
>  {
> -return __rte_ring_do_dequeue(r, obj_table, n,
> -RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +RTE_RING_QUEUE_VARIABLE,
> +__IS_SC, available);
> +else
> +return __rte_ring_do_dequeue(r, obj_table, n,
> +     RTE_RING_QUEUE_VARIABLE,
> +     __IS_SC, available);
>  }
>
>  /**
> @@ -951,9 +1436,14 @@ static __rte_always_inline unsigned
>  rte_ring_dequeue_burst(struct rte_ring *r, void **obj_table,
>  unsigned int n, unsigned int *available)
>  {
> -return __rte_ring_do_dequeue(r, obj_table, n,
> -RTE_RING_QUEUE_VARIABLE,
> -r->cons.single, available);
> +if (r->flags & RING_F_NB)
> +return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +RTE_RING_QUEUE_VARIABLE,
> +r->cons_64.single, available);
> +else
> +return __rte_ring_do_dequeue(r, obj_table, n,
> +     RTE_RING_QUEUE_VARIABLE,
> +     r->cons.single, available);
>  }
>
>  #ifdef __cplusplus
> diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
> index d935efd0d..8969467af 100644
> --- a/lib/librte_ring/rte_ring_version.map
> +++ b/lib/librte_ring/rte_ring_version.map
> @@ -17,3 +17,10 @@ DPDK_2.2 {
>  rte_ring_free;
>
>  } DPDK_2.0;
> +
> +DPDK_19.05 {
> +global:
> +
> +rte_ring_get_memsize;
> +
> +} DPDK_2.2;
IMPORTANT NOTICE: The contents of this email and any attachments are confidential and may also be privileged. If you are not the intended recipient, please notify the sender immediately and do not disclose the contents to any other person, use it for any purpose, or store or copy the information in any medium. Thank you.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-22  9:27     ` [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring Ola Liljedahl
@ 2019-01-22 10:15       ` Ola Liljedahl
  2019-01-22 19:15       ` Eads, Gage
  2019-01-23 16:02       ` Jerin Jacob Kollanukkaran
  2 siblings, 0 replies; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-22 10:15 UTC (permalink / raw)
  To: gage.eads, dev
  Cc: olivier.matz, stephen, bruce.richardson, arybchenko, konstantin.ananyev

Sorry about the confidential footer. I tried to remove it using some Exchange
magic, but it does not seem to work with Evolution. I'll try some other way.

-- Ola

On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> >
> > For some users, the rte ring's "non-preemptive" constraint is not
> > acceptable;
> > for example, if the application uses a mixture of pinned high-
> > priority threads
> > and multiplexed low-priority threads that share a mempool.
> >
> > This patchset introduces a non-blocking ring, on top of which a
> > mempool can run.
> > Crucially, the non-blocking algorithm relies on a 128-bit compare-
> > and-swap, so
> > it is currently limited to x86_64 machines. This is also an
> > experimental API,
> > so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.
> >
> > The ring uses more compare-and-swap atomic operations than the
> > regular rte ring:
> > With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> > operations and a
> > dequeue of n pointers uses 2. This algorithm has worse average-case
> > performance
> > than the regular rte ring (particularly a highly-contended ring with
> > large bulk
> > accesses), however:
> > - For applications with preemptible pthreads, the regular rte ring's
> > worst-case
> >   performance (i.e. one thread being preempted in the update_tail()
> > critical
> >   section) is much worse than the non-blocking ring's.
> > - Software caching can mitigate the average case performance for
> > ring-based
> >   algorithms. For example, a non-blocking ring based mempool (a
> > likely use case
> >   for this ring) with per-thread caching.
> >
> > The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-
> > of-use,
> > existing ring enqueue/dequeue functions work with both "regular" and
> > non-blocking rings.
> >
> > This patchset also adds non-blocking versions of ring_autotest and
> > ring_perf_autotest, and a non-blocking ring based mempool.
> >
> > This patchset makes one API change; a deprecation notice will be
> > posted in a
> > separate commit.
> >
> > This patchset depends on the non-blocking stack patchset[1].
> >
> > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> >
> > v3:
> >  - Avoid the ABI break by putting 64-bit head and tail values in the
> > same
> >    cacheline as struct rte_ring's prod and cons members.
> >  - Don't attempt to compile rte_atomic128_cmpset without
> >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > libraries.
> >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > someone tries
> >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> >  - Update the ring mempool to use experimental APIs
> >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> > ARMv8.1-A builds
> >    can eventually support it with the CASP instruction.
> ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> operation using LDXP/STXP.
>
> From an ARM perspective, I want all atomic operations to take memory
> ordering arguments (e.g. acquire, release). Not all usages of e.g.
> atomic compare exchange require sequential consistency (which I think
> is what the x86 cmpxchg instruction provides). DPDK functions should not
> be modelled after x86 behaviour.
>
> Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
> are available here:
> https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
>
> >
> >
> > v2:
> >  - Merge separate docs commit into patch #5
> >  - Convert uintptr_t to size_t
> >  - Add a compile-time check for the size of size_t
> >  - Fix a space-after-typecast issue
> >  - Fix an unnecessary-parentheses checkpatch warning
> >  - Bump librte_ring's library version
> >
> > Gage Eads (5):
> >   ring: add 64-bit headtail structure
> >   ring: add a non-blocking implementation
> >   test_ring: add non-blocking ring autotest
> >   test_ring_perf: add non-blocking ring perf test
> >   mempool/ring: add non-blocking ring handlers
> >
> >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
> >  drivers/mempool/ring/Makefile                   |   1 +
> >  drivers/mempool/ring/meson.build                |   2 +
> >  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
> >  lib/librte_eventdev/rte_event_ring.h            |   2 +-
> >  lib/librte_ring/Makefile                        |   3 +-
> >  lib/librte_ring/rte_ring.c                      |  72 ++-
> >  lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
> >  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
> >  lib/librte_ring/rte_ring_version.map            |   7 +
> >  test/test/test_ring.c                           |  57 ++-
> >  test/test/test_ring_perf.c                      |  19 +-
> >  12 files changed, 874 insertions(+), 75 deletions(-)
> >  create mode 100644 lib/librte_ring/rte_ring_generic_64.h
> >

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-18 15:23     ` [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation Gage Eads
  2019-01-22 10:12       ` Ola Liljedahl
@ 2019-01-22 14:49       ` Ola Liljedahl
  2019-01-22 21:31         ` Eads, Gage
  1 sibling, 1 reply; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-22 14:49 UTC (permalink / raw)
  To: gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

(resending without the confidential footer, think I figured it out, ignore the
previous email from me in this thread)

-- Ola

On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> This commit adds support for non-blocking circular ring enqueue and dequeue
> functions. The ring uses a 128-bit compare-and-swap instruction, and thus
> is currently limited to x86_64.
> 
> The algorithm is based on the original rte ring (derived from FreeBSD's
> bufring.h) and inspired by Michael and Scott's non-blocking concurrent
> queue. Importantly, it adds a modification counter to each ring entry to
> ensure only one thread can write to an unused entry.
> 
> -----
> Algorithm:
> 
> Multi-producer non-blocking enqueue:
> 1. Move the producer head index 'n' locations forward, effectively
>    reserving 'n' locations.
> 2. For each pointer:
>  a. Read the producer tail index, then ring[tail]. If ring[tail]'s
>     modification counter isn't 'tail', retry.
>  b. Construct the new entry: {pointer, tail + ring size}
>  c. Compare-and-swap the old entry with the new. If unsuccessful, the
>     next loop iteration will try to enqueue this pointer again.
>  d. Compare-and-swap the tail index with 'tail + 1', whether or not step 2c
>     succeeded. This guarantees threads can make forward progress.
> 
> Multi-consumer non-blocking dequeue:
> 1. Move the consumer head index 'n' locations forward, effectively
>    reserving 'n' pointers to be dequeued.
> 2. Copy 'n' pointers into the caller's object table (ignoring the
>    modification counter), starting from ring[tail], then compare-and-swap
>    the tail index with 'tail + n'.  If unsuccessful, repeat step 2.
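Step 2 of the dequeue can be sketched in isolation as follows. This is only an illustration: toy_cons_ring, TOY_SIZE and toy_dequeue_mc() are hypothetical names, the head reservation of step 1 is assumed to have happened already, entries are plain pointers without the modification counter, and the memory orders are my assumption rather than something the patch specifies.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SIZE 8u
#define TOY_MASK (TOY_SIZE - 1)

/* Simplified consumer view of the ring. */
struct toy_cons_ring {
	uint64_t cons_tail;
	void *slot[TOY_SIZE];
};

/* Copy n pointers starting at the current tail, then try to claim them by
 * moving the tail forward. If another consumer claimed them first, the CAS
 * fails and the copy is repeated from the new tail.
 */
static void
toy_dequeue_mc(struct toy_cons_ring *r, void **obj_table, unsigned int n)
{
	for (;;) {
		uint64_t tail = __atomic_load_n(&r->cons_tail,
						__ATOMIC_ACQUIRE);
		unsigned int i;

		for (i = 0; i < n; i++)
			obj_table[i] = __atomic_load_n(
					&r->slot[(tail + i) & TOY_MASK],
					__ATOMIC_RELAXED);

		if (__atomic_compare_exchange_n(&r->cons_tail, &tail,
						tail + n, false,
						__ATOMIC_ACQ_REL,
						__ATOMIC_RELAXED))
			return;
	}
}
```

Note that the copy is speculative: only the thread whose CAS succeeds keeps what it copied.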
> 
> -----
> Discussion:
> 
> There are two cases where the ABA problem is mitigated:
> 1. Enqueueing a pointer to the ring: without a modification counter
>    tied to the tail index, the index could become stale by the time the
>    enqueue happens, causing it to overwrite valid data. Tying the
>    counter to the tail index gives us an expected value (as opposed to,
>    say, a monotonically incrementing counter).
> 
>    Since the counter will eventually wrap, there is potential for the ABA
>    problem. However, using a 64-bit counter makes this likelihood
>    effectively zero.
> 
> 2. Updating a tail index: the ABA problem can occur if the thread is
>    preempted and the tail index wraps around. However, using 64-bit indexes
>    makes this likelihood effectively zero.
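To put a rough number on "effectively zero": for the ABA case to occur, a thread would have to stay preempted while the 64-bit index advances through all 2^64 values. The sketch below estimates how long that takes for a given update rate. The rate of 1e9 increments per second used in the note after the code is an assumed figure for illustration, not a measurement from the patch, and toy_years_to_wrap64() is a hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

/* Years needed for a 64-bit counter to wrap at a constant increment rate. */
static double
toy_years_to_wrap64(double increments_per_sec)
{
	const double seconds_per_year = 365.25 * 24.0 * 3600.0;

	return (double)UINT64_MAX / increments_per_sec / seconds_per_year;
}
```

At an assumed 1e9 increments per second this comes out at roughly 584 years, so a wrap within one preemption window is not a practical concern.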
> 
> With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations
> and a dequeue of n pointers uses 2. This algorithm has worse average-case
> performance than the regular rte ring (particularly a highly-contended ring
> with large bulk accesses), however:
> - For applications with preemptible pthreads, the regular rte ring's
>   worst-case performance (i.e. one thread being preempted in the
>   update_tail() critical section) is much worse than the non-blocking
>   ring's.
> - Software caching can mitigate the average case performance for ring-based
>   algorithms. For example, a non-blocking ring based mempool (a likely use
>   case for this ring) with per-thread caching.
> 
> The non-blocking ring is enabled via a new flag, RING_F_NB. Because the
> ring's memsize is now a function of its flags (the non-blocking ring
> requires 128b for each entry), this commit adds a new argument ('flags') to
> rte_ring_get_memsize(). An API deprecation notice will be sent in a
> separate commit.
> 
> For ease-of-use, existing ring enqueue and dequeue functions work on both
> regular and non-blocking rings. This introduces an additional branch in
> the datapath, but this should be a highly predictable branch.
> ring_perf_autotest shows a negligible performance impact; it's hard to
> distinguish a real difference versus system noise.
> 
>                                   | ring_perf_autotest cycles with branch -
>              Test                 |   ring_perf_autotest cycles without
> ------------------------------------------------------------------
> SP/SC single enq/dequeue          | 0.33
> MP/MC single enq/dequeue          | -4.00
> SP/SC burst enq/dequeue (size 8)  | 0.00
> MP/MC burst enq/dequeue (size 8)  | 0.00
> SP/SC burst enq/dequeue (size 32) | 0.00
> MP/MC burst enq/dequeue (size 32) | 0.00
> SC empty dequeue                  | 1.00
> MC empty dequeue                  | 0.00
> 
> Single lcore:
> SP/SC bulk enq/dequeue (size 8)   | 0.49
> MP/MC bulk enq/dequeue (size 8)   | 0.08
> SP/SC bulk enq/dequeue (size 32)  | 0.07
> MP/MC bulk enq/dequeue (size 32)  | 0.09
> 
> Two physical cores:
> SP/SC bulk enq/dequeue (size 8)   | 0.19
> MP/MC bulk enq/dequeue (size 8)   | -0.37
> SP/SC bulk enq/dequeue (size 32)  | 0.09
> MP/MC bulk enq/dequeue (size 32)  | -0.05
> 
> Two NUMA nodes:
> SP/SC bulk enq/dequeue (size 8)   | -1.96
> MP/MC bulk enq/dequeue (size 8)   | 0.88
> SP/SC bulk enq/dequeue (size 32)  | 0.10
> MP/MC bulk enq/dequeue (size 32)  | 0.46
> 
> Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
> running on isolcpus cores with a tickless scheduler. Each test run three
> times and the results averaged.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  lib/librte_ring/rte_ring.c           |  72 ++++-
>  lib/librte_ring/rte_ring.h           | 550 +++++++++++++++++++++++++++++++++--
>  lib/librte_ring/rte_ring_version.map |   7 +
>  3 files changed, 587 insertions(+), 42 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> index d215acecc..f3378dccd 100644
> --- a/lib/librte_ring/rte_ring.c
> +++ b/lib/librte_ring/rte_ring.c
> @@ -45,9 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
>  
>  /* return the size of memory occupied by a ring */
>  ssize_t
> -rte_ring_get_memsize(unsigned count)
> +rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags)
>  {
> -	ssize_t sz;
> +	ssize_t sz, elt_sz;
>  
>  	/* count must be a power of 2 */
>  	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
> @@ -57,10 +57,23 @@ rte_ring_get_memsize(unsigned count)
>  		return -EINVAL;
>  	}
>  
> -	sz = sizeof(struct rte_ring) + count * sizeof(void *);
> +	elt_sz = (flags & RING_F_NB) ? 2 * sizeof(void *) : sizeof(void *);
> +
> +	sz = sizeof(struct rte_ring) + count * elt_sz;
>  	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
>  	return sz;
>  }
> +BIND_DEFAULT_SYMBOL(rte_ring_get_memsize, _v1905, 19.05);
> +MAP_STATIC_SYMBOL(ssize_t rte_ring_get_memsize(unsigned int count,
> +					       unsigned int flags),
> +		  rte_ring_get_memsize_v1905);
> +
> +ssize_t
> +rte_ring_get_memsize_v20(unsigned int count)
> +{
> +	return rte_ring_get_memsize_v1905(count, 0);
> +}
> +VERSION_SYMBOL(rte_ring_get_memsize, _v20, 2.0);
>  
>  int
>  rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> @@ -82,8 +95,6 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
>  	if (ret < 0 || ret >= (int)sizeof(r->name))
>  		return -ENAMETOOLONG;
>  	r->flags = flags;
> -	r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
> -	r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
>  
>  	if (flags & RING_F_EXACT_SZ) {
>  		r->size = rte_align32pow2(count + 1);
> @@ -100,8 +111,30 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
>  		r->mask = count - 1;
>  		r->capacity = r->mask;
>  	}
> -	r->prod.head = r->cons.head = 0;
> -	r->prod.tail = r->cons.tail = 0;
> +
> +	if (flags & RING_F_NB) {
> +		uint64_t i;
> +
> +		r->prod_64.single = (flags & RING_F_SP_ENQ) ? __IS_SP :
> __IS_MP;
> +		r->cons_64.single = (flags & RING_F_SC_DEQ) ? __IS_SC :
> __IS_MC;
> +		r->prod_64.head = r->cons_64.head = 0;
> +		r->prod_64.tail = r->cons_64.tail = 0;
> +
> +		for (i = 0; i < r->size; i++) {
> +			struct nb_ring_entry *ring_ptr, *base;
> +
> +			base = ((struct nb_ring_entry *)&r[1]);
> +
> +			ring_ptr = &base[i & r->mask];
> +
> +			ring_ptr->cnt = i;
> +		}
> +	} else {
> +		r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
> +		r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
> +		r->prod.head = r->cons.head = 0;
> +		r->prod.tail = r->cons.tail = 0;
> +	}
>  
>  	return 0;
>  }
> @@ -123,11 +156,19 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
>  
>  	ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
>  
> +#if !defined(RTE_ARCH_X86_64)
> +	if (flags & RING_F_NB) {
> +		printf("RING_F_NB is only supported on x86-64 platforms\n");
> +		rte_errno = EINVAL;
> +		return NULL;
> +	}
> +#endif
> +
>  	/* for an exact size ring, round up from count to a power of two */
>  	if (flags & RING_F_EXACT_SZ)
>  		count = rte_align32pow2(count + 1);
>  
> -	ring_size = rte_ring_get_memsize(count);
> +	ring_size = rte_ring_get_memsize(count, flags);
>  	if (ring_size < 0) {
>  		rte_errno = ring_size;
>  		return NULL;
> @@ -227,10 +268,17 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
>  	fprintf(f, "  flags=%x\n", r->flags);
>  	fprintf(f, "  size=%"PRIu32"\n", r->size);
>  	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
> -	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
> -	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
> -	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
> -	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
> +	if (r->flags & RING_F_NB) {
> +		fprintf(f, "  ct=%"PRIu64"\n", r->cons_64.tail);
> +		fprintf(f, "  ch=%"PRIu64"\n", r->cons_64.head);
> +		fprintf(f, "  pt=%"PRIu64"\n", r->prod_64.tail);
> +		fprintf(f, "  ph=%"PRIu64"\n", r->prod_64.head);
> +	} else {
> +		fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
> +		fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
> +		fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
> +		fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
> +	}
>  	fprintf(f, "  used=%u\n", rte_ring_count(r));
>  	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
>  }
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index b270a4746..08c9de6a6 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -134,6 +134,18 @@ struct rte_ring {
>   */
>  #define RING_F_EXACT_SZ 0x0004
>  #define RTE_RING_SZ_MASK  (0x7fffffffU) /**< Ring size mask */
> +/**
> + * The ring uses non-blocking enqueue and dequeue functions. These functions
> + * do not have the "non-preemptive" constraint of a regular rte ring, and thus
> + * are suited for applications using preemptible pthreads. However, the
> + * non-blocking functions have worse average-case performance than their
> + * regular rte ring counterparts. When used as the handler for a mempool,
> + * per-thread caching can mitigate the performance difference by reducing the
> + * number (and contention) of ring accesses.
> + *
> + * This flag is only supported on x86_64 platforms.
> + */
> +#define RING_F_NB 0x0008
>  
>  /* @internal defines for passing to the enqueue dequeue worker functions */
>  #define __IS_SP 1
> @@ -151,11 +163,15 @@ struct rte_ring {
>   *
>   * @param count
>   *   The number of elements in the ring (must be a power of 2).
> + * @param flags
> + *   The flags the ring will be created with.
>   * @return
>   *   - The memory size needed for the ring on success.
>   *   - -EINVAL if count is not a power of 2.
>   */
> -ssize_t rte_ring_get_memsize(unsigned count);
> +ssize_t rte_ring_get_memsize(unsigned int count, unsigned int flags);
> +ssize_t rte_ring_get_memsize_v20(unsigned int count);
> +ssize_t rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags);
>  
>  /**
>   * Initialize a ring structure.
> @@ -188,6 +204,10 @@ ssize_t rte_ring_get_memsize(unsigned count);
>   *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
>   *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
>   *      is "single-consumer". Otherwise, it is "multi-consumers".
> + *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
> + *      number, but up to half the ring space may be wasted.
> + *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
> + *      non-blocking variants of the dequeue and enqueue functions.
>   * @return
>   *   0 on success, or a negative value on error.
>   */
> @@ -223,12 +243,17 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
>   *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
>   *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
>   *      is "single-consumer". Otherwise, it is "multi-consumers".
> + *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
> + *      number, but up to half the ring space may be wasted.
> + *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
> + *      non-blocking variants of the dequeue and enqueue functions.
>   * @return
>   *   On success, the pointer to the new allocated ring. NULL on error with
>   *    rte_errno set appropriately. Possible errno values include:
>   *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
>   *    - E_RTE_SECONDARY - function was called from a secondary process instance
> - *    - EINVAL - count provided is not a power of 2
> + *    - EINVAL - count provided is not a power of 2, or RING_F_NB is used on an
> + *      unsupported platform
>   *    - ENOSPC - the maximum number of memzones has already been allocated
>   *    - EEXIST - a memzone with the same name already exists
>   *    - ENOMEM - no appropriate memory area found in which to create memzone
> @@ -284,6 +309,50 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
>  	} \
>  } while (0)
>  
> +/* The actual enqueue of pointers on the ring.
> + * Used only by the single-producer non-blocking enqueue function, but
> + * out-lined here for code readability.
> + */
> +#define ENQUEUE_PTRS_NB(r, ring_start, prod_head, obj_table, n) do { \
> +	unsigned int i; \
> +	const uint32_t size = (r)->size; \
> +	size_t idx = prod_head & (r)->mask; \
> +	size_t new_cnt = prod_head + size; \
> +	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
> +	if (likely(idx + n < size)) { \
> +		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
> +			ring[idx].ptr = obj_table[i]; \
> +			ring[idx].cnt = new_cnt + i;  \
> +			ring[idx + 1].ptr = obj_table[i + 1]; \
> +			ring[idx + 1].cnt = new_cnt + i + 1;  \
> +			ring[idx + 2].ptr = obj_table[i + 2]; \
> +			ring[idx + 2].cnt = new_cnt + i + 2;  \
> +			ring[idx + 3].ptr = obj_table[i + 3]; \
> +			ring[idx + 3].cnt = new_cnt + i + 3;  \
> +		} \
> +		switch (n & 0x3) { \
> +		case 3: \
> +			ring[idx].cnt = new_cnt + i; \
> +			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
> +		case 2: \
> +			ring[idx].cnt = new_cnt + i; \
> +			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
> +		case 1: \
> +			ring[idx].cnt = new_cnt + i; \
> +			ring[idx++].ptr = obj_table[i++]; \
> +		} \
> +	} else { \
> +		for (i = 0; idx < size; i++, idx++) { \
> +			ring[idx].cnt = new_cnt + i;  \
> +			ring[idx].ptr = obj_table[i]; \
> +		} \
> +		for (idx = 0; i < n; i++, idx++) {    \
> +			ring[idx].cnt = new_cnt + i;  \
> +			ring[idx].ptr = obj_table[i]; \
> +		} \
> +	} \
> +} while (0)
> +
>  /* the actual copy of pointers on the ring to obj_table.
>   * Placed here since identical code needed in both
>   * single and multi consumer dequeue functions */
> @@ -315,6 +384,39 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
>  	} \
>  } while (0)
>  
> +/* The actual copy of pointers on the ring to obj_table.
> + * Placed here since identical code needed in both
> + * single and multi consumer non-blocking dequeue functions.
> + */
> +#define DEQUEUE_PTRS_NB(r, ring_start, cons_head, obj_table, n) do { \
> +	unsigned int i; \
> +	size_t idx = cons_head & (r)->mask; \
> +	const uint32_t size = (r)->size; \
> +	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
> +	if (likely(idx + n < size)) { \
> +		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
> +			obj_table[i] = ring[idx].ptr; \
> +			obj_table[i + 1] = ring[idx + 1].ptr; \
> +			obj_table[i + 2] = ring[idx + 2].ptr; \
> +			obj_table[i + 3] = ring[idx + 3].ptr; \
> +		} \
> +		switch (n & 0x3) { \
> +		case 3: \
> +			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
> +		case 2: \
> +			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
> +		case 1: \
> +			obj_table[i++] = ring[idx++].ptr; \
> +		} \
> +	} else { \
> +		for (i = 0; idx < size; i++, idx++) \
> +			obj_table[i] = ring[idx].ptr; \
> +		for (idx = 0; i < n; i++, idx++) \
> +			obj_table[i] = ring[idx].ptr; \
> +	} \
> +} while (0)
> +
> +
>  /* Between load and load. there might be cpu reorder in weak model
>   * (powerpc/arm).
>   * There are 2 choices for the users
> @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
>  #endif
>  #include "rte_ring_generic_64.h"
>  
> +/* @internal 128-bit structure used by the non-blocking ring */
> +struct nb_ring_entry {
> +	void *ptr; /**< Data pointer */
> +	uint64_t cnt; /**< Modification counter */
Why not make 'cnt' a uintptr_t? That way 32-bit architectures could also be
supported. I believe DPDK still claims to support e.g. ARMv7-A and possibly
also 32-bit x86.
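
A minimal sketch of this suggestion, assuming the entry must stay exactly two
pointers wide so that a double-width compare-and-swap can cover it. The names
nb_ring_entry_ptrwidth and nb_ring_entry_bits are hypothetical and not part of
the patch:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical variant of nb_ring_entry with a pointer-width counter. */
struct nb_ring_entry_ptrwidth {
	void *ptr;     /* Data pointer */
	uintptr_t cnt; /* Modification counter, same width as a pointer */
};

/* The entry must be two pointers wide with no padding: 64 bits on a
 * 32-bit target, 128 bits on a 64-bit target.
 */
static_assert(sizeof(struct nb_ring_entry_ptrwidth) == 2 * sizeof(void *),
	      "entry must be exactly two pointers wide");

/* Entry width in bits on the current target. */
static inline size_t
nb_ring_entry_bits(void)
{
	return sizeof(struct nb_ring_entry_ptrwidth) * 8;
}
```

The trade-off is that a 32-bit counter wraps much sooner than a 64-bit one, so
the ABA protection would be weaker on 32-bit targets.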

> +};
> +
> +/* The non-blocking ring algorithm is based on the original rte ring (derived
> + * from FreeBSD's bufring.h) and inspired by Michael and Scott's non-blocking
> + * concurrent queue.
> + */
> +
> +/**
> + * @internal
> + *   Enqueue several objects on the non-blocking ring (single-producer only)
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> + * @param free_space
> + *   returns the amount of space after the enqueue operation has finished
> + * @return
> + *   Actual number of objects enqueued.
> + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> +			    unsigned int n,
> +			    enum rte_ring_queue_behavior behavior,
> +			    unsigned int *free_space)
> +{
> +	uint32_t free_entries;
> +	size_t head, next;
> +
> +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> +					 &head, &next, &free_entries);
> +	if (n == 0)
> +		goto end;
> +
> +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> +
> +	r->prod_64.tail += n;
Don't we need release ordering (or an smp_wmb) between writing the ring
pointers and updating the tail? By updating the tail, we are synchronising
with a consumer.

I prefer using __atomic operations even for plain loads and stores. They make
it visible which parts of the code synchronise with each other, e.g. a
store-release to some location synchronises with a load-acquire from the same
location. If you don't know how the different threads synchronise with each
other, you are very likely to make mistakes.
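
To illustrate the pairing, here is a minimal single-producer sketch using the
GCC/Clang __atomic built-ins. The toy_* names and TOY_RING_SIZE are
hypothetical, and free-space checking is deliberately left out; this is not
the patch's code:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u /* hypothetical size, must be a power of 2 */

struct toy_spsc_ring {
	uint64_t prod_tail;         /* written by producer, read by consumer */
	void *slots[TOY_RING_SIZE]; /* ring storage */
};

/* Producer side: fill the slots first, then publish them with a
 * store-release on the tail.
 */
static inline void
toy_publish(struct toy_spsc_ring *r, void * const *objs, unsigned int n)
{
	uint64_t tail = __atomic_load_n(&r->prod_tail, __ATOMIC_RELAXED);
	unsigned int i;

	assert(n <= TOY_RING_SIZE);

	for (i = 0; i < n; i++)
		r->slots[(tail + i) & (TOY_RING_SIZE - 1)] = objs[i];

	/* Release: the slot writes above cannot be reordered after this
	 * store, so a consumer that observes the new tail also observes
	 * the slot contents.
	 */
	__atomic_store_n(&r->prod_tail, tail + n, __ATOMIC_RELEASE);
}

/* Consumer side: the load-acquire pairs with the store-release in
 * toy_publish().
 */
static inline uint64_t
toy_visible_tail(struct toy_spsc_ring *r)
{
	return __atomic_load_n(&r->prod_tail, __ATOMIC_ACQUIRE);
}
```

The release/acquire pair on prod_tail is the only synchronisation point, which
is what makes the producer/consumer relationship easy to see in the code.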

> +
> +end:
> +	if (free_space != NULL)
> +		*free_space = free_entries - n;
> +	return n;
> +}
> +
> +/**
> + * @internal
> + *   Enqueue several objects on the non-blocking ring (multi-producer safe)
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> + * @param free_space
> + *   returns the amount of space after the enqueue operation has finished
> + * @return
> + *   Actual number of objects enqueued.
> + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
> +			    unsigned int n,
> +			    enum rte_ring_queue_behavior behavior,
> +			    unsigned int *free_space)
> +{
> +#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
> +	RTE_SET_USED(r);
> +	RTE_SET_USED(obj_table);
> +	RTE_SET_USED(n);
> +	RTE_SET_USED(behavior);
> +	RTE_SET_USED(free_space);
> +#ifndef ALLOW_EXPERIMENTAL_API
> +	printf("[%s()] RING_F_NB requires an experimental API."
> +	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
> +	       , __func__);
> +#endif
> +	return 0;
> +#endif
> +#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
> +	size_t head, next, tail;
> +	uint32_t free_entries;
> +	unsigned int i;
> +
> +	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
> +					 &head, &next, &free_entries);
> +	if (n == 0)
> +		goto end;
> +
> +	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
> +		struct nb_ring_entry old_value, new_value;
> +		struct nb_ring_entry *ring_ptr;
> +
> +		/* Enqueue to the tail entry. If another thread wins the race,
> +		 * retry with the new tail.
> +		 */
> +		tail = r->prod_64.tail;
> +
> +		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r->mask];
This is an ugly expression and cast. Also I think it is unnecessary. What's
preventing this from being written without a cast? Perhaps the ring array needs
to be a union of "void *" and struct nb_ring_entry?

> +
> +		old_value = *ring_ptr;
> +
> +		/* If the tail entry's modification counter doesn't match the
> +		 * producer tail index, it's already been updated.
> +		 */
> +		if (old_value.cnt != tail)
> +			continue;
Continue restarts the loop at the condition test in the for statement; 'i'
and 'n' are unchanged. Then we re-read 'prod_64.tail' and
'ring[tail & mask]'. If some other thread never updates 'prod_64.tail', the
test here (ring[tail].cnt != tail) will still be true and we will spin
forever.

Waiting for other threads <=> blocking behaviour, so this is not a
non-blocking design.
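
One possible way to avoid the spin is for the observing thread to advance the
tail itself before retrying ("helping"). The following is only a sketch under
that assumption; the toy_* names are hypothetical, and unlike the patch it
reads just the counter rather than the whole 128-bit entry:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SIZE 4u /* hypothetical ring size */

static_assert((TOY_SIZE & (TOY_SIZE - 1)) == 0,
	      "TOY_SIZE must be a power of 2");

struct toy_slot {
	void *ptr;    /* Data pointer */
	uint64_t cnt; /* Equals the tail index at which the slot is free */
};

struct toy_nb_ring {
	uint64_t tail;
	struct toy_slot slots[TOY_SIZE];
};

static inline void
toy_init(struct toy_nb_ring *r)
{
	unsigned int i;

	r->tail = 0;
	for (i = 0; i < TOY_SIZE; i++) {
		r->slots[i].ptr = NULL;
		r->slots[i].cnt = i;
	}
}

/* If the slot at the current tail was already written by another thread
 * (cnt != tail), try to advance the tail on that thread's behalf instead
 * of waiting for it. Returns true if this call moved the tail.
 */
static inline bool
toy_help_advance_tail(struct toy_nb_ring *r)
{
	uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
	uint64_t cnt = __atomic_load_n(&r->slots[tail & (TOY_SIZE - 1)].cnt,
				       __ATOMIC_ACQUIRE);

	if (cnt == tail)
		return false; /* slot not yet written: nothing to help with */

	/* A failed CAS means someone else already moved the tail. */
	return __atomic_compare_exchange_n(&r->tail, &tail, tail + 1, false,
					   __ATOMIC_RELEASE, __ATOMIC_RELAXED);
}
```

With a step like this in place of the bare 'continue', a thread that finds an
already-written entry makes progress even if the writer was preempted before
updating the tail.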

> +
> +		/* Prepare the new entry. The cnt field mitigates the ABA
> +		 * problem on the ring write.
> +		 */
> +		new_value.ptr = obj_table[i];
> +		new_value.cnt = tail + r->size;
> +
> +		if (rte_atomic128_cmpset((volatile rte_int128_t *)ring_ptr,
> +					 (rte_int128_t *)&old_value,
> +					 (rte_int128_t *)&new_value))
> +			i++;
> +
> +		/* Every thread attempts the cmpset, so they don't have to wait
> +		 * for the thread that successfully enqueued to the ring.
> +		 * Using a 64-bit tail mitigates the ABA problem here.
> +		 *
> +		 * Built-in used to handle variable-sized tail index.
> +		 */
But prod_64.tail is 64 bits so not really variable size?

> +		__sync_bool_compare_and_swap(&r->prod_64.tail, tail, tail + 1);
What memory order is required here? Why not use
__atomic_compare_exchange() with explicit memory order parameters?
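
For comparison, a sketch of the same tail update with explicit memory orders.
The helper name toy_tail_advance is hypothetical, and release-on-success is
only an assumption here; the ordering actually required depends on what the
tail update is meant to publish:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Try to advance *tail from 'expected' to 'expected + 1'.
 * Returns true if this call performed the update.
 */
static inline bool
toy_tail_advance(uint64_t *tail, uint64_t expected)
{
	assert(tail != NULL);

	return __atomic_compare_exchange_n(tail, &expected, expected + 1,
					   false,             /* strong CAS */
					   __ATOMIC_RELEASE,  /* on success */
					   __ATOMIC_RELAXED); /* on failure */
}
```

Unlike __sync_bool_compare_and_swap(), which always implies a full barrier,
the two order arguments state the intended ordering at the call site.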

> +	}
> +
> +end:
> +	if (free_space != NULL)
> +		*free_space = free_entries - n;
> +	return n;
> +#endif
> +}
> +
> +/**
> + * @internal Enqueue several objects on the non-blocking ring
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> + * @param is_sp
> + *   Indicates whether to use single producer or multi-producer head update
> + * @param free_space
> + *   returns the amount of space after the enqueue operation has finished
> + * @return
> + *   Actual number of objects enqueued.
> + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_enqueue(struct rte_ring *r, void * const *obj_table,
> +			 unsigned int n, enum rte_ring_queue_behavior behavior,
> +			 unsigned int is_sp, unsigned int *free_space)
> +{
> +	if (is_sp)
> +		return __rte_ring_do_nb_enqueue_sp(r, obj_table, n,
> +						   behavior, free_space);
> +	else
> +		return __rte_ring_do_nb_enqueue_mp(r, obj_table, n,
> +						   behavior, free_space);
> +}
> +
> +/**
> + * @internal
> + *   Dequeue several objects from the non-blocking ring (single-consumer only)
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to pull from the ring.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
> + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
> + * @param available
> + *   returns the number of remaining ring entries after the dequeue has finished
> + * @return
> + *   - Actual number of objects dequeued.
> + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
> +			    unsigned int n,
> +			    enum rte_ring_queue_behavior behavior,
> +			    unsigned int *available)
> +{
> +	size_t head, next;
> +	uint32_t entries;
> +
> +	n = __rte_ring_move_cons_head_64(r, 1, n, behavior,
> +					 &head, &next, &entries);
> +	if (n == 0)
> +		goto end;
> +
> +	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> +
> +	r->cons_64.tail += n;
Memory ordering? Consumer synchronises with producer.

> +
> +end:
> +	if (available != NULL)
> +		*available = entries - n;
> +	return n;
> +}
> +
> +/**
> + * @internal
> + *   Dequeue several objects from the non-blocking ring (multi-consumer safe)
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to pull from the ring.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
> + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
> + * @param available
> + *   returns the number of remaining ring entries after the dequeue has finished
> + * @return
> + *   - Actual number of objects dequeued.
> + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_dequeue_mc(struct rte_ring *r, void **obj_table,
> +			    unsigned int n,
> +			    enum rte_ring_queue_behavior behavior,
> +			    unsigned int *available)
> +{
> +	size_t head, next;
> +	uint32_t entries;
> +
> +	n = __rte_ring_move_cons_head_64(r, 0, n, behavior,
> +					 &head, &next, &entries);
> +	if (n == 0)
> +		goto end;
> +
> +	while (1) {
> +		size_t tail = r->cons_64.tail;
> +
> +		/* Dequeue from the cons tail onwards. If multiple threads read
> +		 * the same pointers, the thread that successfully performs the
> +		 * CAS will keep them and the other(s) will retry.
> +		 */
> +		DEQUEUE_PTRS_NB(r, &r[1], tail, obj_table, n);
> +
> +		next = tail + n;
> +
> +		/* Built-in used to handle variable-sized tail index. */
> +		if (__sync_bool_compare_and_swap(&r->cons_64.tail, tail, next))
> +			/* There is potential for the ABA problem here, but
> +			 * that is mitigated by the large (64-bit) tail.
> +			 */
> +			break;
> +	}
> +
> +end:
> +	if (available != NULL)
> +		*available = entries - n;
> +	return n;
> +}
> +
> +/**
> + * @internal Dequeue several objects from the non-blocking ring
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param n
> + *   The number of objects to pull from the ring.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
> + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
> + * @param available
> + *   returns the number of remaining ring entries after the dequeue has finished
> + * @return
> + *   - Actual number of objects dequeued.
> + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_nb_dequeue(struct rte_ring *r, void **obj_table,
> +		 unsigned int n, enum rte_ring_queue_behavior behavior,
> +		 unsigned int is_sc, unsigned int *available)
> +{
> +	if (is_sc)
> +		return __rte_ring_do_nb_dequeue_sc(r, obj_table, n,
> +						   behavior, available);
> +	else
> +		return __rte_ring_do_nb_dequeue_mc(r, obj_table, n,
> +						   behavior, available);
> +}
> +
>  /**
>   * @internal Enqueue several objects on the ring
>   *
> @@ -438,8 +853,14 @@ static __rte_always_inline unsigned int
>  rte_ring_mp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
>  			 unsigned int n, unsigned int *free_space)
>  {
> -	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -			__IS_MP, free_space);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +						RTE_RING_QUEUE_FIXED, __IS_MP,
> +						free_space);
> +	else
> +		return __rte_ring_do_enqueue(r, obj_table, n,
> +					     RTE_RING_QUEUE_FIXED, __IS_MP,
> +					     free_space);
>  }
>  
>  /**
> @@ -461,8 +882,14 @@ static __rte_always_inline unsigned int
>  rte_ring_sp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
>  			 unsigned int n, unsigned int *free_space)
>  {
> -	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -			__IS_SP, free_space);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +						RTE_RING_QUEUE_FIXED, __IS_SP,
> +						free_space);
> +	else
> +		return __rte_ring_do_enqueue(r, obj_table, n,
> +					     RTE_RING_QUEUE_FIXED, __IS_SP,
> +					     free_space);
>  }
>  
>  /**
> @@ -488,8 +915,14 @@ static __rte_always_inline unsigned int
>  rte_ring_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
>  		      unsigned int n, unsigned int *free_space)
>  {
> -	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -			r->prod.single, free_space);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +						RTE_RING_QUEUE_FIXED,
> +						r->prod_64.single, free_space);
> +	else
> +		return __rte_ring_do_enqueue(r, obj_table, n,
> +					     RTE_RING_QUEUE_FIXED,
> +					     r->prod.single, free_space);
>  }
>  
>  /**
> @@ -572,8 +1005,14 @@ static __rte_always_inline unsigned int
>  rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
>  		unsigned int n, unsigned int *available)
>  {
> -	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -			__IS_MC, available);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +						RTE_RING_QUEUE_FIXED, __IS_MC,
> +						available);
> +	else
> +		return __rte_ring_do_dequeue(r, obj_table, n,
> +					     RTE_RING_QUEUE_FIXED, __IS_MC,
> +					     available);
>  }
>  
>  /**
> @@ -596,8 +1035,14 @@ static __rte_always_inline unsigned int
>  rte_ring_sc_dequeue_bulk(struct rte_ring *r, void **obj_table,
>  		unsigned int n, unsigned int *available)
>  {
> -	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -			__IS_SC, available);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +						RTE_RING_QUEUE_FIXED, __IS_SC,
> +						available);
> +	else
> +		return __rte_ring_do_dequeue(r, obj_table, n,
> +					     RTE_RING_QUEUE_FIXED, __IS_SC,
> +					     available);
>  }
>  
>  /**
> @@ -623,8 +1068,14 @@ static __rte_always_inline unsigned int
>  rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
>  		unsigned int *available)
>  {
> -	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> -				r->cons.single, available);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +						RTE_RING_QUEUE_FIXED,
> +						r->cons_64.single, available);
> +	else
> +		return __rte_ring_do_dequeue(r, obj_table, n,
> +					     RTE_RING_QUEUE_FIXED,
> +					     r->cons.single, available);
>  }
>  
>  /**
> @@ -699,9 +1150,13 @@ rte_ring_dequeue(struct rte_ring *r, void **obj_p)
>  static inline unsigned
>  rte_ring_count(const struct rte_ring *r)
>  {
> -	uint32_t prod_tail = r->prod.tail;
> -	uint32_t cons_tail = r->cons.tail;
> -	uint32_t count = (prod_tail - cons_tail) & r->mask;
> +	uint32_t count;
> +
> +	if (r->flags & RING_F_NB)
> +		count = (r->prod_64.tail - r->cons_64.tail) & r->mask;
> +	else
> +		count = (r->prod.tail - r->cons.tail) & r->mask;
> +
>  	return (count > r->capacity) ? r->capacity : count;
>  }
>  
> @@ -821,8 +1276,14 @@ static __rte_always_inline unsigned
>  rte_ring_mp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
>  			 unsigned int n, unsigned int *free_space)
>  {
> -	return __rte_ring_do_enqueue(r, obj_table, n,
> -			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +						RTE_RING_QUEUE_VARIABLE,
> +						__IS_MP, free_space);
> +	else
> +		return __rte_ring_do_enqueue(r, obj_table, n,
> +					     RTE_RING_QUEUE_VARIABLE,
> +					     __IS_MP, free_space);
>  }
>  
>  /**
> @@ -844,8 +1305,14 @@ static __rte_always_inline unsigned
>  rte_ring_sp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
>  			 unsigned int n, unsigned int *free_space)
>  {
> -	return __rte_ring_do_enqueue(r, obj_table, n,
> -			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +						RTE_RING_QUEUE_VARIABLE,
> +						__IS_SP, free_space);
> +	else
> +		return __rte_ring_do_enqueue(r, obj_table, n,
> +					     RTE_RING_QUEUE_VARIABLE,
> +					     __IS_SP, free_space);
>  }
>  
>  /**
> @@ -871,8 +1338,14 @@ static __rte_always_inline unsigned
>  rte_ring_enqueue_burst(struct rte_ring *r, void * const *obj_table,
>  		      unsigned int n, unsigned int *free_space)
>  {
> -	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_VARIABLE,
> -			r->prod.single, free_space);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_enqueue(r, obj_table, n,
> +						RTE_RING_QUEUE_VARIABLE,
> +						r->prod_64.single, free_space);
> +	else
> +		return __rte_ring_do_enqueue(r, obj_table, n,
> +					     RTE_RING_QUEUE_VARIABLE,
> +					     r->prod.single, free_space);
>  }
>  
>  /**
> @@ -899,8 +1372,14 @@ static __rte_always_inline unsigned
>  rte_ring_mc_dequeue_burst(struct rte_ring *r, void **obj_table,
>  		unsigned int n, unsigned int *available)
>  {
> -	return __rte_ring_do_dequeue(r, obj_table, n,
> -			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +						RTE_RING_QUEUE_VARIABLE,
> +						__IS_MC, available);
> +	else
> +		return __rte_ring_do_dequeue(r, obj_table, n,
> +					     RTE_RING_QUEUE_VARIABLE,
> +					     __IS_MC, available);
>  }
>  
>  /**
> @@ -924,8 +1403,14 @@ static __rte_always_inline unsigned
>  rte_ring_sc_dequeue_burst(struct rte_ring *r, void **obj_table,
>  		unsigned int n, unsigned int *available)
>  {
> -	return __rte_ring_do_dequeue(r, obj_table, n,
> -			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +						RTE_RING_QUEUE_VARIABLE,
> +						__IS_SC, available);
> +	else
> +		return __rte_ring_do_dequeue(r, obj_table, n,
> +					     RTE_RING_QUEUE_VARIABLE,
> +					     __IS_SC, available);
>  }
>  
>  /**
> @@ -951,9 +1436,14 @@ static __rte_always_inline unsigned
>  rte_ring_dequeue_burst(struct rte_ring *r, void **obj_table,
>  		unsigned int n, unsigned int *available)
>  {
> -	return __rte_ring_do_dequeue(r, obj_table, n,
> -				RTE_RING_QUEUE_VARIABLE,
> -				r->cons.single, available);
> +	if (r->flags & RING_F_NB)
> +		return __rte_ring_do_nb_dequeue(r, obj_table, n,
> +						RTE_RING_QUEUE_VARIABLE,
> +						r->cons_64.single, available);
> +	else
> +		return __rte_ring_do_dequeue(r, obj_table, n,
> +					     RTE_RING_QUEUE_VARIABLE,
> +					     r->cons.single, available);
>  }
>  
>  #ifdef __cplusplus
> diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
> index d935efd0d..8969467af 100644
> --- a/lib/librte_ring/rte_ring_version.map
> +++ b/lib/librte_ring/rte_ring_version.map
> @@ -17,3 +17,10 @@ DPDK_2.2 {
>  	rte_ring_free;
>  
>  } DPDK_2.0;
> +
> +DPDK_19.05 {
> +	global:
> +
> +	rte_ring_get_memsize;
> +
> +} DPDK_2.2;
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl



* Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
  2019-01-21 14:14             ` Burakov, Anatoly
@ 2019-01-22 18:27               ` Eads, Gage
  0 siblings, 0 replies; 123+ messages in thread
From: Eads, Gage @ 2019-01-22 18:27 UTC (permalink / raw)
  To: Burakov, Anatoly, Richardson, Bruce
  Cc: dev, olivier.matz, arybchenko, Ananyev, Konstantin



> -----Original Message-----
> From: Burakov, Anatoly
> Sent: Monday, January 21, 2019 8:15 AM
> To: Eads, Gage <gage.eads@intel.com>; Richardson, Bruce
> <bruce.richardson@intel.com>
> Cc: dev@dpdk.org; olivier.matz@6wind.com; arybchenko@solarflare.com;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to pointer-width size
> 
> On 11-Jan-19 7:27 PM, Eads, Gage wrote:
> >
> >
> >> -----Original Message-----
> >> From: Richardson, Bruce
> >> Sent: Friday, January 11, 2019 5:59 AM
> >> To: Burakov, Anatoly <anatoly.burakov@intel.com>
> >> Cc: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org;
> >> olivier.matz@6wind.com; arybchenko@solarflare.com; Ananyev,
> >> Konstantin <konstantin.ananyev@intel.com>
> >> Subject: Re: [dpdk-dev] [PATCH 1/6] ring: change head and tail to
> >> pointer-width size
> >>
> >> On Fri, Jan 11, 2019 at 11:30:24AM +0000, Burakov, Anatoly wrote:
> >>> On 11-Jan-19 10:58 AM, Bruce Richardson wrote:
> >>>> On Fri, Jan 11, 2019 at 10:40:19AM +0000, Burakov, Anatoly wrote:
> >>>>> <...>
> >>>>>
> >>>>>> + * Copyright(c) 2016-2019 Intel Corporation
> >>>>>>      */
> >>>>>>     /**
> >>>>>> @@ -88,7 +88,7 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
> >>>>>>     		const struct rte_event *events,
> >>>>>>     		unsigned int n, uint16_t *free_space)
> >>>>>>     {
> >>>>>> -	uint32_t prod_head, prod_next;
> >>>>>> +	uintptr_t prod_head, prod_next;
> >>>>>
> >>>>> I would also question the use of uintptr_t. I think semantically,
> >>>>> size_t is more appropriate.
> >>>>>
> >>>> Yes, it would, but I believe in this case they want to use the
> >>>> largest size of (unsigned)int where there exists an atomic for
> >>>> manipulating 2 of them simultaneously. [The largest size is to
> >>>> minimize any chance of an ABA issue occurring]. Therefore we need
> >>>> 32-bit values on 32-bit and 64-bit on 64, and I suspect the best
> >>>> way to guarantee this is to use pointer-sized values. If size_t is
> >>>> guaranteed across all OS's to have the same size as uintptr_t it
> >>>> could also be used, though.
> >>>>
> >>>> /Bruce
> >>>>
> >>>
> >>> Technically, size_t and uintptr_t are not guaranteed to match. In
> >>> practice, they won't match only on architectures that DPDK doesn't
> >>> intend to run on (such as 16-bit segmented archs, where size_t would
> >>> be 16-bit but uintptr_t would be 32-bit).
> >>>
> >>> In all the rest of DPDK code, we use size_t for this kind of thing.
> >>>
> >>
> >> Ok.
> >> If we do use size_t, I think we also need to add a compile-time check
> >> into the build too, to error out if sizeof(size_t) != sizeof(uintptr_t).
> >
> > Ok, I wasn't aware of the precedent of using size_t for this purpose. I'll change it and look into adding a static assert.
> 
> RTE_BUILD_BUG_ON?

Appreciate the pointer, but with the changes needed to preserve ABI compatibility* this is no longer necessary.

*http://mails.dpdk.org/archives/dev/2019-January/123775.html
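
For reference, the compile-time check discussed above could be sketched in
plain C11 as follows. Inside DPDK, RTE_BUILD_BUG_ON would be the idiomatic
choice; static_assert is used here only to keep the sketch self-contained, and
toy_index_from_ptr_width is a hypothetical helper:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fail the build if size_t cannot hold a pointer-width index. */
static_assert(sizeof(size_t) == sizeof(uintptr_t),
	      "size_t and uintptr_t must have the same width");

/* With the check above, this conversion cannot truncate. */
static inline size_t
toy_index_from_ptr_width(uintptr_t v)
{
	return (size_t)v;
}
```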

> 
> >
> > Thanks,
> > Gage
> >
> 
> 
> --
> Thanks,
> Anatoly


* Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-22  9:27     ` [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring Ola Liljedahl
  2019-01-22 10:15       ` Ola Liljedahl
@ 2019-01-22 19:15       ` Eads, Gage
  2019-01-23 16:02       ` Jerin Jacob Kollanukkaran
  2 siblings, 0 replies; 123+ messages in thread
From: Eads, Gage @ 2019-01-22 19:15 UTC (permalink / raw)
  To: Ola Liljedahl, dev
  Cc: olivier.matz, stephen, Richardson, Bruce, arybchenko, Ananyev,
	Konstantin



> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> Sent: Tuesday, January 22, 2019 3:28 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; stephen@networkplumber.org; Richardson, Bruce
> <bruce.richardson@intel.com>; arybchenko@solarflare.com; Ananyev,
> Konstantin <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> 
> On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > For some users, the rte ring's "non-preemptive" constraint is not
> > acceptable; for example, if the application uses a mixture of pinned
> > high-priority threads and multiplexed low-priority threads that share
> > a mempool.
> >
> > This patchset introduces a non-blocking ring, on top of which a
> > mempool can run.
> > Crucially, the non-blocking algorithm relies on a 128-bit
> > compare-and-swap, so it is currently limited to x86_64 machines. This is
> > also an experimental API, so RING_F_NB users must build with the
> > ALLOW_EXPERIMENTAL_API flag.
> >
> > The ring uses more compare-and-swap atomic operations than the regular
> > rte ring:
> > With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> > operations and a dequeue of n pointers uses 2. This algorithm has
> > worse average-case performance than the regular rte ring (particularly
> > a highly-contended ring with large bulk accesses), however:
> > - For applications with preemptible pthreads, the regular rte ring's
> >   worst-case performance (i.e. one thread being preempted in the
> >   update_tail() critical section) is much worse than the non-blocking
> >   ring's.
> > - Software caching can mitigate the average case performance for
> >   ring-based algorithms. For example, a non-blocking ring based mempool
> >   (a likely use case for this ring) with per-thread caching.
> >
> > The non-blocking ring is enabled via a new flag, RING_F_NB. For
> > ease-of-use, existing ring enqueue/dequeue functions work with both
> > "regular" and non-blocking rings.
> >
> > This patchset also adds non-blocking versions of ring_autotest and
> > ring_perf_autotest, and a non-blocking ring based mempool.
> >
> > This patchset makes one API change; a deprecation notice will be
> > posted in a separate commit.
> >
> > This patchset depends on the non-blocking stack patchset[1].
> >
> > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> >
> > v3:
> >  - Avoid the ABI break by putting 64-bit head and tail values in the
> > same
> >    cacheline as struct rte_ring's prod and cons members.
> >  - Don't attempt to compile rte_atomic128_cmpset without
> >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > libraries.
> >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > someone tries
> >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> >  - Update the ring mempool to use experimental APIs
> >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> > ARMv8.1-A builds
> >    can eventually support it with the CASP instruction.
> ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> operation using LDXP/STXP.

I see, I wasn't aware these instructions were available.

> 
> From an ARM perspective, I want all atomic operations to take memory ordering
> arguments (e.g. acquire, release). Not all usages of e.g.
> atomic compare exchange require sequential consistency (which I think is what
> the x86 cmpxchg instruction provides). DPDK functions should not be modelled
> after x86 behaviour.
> 
> Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64 are
> available here:
> https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> 

Sure, I'll address this in the next patchset.

Thanks,
Gage


* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-22 14:49       ` Ola Liljedahl
@ 2019-01-22 21:31         ` Eads, Gage
  2019-01-23 10:16           ` Ola Liljedahl
  0 siblings, 1 reply; 123+ messages in thread
From: Eads, Gage @ 2019-01-22 21:31 UTC (permalink / raw)
  To: Ola Liljedahl, dev
  Cc: olivier.matz, stephen, nd, Richardson, Bruce, arybchenko,
	Ananyev, Konstantin

Hi Ola,

<snip>

> > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > rte_ring *r);
> >  #endif
> >  #include "rte_ring_generic_64.h"
> >
> > +/* @internal 128-bit structure used by the non-blocking ring */
> > +struct nb_ring_entry {
> > +	void *ptr; /**< Data pointer */
> > +	uint64_t cnt; /**< Modification counter */
> Why not make 'cnt' uintptr_t? This way 32-bit architectures will also be
> supported. I think there are some claims that DPDK still supports e.g. ARMv7a
> and possibly also 32-bit x86?

I chose a 64-bit modification counter because (practically speaking) the ABA problem will not occur with such a large counter -- definitely not within my lifetime. See the "Discussion" section of the commit message for more information.

With a 32-bit counter, there is a very (very) low likelihood of it, but it is possible. Personally, I don't feel comfortable providing such code, because a) I doubt all users would understand the implementation well enough to do the risk/reward analysis, and b) such a bug would be near impossible to reproduce and root-cause if it did occur.
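
To put rough numbers on the counter-width trade-off, here is a minimal sketch of how long it takes a modification counter to come back around to the same value. It assumes a sustained rate of 100 million enqueues per second on one ring, which is an illustrative figure and not a measurement, and `wrap_seconds` is a made-up helper name. Because each slot's counter advances by the ring size once per lap, a counter of a given width repeats after 2^bits enqueues in total.

```c
#include <assert.h>

/* Seconds until a 'bits'-wide modification counter returns to the same
 * value, assuming the ring sustains 'ops_per_sec' enqueues per second
 * (a hypothetical rate chosen only for illustration).
 */
static double
wrap_seconds(unsigned int bits, double ops_per_sec)
{
	double span = 1.0;
	unsigned int i;

	assert(ops_per_sec > 0.0);

	for (i = 0; i < bits; i++)
		span *= 2.0; /* span = 2^bits; exact in a double for bits <= 64 */

	return span / ops_per_sec;
}
```

Under that assumed rate, `wrap_seconds(32, 1e8)` is roughly 43 seconds, while `wrap_seconds(64, 1e8)` is on the order of 1.8e11 seconds (several thousand years). A thread would have to stall for that whole interval between reading the entry and issuing the cmpset for the ABA problem to occur.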

> 
> > +};
> > +
> > +/* The non-blocking ring algorithm is based on the original rte ring
> > +(derived
> > + * from FreeBSD's bufring.h) and inspired by Michael and Scott's
> > +non-blocking
> > + * concurrent queue.
> > + */
> > +
> > +/**
> > + * @internal
> > + *   Enqueue several objects on the non-blocking ring
> > +(single-producer only)
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param n
> > + *   The number of objects to add in the ring from the obj_table.
> > + * @param behavior
> > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> > +ring
> > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> > +the ring
> > + * @param free_space
> > + *   returns the amount of space after the enqueue operation has
> > +finished
> > + * @return
> > + *   Actual number of objects enqueued.
> > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > + */
> > +static __rte_always_inline unsigned int
> > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> > +			    unsigned int n,
> > +			    enum rte_ring_queue_behavior behavior,
> > +			    unsigned int *free_space)
> > +{
> > +	uint32_t free_entries;
> > +	size_t head, next;
> > +
> > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > +					 &head, &next, &free_entries);
> > +	if (n == 0)
> > +		goto end;
> > +
> > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > +
> > +	r->prod_64.tail += n;
> Don't we need release order when (or smp_wmb between) writing of the ring
> pointers and the update of tail? By updating the tail pointer, we are
> synchronising with a consumer.
> 
> I prefer using __atomic operations even for load and store. You can see which
> parts of the code that synchronise with each other, e.g. store-release to some
> location synchronises with load-acquire from the same location. If you don't
> know how different threads synchronise with each other, you are very likely to
> make mistakes.
> 

You can tell this code was written when I thought x86-64 was the only viable target :). Yes, you are correct.

With regard to using __atomic intrinsics, I'm planning on taking a similar approach to the functions duplicated in rte_ring_generic.h and rte_ring_c11_mem.h: one version that uses rte_atomic functions (and thus stricter memory ordering) and one that uses __atomic intrinsics (and thus can benefit from more relaxed memory ordering).
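
As an illustration of the store-release/load-acquire pairing discussed above, here is a minimal single-producer/single-consumer sketch using the GCC __atomic builtins. The `toy_ring` structure, its field names, and the fixed size are assumptions made for this example; they are not the patch's data structures.

```c
#include <stdint.h>

#define TOY_RING_SIZE 8u /* hypothetical size; must be a power of two */

struct toy_ring {
	uint64_t prod_tail; /* written by the producer, read by the consumer */
	uint64_t cons_tail; /* written by the consumer, read by the producer */
	void *slots[TOY_RING_SIZE];
};

/* Single-producer enqueue of one pointer; returns 1 on success, 0 if full. */
static unsigned int
toy_enqueue_sp(struct toy_ring *r, void *obj)
{
	/* Only the producer writes prod_tail, so a relaxed load suffices. */
	uint64_t tail = __atomic_load_n(&r->prod_tail, __ATOMIC_RELAXED);
	/* Acquire pairs with the consumer's store-release of cons_tail. */
	uint64_t cons = __atomic_load_n(&r->cons_tail, __ATOMIC_ACQUIRE);

	if (tail - cons == TOY_RING_SIZE)
		return 0;

	r->slots[tail & (TOY_RING_SIZE - 1)] = obj;

	/* Release: the slot write above becomes visible before the new tail. */
	__atomic_store_n(&r->prod_tail, tail + 1, __ATOMIC_RELEASE);
	return 1;
}

/* Single-consumer dequeue of one pointer; returns 1 on success, 0 if empty. */
static unsigned int
toy_dequeue_sc(struct toy_ring *r, void **obj)
{
	uint64_t tail = __atomic_load_n(&r->cons_tail, __ATOMIC_RELAXED);
	/* Acquire pairs with the producer's store-release of prod_tail. */
	uint64_t prod = __atomic_load_n(&r->prod_tail, __ATOMIC_ACQUIRE);

	if (prod == tail)
		return 0;

	*obj = r->slots[tail & (TOY_RING_SIZE - 1)];

	/* Release: the slot read completes before the slot is handed back. */
	__atomic_store_n(&r->cons_tail, tail + 1, __ATOMIC_RELEASE);
	return 1;
}
```

Each store-release has exactly one matching load-acquire on the other side, which is the property that makes the synchronisation points easy to see in the code.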

> > +
> > +end:
> > +	if (free_space != NULL)
> > +		*free_space = free_entries - n;
> > +	return n;
> > +}
> > +
> > +/**
> > + * @internal
> > + *   Enqueue several objects on the non-blocking ring (multi-producer
> > +safe)
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param n
> > + *   The number of objects to add in the ring from the obj_table.
> > + * @param behavior
> > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> > +ring
> > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> > +the ring
> > + * @param free_space
> > + *   returns the amount of space after the enqueue operation has
> > +finished
> > + * @return
> > + *   Actual number of objects enqueued.
> > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > + */
> > +static __rte_always_inline unsigned int
> > +__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
> > +			    unsigned int n,
> > +			    enum rte_ring_queue_behavior behavior,
> > +			    unsigned int *free_space)
> > +{
> > +#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
> > +	RTE_SET_USED(r);
> > +	RTE_SET_USED(obj_table);
> > +	RTE_SET_USED(n);
> > +	RTE_SET_USED(behavior);
> > +	RTE_SET_USED(free_space);
> > +#ifndef ALLOW_EXPERIMENTAL_API
> > +	printf("[%s()] RING_F_NB requires an experimental API."
> > +	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
> > +	       , __func__);
> > +#endif
> > +	return 0;
> > +#endif
> > +#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
> > +	size_t head, next, tail;
> > +	uint32_t free_entries;
> > +	unsigned int i;
> > +
> > +	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
> > +					 &head, &next, &free_entries);
> > +	if (n == 0)
> > +		goto end;
> > +
> > +	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
> > +		struct nb_ring_entry old_value, new_value;
> > +		struct nb_ring_entry *ring_ptr;
> > +
> > +		/* Enqueue to the tail entry. If another thread wins the
> > race,
> > +		 * retry with the new tail.
> > +		 */
> > +		tail = r->prod_64.tail;
> > +
> > +		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r->mask];
> This is an ugly expression and cast. Also I think it is unnecessary. What's
> preventing this from being written without a cast? Perhaps the ring array needs
> to be a union of "void *" and struct nb_ring_entry?

The cast is necessary for the correct pointer arithmetic (let "uintptr_t base == &r[1]"):
- With cast: ring_ptr = base + sizeof(struct nb_ring_entry) * (tail & r->mask);
- W/o cast: ring_ptr = base + sizeof(struct rte_ring) * (tail & r->mask);

FWIW, this is essentially the same as is done with the second argument (&r[1]) to ENQUEUE_PTRS and DEQUEUE_PTRS, but there it's split across multiple lines of code. The equivalent here would be:
 
struct nb_ring_entry *ring_base = (struct nb_ring_entry*)&r[1];
ring_ptr = &ring_base[tail & r->mask];

Which is more legible, I think.

There is no ring array structure in which to add a union; the ring array is a contiguous chunk of memory that immediately follows after the end of a struct rte_ring. We interpret the memory there according to the ring entry data type (void * for regular rings and struct nb_ring_entry for non-blocking rings).
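
The layout described here can be sketched in isolation. In this hedged example, `toy_hdr` stands in for struct rte_ring and `toy_nb_entry` for struct nb_ring_entry; both names and the allocation helper are hypothetical and exist only to show why the cast matters for the index arithmetic.

```c
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for struct rte_ring: a header followed directly by the entries. */
struct toy_hdr {
	uint64_t size;
	uint64_t mask;
};

/* Stand-in for struct nb_ring_entry. */
struct toy_nb_entry {
	void *ptr;
	uint64_t cnt;
};

/* Allocate header and entry array as one contiguous chunk.
 * 'size' is assumed to be a power of two.
 */
static struct toy_hdr *
toy_nb_ring_alloc(uint64_t size)
{
	struct toy_hdr *r;

	r = calloc(1, sizeof(*r) + size * sizeof(struct toy_nb_entry));
	if (r == NULL)
		return NULL;

	r->size = size;
	r->mask = size - 1;
	return r;
}

static struct toy_nb_entry *
toy_nb_entry_at(struct toy_hdr *r, uint64_t idx)
{
	/* &r[1] is the first byte past the header. Casting it first makes
	 * the index scale by the entry size rather than the header size.
	 */
	struct toy_nb_entry *base = (struct toy_nb_entry *)&r[1];

	return &base[idx & r->mask];
}
```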

> 
> > +
> > +		old_value = *ring_ptr;
> > +
> > +		/* If the tail entry's modification counter doesn't match the
> > +		 * producer tail index, it's already been updated.
> > +		 */
> > +		if (old_value.cnt != tail)
> > +			continue;
> Continue restarts the loop at the condition test in the for statement, 'i' and 'n'
> are unchanged. Then we re-read 'prod_64.tail' and 'ring[tail & mask]'. If some
> other thread never updates 'prod_64.tail', the test here (ring[tail].cnt != tail) will
> still be true and we will spin forever.
> 
> Waiting for other threads <=> blocking behaviour so this is not a non-blocking
> design.
> 

You're absolutely right. The if-statement was added as an optimization to avoid 128-bit cmpset operations that are known to fail, but in this form it violates the non-blocking design.

I see two solutions: 1) drop the if-statement altogether, or 2) attempt to update prod_64.tail before continuing. Both require every thread to attempt to update prod_64.tail on every iteration, but #2 will result in fewer failed 128-bit cmpsets.
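
A minimal sketch of option 2 follows, showing only the "help advance the tail" step and not the 128-bit entry update. The helper name `nb_help_advance_tail` and its signature are assumptions for illustration, as is the choice of release ordering on success; the real code would need the ordering that the final algorithm requires.

```c
#include <stdbool.h>
#include <stdint.h>

/* If the entry at 'tail' was already written by another producer (its
 * counter no longer equals 'tail'), try to move the shared tail forward
 * ourselves instead of waiting for that producer to do it.
 *
 * Returns true if the caller should re-read the tail and retry, false if
 * the slot is still free and the caller should attempt the enqueue.
 */
static inline bool
nb_help_advance_tail(uint64_t *shared_tail, uint64_t tail, uint64_t entry_cnt)
{
	if (entry_cnt == tail)
		return false;

	/* A failed CAS is fine: it means some other thread already advanced
	 * the tail, so progress was made either way.
	 */
	__atomic_compare_exchange_n(shared_tail, &tail, tail + 1, false,
				    __ATOMIC_RELEASE, __ATOMIC_RELAXED);
	return true;
}
```

With this in place of the bare `continue`, a producer that observes a stale tail never depends on a preempted thread to move it.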

> > +
> > +		/* Prepare the new entry. The cnt field mitigates the ABA
> > +		 * problem on the ring write.
> > +		 */
> > +		new_value.ptr = obj_table[i];
> > +		new_value.cnt = tail + r->size;
> > +
> > +		if (rte_atomic128_cmpset((volatile rte_int128_t *)ring_ptr,
> > +					 (rte_int128_t *)&old_value,
> > +					 (rte_int128_t *)&new_value))
> > +			i++;
> > +
> > +		/* Every thread attempts the cmpset, so they don't have to
> > wait
> > +		 * for the thread that successfully enqueued to the ring.
> > +		 * Using a 64-bit tail mitigates the ABA problem here.
> > +		 *
> > +		 * Built-in used to handle variable-sized tail index.
> > +		 */
> But prod_64.tail is 64 bits so not really variable size?
> 

(See next comment)

> > +		__sync_bool_compare_and_swap(&r->prod_64.tail, tail, tail + 1);
> What memory order is required here? Why not use
> __atomic_compare_exchange() with explicit memory order parameters?
> 

This is an artifact from an older patchset that used uintptr_t, and before I learned that other platforms could support this non-blocking algorithm (hence the __sync type intrinsic was sufficient).

At any rate, as described earlier in this response, I plan on writing these functions using the __atomic builtins in the next patchset.

> > +	}
> > +
> > +end:
> > +	if (free_space != NULL)
> > +		*free_space = free_entries - n;
> > +	return n;
> > +#endif
> > +}
> > +
> > +/**
> > + * @internal Enqueue several objects on the non-blocking ring
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param n
> > + *   The number of objects to add in the ring from the obj_table.
> > + * @param behavior
> > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> > +ring
> > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> > +the ring
> > + * @param is_sp
> > + *   Indicates whether to use single producer or multi-producer head
> > +update
> > + * @param free_space
> > + *   returns the amount of space after the enqueue operation has
> > +finished
> > + * @return
> > + *   Actual number of objects enqueued.
> > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > + */
> > +static __rte_always_inline unsigned int
> > +__rte_ring_do_nb_enqueue(struct rte_ring *r, void * const *obj_table,
> > +			 unsigned int n, enum rte_ring_queue_behavior
> > behavior,
> > +			 unsigned int is_sp, unsigned int *free_space) {
> > +	if (is_sp)
> > +		return __rte_ring_do_nb_enqueue_sp(r, obj_table, n,
> > +						   behavior, free_space);
> > +	else
> > +		return __rte_ring_do_nb_enqueue_mp(r, obj_table, n,
> > +						   behavior, free_space);
> > +}
> > +
> > +/**
> > + * @internal
> > + *   Dequeue several objects from the non-blocking ring
> > +(single-consumer
> > only)
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param n
> > + *   The number of objects to pull from the ring.
> > + * @param behavior
> > + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from
> > + the ring
> > + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from
> > + the ring
> > + * @param available
> > + *   returns the number of remaining ring entries after the dequeue
> > + has
> > finished
> > + * @return
> > + *   - Actual number of objects dequeued.
> > + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > + */
> > +static __rte_always_inline unsigned int
> > +__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
> > +			    unsigned int n,
> > +			    enum rte_ring_queue_behavior behavior,
> > +			    unsigned int *available)
> > +{
> > +	size_t head, next;
> > +	uint32_t entries;
> > +
> > +	n = __rte_ring_move_cons_head_64(r, 1, n, behavior,
> > +					 &head, &next, &entries);
> > +	if (n == 0)
> > +		goto end;
> > +
> > +	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > +
> > +	r->cons_64.tail += n;
> Memory ordering? Consumer synchronises with producer.
> 

Agreed, that is missing here. Will fix.

Thanks,
Gage

> --
> Ola Liljedahl, Networking System Architect, Arm Phone +46706866373, Skype
> ola.liljedahl



* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-22 21:31         ` Eads, Gage
@ 2019-01-23 10:16           ` Ola Liljedahl
  2019-01-25 17:21             ` Eads, Gage
  0 siblings, 1 reply; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-23 10:16 UTC (permalink / raw)
  To: gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> Hi Ola,
> 
> <snip>
> 
> > 
> > > 
> > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > > rte_ring *r);
> > >  #endif
> > >  #include "rte_ring_generic_64.h"
> > > 
> > > +/* @internal 128-bit structure used by the non-blocking ring */
> > > +struct nb_ring_entry {
> > > +	void *ptr; /**< Data pointer */
> > > +	uint64_t cnt; /**< Modification counter */
> > Why not make 'cnt' uintptr_t? This way 32-bit architectures will also be
> > supported. I think there are some claims that DPDK still supports e.g.
> > ARMv7a
> > and possibly also 32-bit x86?
> I chose a 64-bit modification counter because (practically speaking) the ABA
> problem will not occur with such a large counter -- definitely not within my
> lifetime. See the "Discussion" section of the commit message for more
> information.
> 
> With a 32-bit counter, there is a very (very) low likelihood of it, but it is
> possible. Personally, I don't feel comfortable providing such code, because a)
> I doubt all users would understand the implementation well enough to do the
> risk/reward analysis, and b) such a bug would be near impossible to reproduce
> and root-cause if it did occur.
With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g. ARMv7a
and probably x86 as well) won't be able to support this as they at best support
64-bit CAS (ARMv7a has LDREXD/STREXD). So you are essentially putting a 64-bit
(and 128-bit CAS) requirement on the implementation.
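
The width requirement can be made concrete with a small sketch. Both structure names and the `CAS_BITS` macro are hypothetical: with a pointer-width counter the entry is always exactly two pointers wide, so the required CAS is 64-bit on a 32-bit target and 128-bit on a 64-bit target, whereas a fixed uint64_t counter always pushes the entry past 64 bits.

```c
#include <stdint.h>

/* Entry with a pointer-width modification counter (the uintptr_t proposal). */
struct entry_ptrwidth_cnt {
	void *ptr;
	uintptr_t cnt;
};

/* Entry with a fixed 64-bit modification counter (as in the patch). */
struct entry_64bit_cnt {
	void *ptr;
	uint64_t cnt;
};

/* Width in bits of the compare-and-swap needed to update a whole entry. */
#define CAS_BITS(type) (sizeof(type) * 8)

/* The pointer-width variant is always a double-width CAS, nothing more. */
_Static_assert(sizeof(struct entry_ptrwidth_cnt) == 2 * sizeof(void *),
	       "entry must be exactly two pointers wide");
```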

> 
> > 
> > 
> > > 
> > > +};
> > > +
> > > +/* The non-blocking ring algorithm is based on the original rte ring
> > > +(derived
> > > + * from FreeBSD's bufring.h) and inspired by Michael and Scott's
> > > +non-blocking
> > > + * concurrent queue.
> > > + */
> > > +
> > > +/**
> > > + * @internal
> > > + *   Enqueue several objects on the non-blocking ring
> > > +(single-producer only)
> > > + *
> > > + * @param r
> > > + *   A pointer to the ring structure.
> > > + * @param obj_table
> > > + *   A pointer to a table of void * pointers (objects).
> > > + * @param n
> > > + *   The number of objects to add in the ring from the obj_table.
> > > + * @param behavior
> > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> > > +ring
> > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> > > +the ring
> > > + * @param free_space
> > > + *   returns the amount of space after the enqueue operation has
> > > +finished
> > > + * @return
> > > + *   Actual number of objects enqueued.
> > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> > > +			    unsigned int n,
> > > +			    enum rte_ring_queue_behavior behavior,
> > > +			    unsigned int *free_space)
> > > +{
> > > +	uint32_t free_entries;
> > > +	size_t head, next;
> > > +
> > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > +					 &head, &next, &free_entries);
> > > +	if (n == 0)
> > > +		goto end;
> > > +
> > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > +
> > > +	r->prod_64.tail += n;
> > Don't we need release order when (or smp_wmb between) writing of the ring
> > pointers and the update of tail? By updating the tail pointer, we are
> > synchronising with a consumer.
> > 
> > I prefer using __atomic operations even for load and store. You can see
> > which
> > parts of the code that synchronise with each other, e.g. store-release to
> > some
> > location synchronises with load-acquire from the same location. If you don't
> > know how different threads synchronise with each other, you are very likely
> > to
> > make mistakes.
> > 
> You can tell this code was written when I thought x86-64 was the only viable
> target :). Yes, you are correct.
> 
> With regards to using __atomic intrinsics, I'm planning on taking a similar
> approach to the functions duplicated in rte_ring_generic.h and
> rte_ring_c11_mem.h: one version that uses rte_atomic functions (and thus
> stricter memory ordering) and one that uses __atomic intrinsics (and thus can
> benefit from more relaxed memory ordering).
What's the advantage of having two different implementations? What is the
disadvantage?

The existing ring buffer code originally had only the "legacy" implementation
which was kept when the __atomic implementation was added. The reason claimed
was that some older compilers for x86 do not support GCC __atomic builtins. But
I thought there was consensus that new functionality could have only __atomic
implementations.

Does the non-blocking ring buffer implementation have to support these older
compilers? Will the applications that require these older compilers be updated
to utilise the non-blocking ring buffer?

> 
> > 
> > > 
> > > +
> > > +end:
> > > +	if (free_space != NULL)
> > > +		*free_space = free_entries - n;
> > > +	return n;
> > > +}
> > > +
> > > +/**
> > > + * @internal
> > > + *   Enqueue several objects on the non-blocking ring (multi-producer
> > > +safe)
> > > + *
> > > + * @param r
> > > + *   A pointer to the ring structure.
> > > + * @param obj_table
> > > + *   A pointer to a table of void * pointers (objects).
> > > + * @param n
> > > + *   The number of objects to add in the ring from the obj_table.
> > > + * @param behavior
> > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> > > +ring
> > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> > > +the ring
> > > + * @param free_space
> > > + *   returns the amount of space after the enqueue operation has
> > > +finished
> > > + * @return
> > > + *   Actual number of objects enqueued.
> > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
> > > +			    unsigned int n,
> > > +			    enum rte_ring_queue_behavior behavior,
> > > +			    unsigned int *free_space)
> > > +{
> > > +#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
> > > +	RTE_SET_USED(r);
> > > +	RTE_SET_USED(obj_table);
> > > +	RTE_SET_USED(n);
> > > +	RTE_SET_USED(behavior);
> > > +	RTE_SET_USED(free_space);
> > > +#ifndef ALLOW_EXPERIMENTAL_API
> > > +	printf("[%s()] RING_F_NB requires an experimental API."
> > > +	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
> > > +	       , __func__);
> > > +#endif
> > > +	return 0;
> > > +#endif
> > > +#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
> > > +	size_t head, next, tail;
> > > +	uint32_t free_entries;
> > > +	unsigned int i;
> > > +
> > > +	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
> > > +					 &head, &next, &free_entries);
> > > +	if (n == 0)
> > > +		goto end;
> > > +
> > > +	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
> > > +		struct nb_ring_entry old_value, new_value;
> > > +		struct nb_ring_entry *ring_ptr;
> > > +
> > > +		/* Enqueue to the tail entry. If another thread wins the
> > > race,
> > > +		 * retry with the new tail.
> > > +		 */
> > > +		tail = r->prod_64.tail;
> > > +
> > > +		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r-
> > > >mask];
> > This is an ugly expression and cast. Also I think it is unnecessary. What's
> > preventing this from being written without a cast? Perhaps the ring array
> > needs
> > to be a union of "void *" and struct nb_ring_entry?
> The cast is necessary for the correct pointer arithmetic (let "uintptr_t base
> == &r[1]"):
Yes I know the C language.

> - With cast: ring_ptr = base + sizeof(struct nb_ring_entry) * (tail & r->mask);
> - W/o cast: ring_ptr = base + sizeof(struct rte_ring) * (tail & r->mask);
> 
> FWIW, this is essentially the same as is done with the second argument (&r[1])
> to ENQUEUE_PTRS and DEQUEUE_PTRS, but there it's split across multiple lines
> of code. The equivalent here would be:
>  
> struct nb_ring_entry *ring_base = (struct nb_ring_entry*)&r[1];
> ring_ptr = &ring_base[tail & r->mask];
> 
> Which is more legible, I think.
The RTE ring buffer code is not very legible to start with.

> 
> There is no ring array structure in which to add a union; the ring array is a
> contiguous chunk of memory that immediately follows after the end of a struct
> rte_ring. We interpret the memory there according to the ring entry data type
> (void * for regular rings and struct nb_ring_entry for non-blocking rings).
My worry is that we are abusing the C language and creating a monster of fragile
C code that will be more and more difficult to understand and to maintain. At
some point you have to ask the question "Are we doing the right thing?".

> 
> > 
> > 
> > > 
> > > +
> > > +		old_value = *ring_ptr;
> > > +
> > > +		/* If the tail entry's modification counter doesn't match
> > > the
> > > +		 * producer tail index, it's already been updated.
> > > +		 */
> > > +		if (old_value.cnt != tail)
> > > +			continue;
> > Continue restarts the loop at the condition test in the for statement, 'i'
> > and 'n'
> > are unchanged. Then we re-read 'prod_64.tail' and 'ring[tail & mask]'. If
> > some
> > other thread never updates 'prod_64.tail', the test here (ring[tail].cnt !=
> > tail) will
> > still be true and we will spin forever.
> > 
> > Waiting for other threads <=> blocking behaviour so this is not a
> > non-blocking design.
> > 
> You're absolutely right. The if-statement was added as an optimization to avoid
> 128-bit cmpset operations that are known to fail, but in this form it violates
> the non-blocking design.
> 
> I see two solutions: 1) drop the if-statement altogether, or 2) attempt to
> update prod_64.tail before continuing. Both require every thread to attempt to
> update prod_64.tail on every iteration, but #2 will result in fewer failed
> 128-bit cmpsets.
> 
> > 
> > > 
> > > +
> > > +		/* Prepare the new entry. The cnt field mitigates the ABA
> > > +		 * problem on the ring write.
> > > +		 */
> > > +		new_value.ptr = obj_table[i];
> > > +		new_value.cnt = tail + r->size;
> > > +
> > > +		if (rte_atomic128_cmpset((volatile rte_int128_t
> > > *)ring_ptr,
> > > +					 (rte_int128_t *)&old_value,
> > > +					 (rte_int128_t *)&new_value))
> > > +			i++;
> > > +
> > > +		/* Every thread attempts the cmpset, so they don't have
> > > to
> > > wait
> > > +		 * for the thread that successfully enqueued to the ring.
> > > +		 * Using a 64-bit tail mitigates the ABA problem here.
> > > +		 *
> > > +		 * Built-in used to handle variable-sized tail index.
> > > +		 */
> > But prod_64.tail is 64 bits so not really variable size?
> > 
> (See next comment)
> 
> > 
> > > 
> > > +		__sync_bool_compare_and_swap(&r->prod_64.tail, tail, tail
> > > +
> > > 1);
> > What memory order is required here? Why not use
> > __atomic_compare_exchange() with explicit memory order parameters?
> > 
> This is an artifact from an older patchset that used uintptr_t, and before I
> learned that other platforms could support this non-blocking algorithm (hence
> the __sync type intrinsic was sufficient).
> 
> At any rate, as described earlier in this response, I plan on writing these
> functions using the __atomic builtins in the next patchset.
Great.

> 
> > 
> > > 
> > > +	}
> > > +
> > > +end:
> > > +	if (free_space != NULL)
> > > +		*free_space = free_entries - n;
> > > +	return n;
> > > +#endif
> > > +}
> > > +
> > > +/**
> > > + * @internal Enqueue several objects on the non-blocking ring
> > > + *
> > > + * @param r
> > > + *   A pointer to the ring structure.
> > > + * @param obj_table
> > > + *   A pointer to a table of void * pointers (objects).
> > > + * @param n
> > > + *   The number of objects to add in the ring from the obj_table.
> > > + * @param behavior
> > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the
> > > +ring
> > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to
> > > +the ring
> > > + * @param is_sp
> > > + *   Indicates whether to use single producer or multi-producer head
> > > +update
> > > + * @param free_space
> > > + *   returns the amount of space after the enqueue operation has
> > > +finished
> > > + * @return
> > > + *   Actual number of objects enqueued.
> > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__rte_ring_do_nb_enqueue(struct rte_ring *r, void * const *obj_table,
> > > +			 unsigned int n, enum rte_ring_queue_behavior
> > > behavior,
> > > +			 unsigned int is_sp, unsigned int *free_space) {
> > > +	if (is_sp)
> > > +		return __rte_ring_do_nb_enqueue_sp(r, obj_table, n,
> > > +						   behavior, free_space);
> > > +	else
> > > +		return __rte_ring_do_nb_enqueue_mp(r, obj_table, n,
> > > +						   behavior, free_space);
> > > +}
> > > +
> > > +/**
> > > + * @internal
> > > + *   Dequeue several objects from the non-blocking ring
> > > +(single-consumer
> > > only)
> > > + *
> > > + * @param r
> > > + *   A pointer to the ring structure.
> > > + * @param obj_table
> > > + *   A pointer to a table of void * pointers (objects).
> > > + * @param n
> > > + *   The number of objects to pull from the ring.
> > > + * @param behavior
> > > + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from
> > > + the ring
> > > + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from
> > > + the ring
> > > + * @param available
> > > + *   returns the number of remaining ring entries after the dequeue
> > > + has
> > > finished
> > > + * @return
> > > + *   - Actual number of objects dequeued.
> > > + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
> > > +			    unsigned int n,
> > > +			    enum rte_ring_queue_behavior behavior,
> > > +			    unsigned int *available)
> > > +{
> > > +	size_t head, next;
> > > +	uint32_t entries;
> > > +
> > > +	n = __rte_ring_move_cons_head_64(r, 1, n, behavior,
> > > +					 &head, &next, &entries);
> > > +	if (n == 0)
> > > +		goto end;
> > > +
> > > +	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > +
> > > +	r->cons_64.tail += n;
> > Memory ordering? Consumer synchronises with producer.
> > 
> Agreed, that is missing here. Will fix.
> 
> Thanks,
> Gage
> 
> > 
> > --
> > Ola Liljedahl, Networking System Architect, Arm Phone +46706866373, Skype
> > ola.liljedahl
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl



* Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-22  9:27     ` [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring Ola Liljedahl
  2019-01-22 10:15       ` Ola Liljedahl
  2019-01-22 19:15       ` Eads, Gage
@ 2019-01-23 16:02       ` Jerin Jacob Kollanukkaran
  2019-01-23 16:29         ` Ola Liljedahl
  2 siblings, 1 reply; 123+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-01-23 16:02 UTC (permalink / raw)
  To: Ola.Liljedahl, gage.eads, dev
  Cc: olivier.matz, stephen, bruce.richardson, arybchenko, konstantin.ananyev

On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > v3:
> >  - Avoid the ABI break by putting 64-bit head and tail values in
> > the
> > same
> >    cacheline as struct rte_ring's prod and cons members.
> >  - Don't attempt to compile rte_atomic128_cmpset without
> >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > libraries.
> >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > someone tries
> >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> >  - Update the ring mempool to use experimental APIs
> >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> > ARMv8.1-A builds
> >    can eventually support it with the CASP instruction.
> ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> operation using LDXP/STXP.

Just wondering what the performance difference would be between CASP and
LDXP/STXP on a machine that supports LSE?

I think we cannot detect the presence of LSE support at compile time.
Right?

A dynamic check would be costly, like:

if (hwcaps & HWCAP_ATOMICS) {
	casp
} else {
	ldxp
	stxp
}
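
For reference, a hedged sketch of how the capability lookup itself could be done on Linux/AArch64, using getauxval(AT_HWCAP) and HWCAP_ATOMICS. The function name `cpu_has_lse` is hypothetical, and the compile-time branch assumes the toolchain defines the ACLE macro __ARM_FEATURE_ATOMICS when it is told to target LSE. On other platforms the sketch simply reports false.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>  /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h> /* HWCAP_ATOMICS */
#endif

/* Report whether LSE atomics (e.g. CASP) may be used on this CPU. */
static bool
cpu_has_lse(void)
{
#if defined(__ARM_FEATURE_ATOMICS)
	/* The compiler was told to target LSE, so no runtime check is needed. */
	return true;
#elif defined(__linux__) && defined(__aarch64__)
	/* Query the kernel once and cache the answer. The unsynchronised
	 * cache is a simplification; real code would resolve this at init.
	 */
	static int cached = -1;

	if (cached < 0)
		cached = (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
	return cached != 0;
#else
	return false;
#endif
}
```

This only moves the lookup out of the fast path; the per-operation branch between the CASP and LDXP/STXP sequences remains.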

> From an ARM perspective, I want all atomic operations to take memory
> ordering arguments (e.g. acquire, release). Not all usages of e.g.

+1

> atomic compare exchange require sequential consistency (which I think
> what x86 cmpxchg instruction provides). DPDK functions should not be
> modelled after x86 behaviour.
> 
> Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
> are available here:
> https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> 
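
To illustrate the memory-ordering point, a sketch using the GCC/Clang __atomic
builtins. The wrapper name cas64_order() and the caller claim_next() are
hypothetical, and this is only an illustration, not a proposed DPDK API:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wrapper: a 64-bit compare-exchange whose success and failure
 * orderings are chosen by the caller instead of being fixed at seq_cst.
 */
static inline bool
cas64_order(uint64_t *dst, uint64_t *expected, uint64_t desired,
	    int success, int failure)
{
	/* Strong variant (weak = false): no spurious failures. */
	return __atomic_compare_exchange_n(dst, expected, desired, false,
					   success, failure);
}

/* Example caller: claim the next index, needing only acquire on success. */
static inline uint64_t
claim_next(uint64_t *head)
{
	uint64_t old = __atomic_load_n(head, __ATOMIC_RELAXED);

	/* On failure 'old' is refreshed with the current value; retry. */
	while (!cas64_order(head, &old, old + 1,
			    __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
		;
	return old;
}
```

A caller that needs full sequential consistency can still pass
__ATOMIC_SEQ_CST for both arguments.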

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-23 16:02       ` Jerin Jacob Kollanukkaran
@ 2019-01-23 16:29         ` Ola Liljedahl
  2019-01-28 13:10           ` [dpdk-dev] [EXT] " Jerin Jacob Kollanukkaran
  0 siblings, 1 reply; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-23 16:29 UTC (permalink / raw)
  To: jerinj, gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

On Wed, 2019-01-23 at 16:02 +0000, Jerin Jacob Kollanukkaran wrote:
> On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> > 
> > On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > > 
> > > v3:
> > >  - Avoid the ABI break by putting 64-bit head and tail values in the
> > >    same cacheline as struct rte_ring's prod and cons members.
> > >  - Don't attempt to compile rte_atomic128_cmpset without
> > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > >    libraries.
> > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > >    someone tries to use RING_F_NB without the ALLOW_EXPERIMENTAL_API
> > >    flag.
> > >  - Update the ring mempool to use experimental APIs
> > >  - Clarify that RING_F_NB is only limited to x86_64 currently;
> > >    ARMv8.1-A builds can eventually support it with the CASP
> > >    instruction.
> > ARMv8.0 should be able to implement a 128-bit atomic compare exchange
> > operation using LDXP/STXP.
> Just wondering what the performance difference would be between CASP and
> LDXP/STXP on an LSE-capable machine?
I think that is up to the microarchitecture. But one of the ideas behind
introducing the LSE atomics was that they should be "better" than the equivalent
code using exclusives. I think non-conditional LDxxx and STxxx atomics could be
better than using exclusives, while conditional atomics (CAS, CASP) might not be
so different. The reason has to do with cache coherency: a core can
speculatively snoop-unique the cache line targeted by an atomic instruction, but
to what extent that provides a benefit could depend on whether the atomic
actually performs a store or not.

> 
> I think we cannot detect the presence of LSE support at compile time.
> Right?
Unfortunately, AFAIK GCC doesn't tell the source code that it is targeting
v8.1+ with LSE support. If there were intrinsics for certain LSE instructions
(those the compiler does not generate itself, e.g. STxxx and CASP), we could
use a corresponding preprocessor define to detect their presence. Such defines
exist for other intrinsics, e.g. __ARM_FEATURE_QRDMX for the SQRDMLAH/SQRDMLSH
instructions and their intrinsics.

I have tried to interest the Arm GCC developers in this but have not yet
succeeded. Perhaps if we have more use cases where atomics intrinsics would be
useful, we could convince them to add such intrinsics to the ACLE (ARM C
Language Extensions). But we will never get intrinsics for exclusives; they are
deemed unsafe for explicit use from C. Instead we need to provide inline
assembler that contains the complete exclusives sequence. But in practice,
using inline assembler for LDXR and STXR seems to work, as I do in the lockfree
code linked below.

> 
> A runtime check would be costly, e.g.:
Do you think so? Shouldn't this branch be perfectly predictable? Once in a while
it will fall out of the branch history table, but doesn't that mean the
application hasn't been executing this code for some time, so it is not really
performance critical?

> 
> if (hwcaps & HWCAP_ATOMICS) {
> 	casp
> } else {
> 	ldxp
> 	stxp
> }
> 
> > 
> > From an ARM perspective, I want all atomic operations to take memory
> > ordering arguments (e.g. acquire, release). Not all usages of e.g.
> +1
> 
> > 
> > atomic compare exchange require sequential consistency (which I think
> > what x86 cmpxchg instruction provides). DPDK functions should not be
> > modelled after x86 behaviour.
> > 
> > Lock-free 128-bit atomics implementations for ARM/AArch64 and x86-64
> > are available here:
> > https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> > 
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
                       ` (5 preceding siblings ...)
  2019-01-22  9:27     ` [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring Ola Liljedahl
@ 2019-01-25  5:20     ` Honnappa Nagarahalli
  2019-01-25 17:42       ` Eads, Gage
  2019-01-25 17:56       ` Eads, Gage
  2019-01-28 18:14     ` [dpdk-dev] [PATCH v4 " Gage Eads
  7 siblings, 2 replies; 123+ messages in thread
From: Honnappa Nagarahalli @ 2019-01-25  5:20 UTC (permalink / raw)
  To: Gage Eads, dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, nd, thomas, Ola Liljedahl,
	Gavin Hu (Arm Technology China), Song Zhu (Arm Technology China),
	nd

Hi Gage,
	Thank you for this patch. Arm (Ola Liljedahl) has worked on a non-blocking ring algorithm. We were planning to add it to DPDK at some point this year. I am wondering if you would be open to taking a look at the algorithm and collaborating?

I have yet to fully understand both algorithms. But Ola has reviewed your patch and can provide a quick overview of the differences here.

If you agree, we can send an RFC patch. You can review that and do performance benchmarking on your platforms. I can also benchmark your patch (maybe once you fix the issue identified in the __rte_ring_do_nb_enqueue_mp function?) on Arm platforms. Maybe we can end up with a better combined algorithm.

Hi Thomas/Bruce,
	Please let me know if this is ok and if there is a better way to do this.

Thank you,
Honnappa

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> Sent: Friday, January 18, 2019 9:23 AM
> To: dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> stephen@networkplumber.org
> Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> 
> For some users, the rte ring's "non-preemptive" constraint is not acceptable;
> for example, if the application uses a mixture of pinned high-priority threads
> and multiplexed low-priority threads that share a mempool.
> 
> This patchset introduces a non-blocking ring, on top of which a mempool can
> run.
> Crucially, the non-blocking algorithm relies on a 128-bit compare-and-swap,
> so it is currently limited to x86_64 machines. This is also an experimental API,
> so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.
> 
> The ring uses more compare-and-swap atomic operations than the regular rte
> ring:
> With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations
> and a dequeue of n pointers uses 2. This algorithm has worse average-case
> performance than the regular rte ring (particularly a highly-contended ring
> with large bulk accesses), however:
> - For applications with preemptible pthreads, the regular rte ring's worst-case
>   performance (i.e. one thread being preempted in the update_tail() critical
>   section) is much worse than the non-blocking ring's.
> - Software caching can mitigate the average case performance for ring-based
>   algorithms. For example, a non-blocking ring based mempool (a likely use
>   case for this ring) with per-thread caching.
> 
> The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-of-use,
> existing ring enqueue/dequeue functions work with both "regular" and
> non-blocking rings.
> 
> This patchset also adds non-blocking versions of ring_autotest and
> ring_perf_autotest, and a non-blocking ring based mempool.
> 
> This patchset makes one API change; a deprecation notice will be posted in a
> separate commit.
> 
> This patchset depends on the non-blocking stack patchset[1].
> 
> [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> 
> v3:
>  - Avoid the ABI break by putting 64-bit head and tail values in the same
>    cacheline as struct rte_ring's prod and cons members.
>  - Don't attempt to compile rte_atomic128_cmpset without
>    ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
>  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone
>    tries to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
>  - Update the ring mempool to use experimental APIs
>  - Clarify that RING_F_NB is only limited to x86_64 currently; ARMv8.1-A
>    builds can eventually support it with the CASP instruction.
> 
> v2:
>  - Merge separate docs commit into patch #5
>  - Convert uintptr_t to size_t
>  - Add a compile-time check for the size of size_t
>  - Fix a space-after-typecast issue
>  - Fix an unnecessary-parentheses checkpatch warning
>  - Bump librte_ring's library version
> 
> Gage Eads (5):
>   ring: add 64-bit headtail structure
>   ring: add a non-blocking implementation
>   test_ring: add non-blocking ring autotest
>   test_ring_perf: add non-blocking ring perf test
>   mempool/ring: add non-blocking ring handlers
> 
>  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
>  drivers/mempool/ring/Makefile                   |   1 +
>  drivers/mempool/ring/meson.build                |   2 +
>  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
>  lib/librte_eventdev/rte_event_ring.h            |   2 +-
>  lib/librte_ring/Makefile                        |   3 +-
>  lib/librte_ring/rte_ring.c                      |  72 ++-
>  lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
>  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
>  lib/librte_ring/rte_ring_version.map            |   7 +
>  test/test/test_ring.c                           |  57 ++-
>  test/test/test_ring_perf.c                      |  19 +-
>  12 files changed, 874 insertions(+), 75 deletions(-)
>  create mode 100644 lib/librte_ring/rte_ring_generic_64.h
> 
> --
> 2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-23 10:16           ` Ola Liljedahl
@ 2019-01-25 17:21             ` Eads, Gage
  2019-01-28 10:35               ` Ola Liljedahl
  2019-01-28 13:34               ` Jerin Jacob Kollanukkaran
  0 siblings, 2 replies; 123+ messages in thread
From: Eads, Gage @ 2019-01-25 17:21 UTC (permalink / raw)
  To: Ola Liljedahl, dev, jerinj, mczekaj
  Cc: olivier.matz, stephen, nd, Richardson, Bruce, arybchenko,
	Ananyev, Konstantin



> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> Sent: Wednesday, January 23, 2019 4:16 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> arybchenko@solarflare.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
> 
> On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> > Hi Ola,
> >
> > <snip>
> >
> > >
> > > >
> > > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > > > rte_ring *r);
> > > >  #endif
> > > >  #include "rte_ring_generic_64.h"
> > > >
> > > > +/* @internal 128-bit structure used by the non-blocking ring */
> > > > +struct nb_ring_entry {
> > > > +	void *ptr; /**< Data pointer */
> > > > +	uint64_t cnt; /**< Modification counter */
> > > > Why not make 'cnt' uintptr_t? This way 32-bit architectures will
> > > > also be supported. I think there are some claims that DPDK still
> > > > supports e.g. ARMv7a and possibly also 32-bit x86?
> > I chose a 64-bit modification counter because (practically speaking)
> > the ABA problem will not occur with such a large counter -- definitely
> > not within my lifetime. See the "Discussion" section of the commit
> > message for more information.
> >
> > With a 32-bit counter, there is a very (very) low likelihood of it,
> > but it is possible. Personally, I don't feel comfortable providing
> > such code, because a) I doubt all users would understand the
> > implementation well enough to do the risk/reward analysis, and b) such
> > a bug would be near impossible to reproduce and root-cause if it did occur.
> With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g. ARMv7a and
> probably x86 as well) won't be able to support this as they at best support 64-bit
> CAS (ARMv7a has LDREXD/STREXD). So you are essentially putting a 64-bit (and
> 128-bit CAS) requirement on the implementation.
> 

Yes, I am. I tried to make that clear in the cover letter.
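
For the counter-width trade-off quoted above, a back-of-the-envelope sketch of
how long a modification counter takes to wrap. The helper name wrap_seconds()
and the update rate of 1e9 per second are assumptions made for illustration,
not measurements:

```c
/* Seconds until a counter of 'bits' width wraps at 'rate' increments/sec. */
static double wrap_seconds(unsigned int bits, double rate)
{
	double span = 1.0;
	unsigned int i;

	for (i = 0; i < bits; i++)
		span *= 2.0; /* 2^bits distinct counter values */
	return span / rate;
}
```

At the assumed rate, wrap_seconds(32, 1e9) is a little over four seconds, while
wrap_seconds(64, 1e9) is several hundred years. A wrap alone is not an ABA
failure (a stalled thread must also observe the same pointer and the same
counter value again), but it bounds how long a thread has to be stalled for the
problem to become possible at all.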

> >
> > >
> > >
> > > >
> > > > +};
> > > > +
> > > > > +/* The non-blocking ring algorithm is based on the original rte ring (derived
> > > > > + * from FreeBSD's bufring.h) and inspired by Michael and Scott's non-blocking
> > > > > + * concurrent queue.
> > > > > + */
> > > > +
> > > > +/**
> > > > + * @internal
> > > > > + *   Enqueue several objects on the non-blocking ring (single-producer only)
> > > > > + *
> > > > > + * @param r
> > > > > + *   A pointer to the ring structure.
> > > > > + * @param obj_table
> > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > + * @param n
> > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > + * @param behavior
> > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> > > > > + * @param free_space
> > > > > + *   returns the amount of space after the enqueue operation has finished
> > > > + * @return
> > > > + *   Actual number of objects enqueued.
> > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
> > > > +			    unsigned int n,
> > > > +			    enum rte_ring_queue_behavior behavior,
> > > > +			    unsigned int *free_space)
> > > > +{
> > > > +	uint32_t free_entries;
> > > > +	size_t head, next;
> > > > +
> > > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > > +					 &head, &next, &free_entries);
> > > > +	if (n == 0)
> > > > +		goto end;
> > > > +
> > > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > > +
> > > > +	r->prod_64.tail += n;
> > > Don't we need release order when (or smp_wmb between) writing of the
> > > ring pointers and the update of tail? By updating the tail pointer,
> > > we are synchronising with a consumer.
> > >
> > > I prefer using __atomic operations even for load and store. You can
> > > see which parts of the code that synchronise with each other, e.g.
> > > store-release to some location synchronises with load-acquire from
> > > the same location. If you don't know how different threads
> > > synchronise with each other, you are very likely to make mistakes.
> > >
> > You can tell this code was written when I thought x86-64 was the only
> > viable target :). Yes, you are correct.
> >
> > With regards to using __atomic intrinsics, I'm planning on taking a
> > similar approach to the functions duplicated in rte_ring_generic.h and
> > rte_ring_c11_mem.h: one version that uses rte_atomic functions (and
> > thus stricter memory ordering) and one that uses __atomic intrinsics
> > (and thus can benefit from more relaxed memory ordering).
> What's the advantage of having two different implementations? What is the
> disadvantage?
> 
> The existing ring buffer code originally had only the "legacy" implementation
> which was kept when the __atomic implementation was added. The reason
> claimed was that some older compilers for x86 do not support GCC __atomic
> builtins. But I thought there was consensus that new functionality could have
> only __atomic implementations.
> 

When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left disabled for thunderx[1] for performance reasons. Assuming that hasn't changed, the advantage to having two versions is to best support all of DPDK's platforms. The disadvantage is of course duplicated code and the additional maintenance burden.

That said, if the thunderx maintainers are ok with it, I'm certainly open to only doing the __atomic version. Note that even in the __atomic version, based on Honnappa's findings[2], using a DPDK-defined rte_atomic128_cmpset() (with additional arguments to support machines with weak consistency) appears to be a better option than __atomic_compare_exchange_16.

I couldn't find the discussion about new functionality using __atomic going forward -- can you send a link?

[1] https://mails.dpdk.org/archives/dev/2017-December/082853.html
[2] http://mails.dpdk.org/archives/dev/2019-January/124002.html
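
For reference, the release/acquire pairing Ola describes for the
single-producer tail update can be sketched with the __atomic builtins as
below. This is a standalone single-producer/single-consumer toy with
hypothetical names (sk_ring, sk_enqueue, sk_dequeue), not the rte_ring code:

```c
#include <stddef.h>
#include <stdint.h>

#define SK_RING_SIZE 8u /* hypothetical size, must be a power of two */

struct sk_ring {
	uint64_t head; /* consumer index, written by the consumer only */
	uint64_t tail; /* producer index, written by the producer only */
	void *slots[SK_RING_SIZE];
};

/* Single producer: fill the slot, then publish it with a store-release. */
static int sk_enqueue(struct sk_ring *r, void *obj)
{
	uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_RELAXED);
	uint64_t head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);

	if (tail - head == SK_RING_SIZE)
		return 0; /* full */
	r->slots[tail & (SK_RING_SIZE - 1)] = obj;
	/* Release: the slot write above is visible before the new tail. */
	__atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
	return 1;
}

/* Single consumer: the load-acquire pairs with the store-release above. */
static void *sk_dequeue(struct sk_ring *r)
{
	uint64_t head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
	uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
	void *obj;

	if (head == tail)
		return NULL; /* empty */
	obj = r->slots[head & (SK_RING_SIZE - 1)];
	/* Release: the slot read completes before the slot can be reused. */
	__atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
	return obj;
}
```

Written this way, each store-release has a visible load-acquire partner on the
same location, which is the property being asked for in the review.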

> Does the non-blocking ring buffer implementation have to support these older
> compilers? Will the applications that require these older compiler be updated to
> utilise the non-blocking ring buffer?
> 

(See above -- compiler versions weren't a consideration here.)

> >
> > >
> > > >
> > > > +
> > > > +end:
> > > > +	if (free_space != NULL)
> > > > +		*free_space = free_entries - n;
> > > > +	return n;
> > > > +}
> > > > +
> > > > +/**
> > > > + * @internal
> > > > > + *   Enqueue several objects on the non-blocking ring (multi-producer safe)
> > > > > + *
> > > > > + * @param r
> > > > > + *   A pointer to the ring structure.
> > > > > + * @param obj_table
> > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > + * @param n
> > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > + * @param behavior
> > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
> > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
> > > > > + * @param free_space
> > > > > + *   returns the amount of space after the enqueue operation has finished
> > > > + * @return
> > > > + *   Actual number of objects enqueued.
> > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > > +__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
> > > > +			    unsigned int n,
> > > > +			    enum rte_ring_queue_behavior behavior,
> > > > +			    unsigned int *free_space)
> > > > +{
> > > > +#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
> > > > +	RTE_SET_USED(r);
> > > > +	RTE_SET_USED(obj_table);
> > > > +	RTE_SET_USED(n);
> > > > +	RTE_SET_USED(behavior);
> > > > +	RTE_SET_USED(free_space);
> > > > +#ifndef ALLOW_EXPERIMENTAL_API
> > > > +	printf("[%s()] RING_F_NB requires an experimental API."
> > > > +	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
> > > > +	       , __func__);
> > > > +#endif
> > > > +	return 0;
> > > > +#endif
> > > > +#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
> > > > +	size_t head, next, tail;
> > > > +	uint32_t free_entries;
> > > > +	unsigned int i;
> > > > +
> > > > +	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
> > > > +					 &head, &next, &free_entries);
> > > > +	if (n == 0)
> > > > +		goto end;
> > > > +
> > > > +	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
> > > > +		struct nb_ring_entry old_value, new_value;
> > > > +		struct nb_ring_entry *ring_ptr;
> > > > +
> > > > > +		/* Enqueue to the tail entry. If another thread wins the race,
> > > > > +		 * retry with the new tail.
> > > > > +		 */
> > > > > +		tail = r->prod_64.tail;
> > > > > +
> > > > > +		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r->mask];
> > > This is an ugly expression and cast. Also I think it is unnecessary.
> > > What's preventing this from being written without a cast? Perhaps
> > > the ring array needs to be a union of "void *" and struct
> > > nb_ring_entry?
> > The cast is necessary for the correct pointer arithmetic (let
> > "uintptr_t base == &r[1]"):
> Yes I know the C language.
> 
> > > - With cast: ring_ptr = base + sizeof(struct nb_ring_entry) * (tail & r->mask);
> > > - W/o cast: ring_ptr = base + sizeof(struct rte_ring) * (tail & r->mask);
> >
> > FWIW, this is essentially the same as is done with the second argument
> > (&r[1]) to ENQUEUE_PTRS and DEQUEUE_PTRS, but there it's split across
> > multiple lines of code. The equivalent here would be:
> >
> > struct nb_ring_entry *ring_base = (struct nb_ring_entry*)&r[1];
> > ring_ptr = &ring_base[tail & r->mask];
> >
> > Which is more legible, I think.
> The RTE ring buffer code is not very legible to start with.
> 
> >
> > There is no ring array structure in which to add a union; the ring
> > array is a contiguous chunk of memory that immediately follows after
> > the end of a struct rte_ring. We interpret the memory there according
> > to the ring entry data type (void * for regular rings and struct
> > nb_ring_entry for non-blocking rings).
> My worry is that we are abusing the C language and creating a monster of
> fragile C code that will be more and more difficult to understand and to
> maintain. At some point you have to ask the question "Are we doing the right
> thing?".
>

I'm not aware of any fragility/maintainability issues in the ring code (though perhaps the maintainers have a different view!), and personally I find the code fairly legible. If you have a specific suggestion, I'll look into incorporating it.
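
For readers following the pointer-arithmetic point, a standalone sketch of
entries laid out immediately after a header struct. The names sk_hdr, sk_entry,
sk_alloc and sk_slot are hypothetical stand-ins, not the actual rte_ring
layout:

```c
#include <stdint.h>
#include <stdlib.h>

struct sk_hdr {                /* stands in for struct rte_ring */
	uint32_t mask;
	uint32_t pad[3];       /* keep the header a multiple of 16 bytes */
};

struct sk_entry {              /* stands in for struct nb_ring_entry */
	void *ptr;
	uint64_t cnt;
};

/* Allocate a header immediately followed by 'count' zeroed entries. */
static struct sk_hdr *sk_alloc(uint32_t count)
{
	size_t len = sizeof(struct sk_hdr) + count * sizeof(struct sk_entry);
	struct sk_hdr *h = calloc(1, len);

	if (h != NULL)
		h->mask = count - 1; /* count must be a power of two */
	return h;
}

/* &h[1] is the first byte past the header; the cast makes indexing step by
 * sizeof(struct sk_entry) rather than sizeof(struct sk_hdr).
 */
static struct sk_entry *sk_slot(struct sk_hdr *h, uint64_t idx)
{
	struct sk_entry *base = (struct sk_entry *)&h[1];

	return &base[idx & h->mask];
}
```

Wrapping the cast in one small accessor like sk_slot() is one way to keep it
out of the enqueue and dequeue loops.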

Thanks,
Gage

</snip>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-25  5:20     ` [dpdk-dev] " Honnappa Nagarahalli
@ 2019-01-25 17:42       ` Eads, Gage
  2019-01-25 17:56       ` Eads, Gage
  1 sibling, 0 replies; 123+ messages in thread
From: Eads, Gage @ 2019-01-25 17:42 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	stephen, nd, thomas, Ola Liljedahl,
	Gavin Hu (Arm Technology China), Song Zhu (Arm Technology China),
	nd

Hi Honnappa,

Works for me -- I'm in favor of the best performing implementation, whoever provides it.

To allow an apples-to-apples comparison, I suggest Ola's/ARM's implementation be made to fit into the rte_ring API with an associated mempool handler. That'll allow us to use the existing ring and mempool performance tests as well. Feel free to use code from this patchset for the rte_ring integration, if that helps, of course.

I expect to have v4 available within the next week.

Thanks,
Gage

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-25  5:20     ` [dpdk-dev] " Honnappa Nagarahalli
  2019-01-25 17:42       ` Eads, Gage
@ 2019-01-25 17:56       ` Eads, Gage
  2019-01-28 10:41         ` Ola Liljedahl
  1 sibling, 1 reply; 123+ messages in thread
From: Eads, Gage @ 2019-01-25 17:56 UTC (permalink / raw)
  To: Honnappa Nagarahalli, dev
  Cc: olivier.matz, arybchenko, Richardson, Bruce, Ananyev, Konstantin,
	stephen, nd, thomas, Ola Liljedahl,
	Gavin Hu (Arm Technology China), Song Zhu (Arm Technology China),
	nd



> -----Original Message-----
> From: Eads, Gage
> Sent: Friday, January 25, 2019 11:43 AM
> To: 'Honnappa Nagarahalli' <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> <Song.Zhu@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> 
> Hi Honnappa,
> 
> Works for me -- I'm in favor of the best performing implementation, whoever
> provides it.
> 
> To allow an apples-to-apples comparison, I suggest Ola's/ARM's implementation
> be made to fit into the rte_ring API with an associated mempool handler. That'll
> allow us to use the existing ring and mempool performance tests as well. Feel
> free to use code from this patchset for the rte_ring integration, if that helps, of
> course.
> 

But also, if Ola/ARM's algorithm is sufficiently similar to this one, it's probably better to tweak this patchset's enqueue and dequeue functions with any improvements you can identify rather than creating an entirely separate implementation.

> I expect to have v4 available within the next week.
> 
> Thanks,
> Gage
> 
> > -----Original Message-----
> > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > Sent: Thursday, January 24, 2019 11:21 PM
> > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> > <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> > <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> > <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> > <Song.Zhu@arm.com>; nd <nd@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> >
> > Hi Gage,
> > 	Thank you for this patch. Arm (Ola Liljedahl) had worked on a non-
> > blocking ring algorithm. We were planning to add it to DPDK at some
> > point this year. I am wondering if you would be open to take a look at
> > the algorithm and collaborate?
> >
> > I am yet to fully understand both the algorithms. But, Ola has
> > reviewed your patch and can provide a quick overview of the differences here.
> >
> > If you agree, we can send a RFC patch. You can review that and do
> > performance benchmarking on your platforms. I can also benchmark your
> > patch (may be once you fix the issue identified in
> > __rte_ring_do_nb_enqueue_mp  function?) on Arm platforms. May be we can
> end up with a better combined algorithm.
> >
> > Hi Thomas/Bruce,
> > 	Please let me know if this is ok and if there is a better way to do this.
> >
> > Thank you,
> > Honnappa
> >
> > > -----Original Message-----
> > > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > > Sent: Friday, January 18, 2019 9:23 AM
> > > To: dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> > > stephen@networkplumber.org
> > > Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> > >
> > > For some users, the rte ring's "non-preemptive" constraint is not
> > > acceptable; for example, if the application uses a mixture of pinned
> > > high-priority threads and multiplexed low-priority threads that
> > > share a
> > mempool.
> > >
> > > This patchset introduces a non-blocking ring, on top of which a
> > > mempool can run.
> > > Crucially, the non-blocking algorithm relies on a 128-bit
> > > compare-and-swap, so it is currently limited to x86_64 machines.
> > > This is also an experimental API, so RING_F_NB users must build with
> > > the
> > ALLOW_EXPERIMENTAL_API flag.
> > >
> > > The ring uses more compare-and-swap atomic operations than the
> > > regular rte
> > > ring:
> > > With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> > > operations and a dequeue of n pointers uses 2. This algorithm has
> > > worse average-case performance than the regular rte ring
> > > (particularly a highly-contended ring with large bulk accesses), however:
> > > - For applications with preemptible pthreads, the regular rte ring's worst-
> case
> > >   performance (i.e. one thread being preempted in the update_tail() critical
> > >   section) is much worse than the non-blocking ring's.
> > > - Software caching can mitigate the average case performance for ring-
> based
> > >   algorithms. For example, a non-blocking ring based mempool (a
> > > likely use case
> > >   for this ring) with per-thread caching.
> > >
> > > The non-blocking ring is enabled via a new flag, RING_F_NB. For
> > > ease-of-use, existing ring enqueue/dequeue functions work with both
> > > "regular" and non- blocking rings.
> > >
> > > This patchset also adds non-blocking versions of ring_autotest and
> > > ring_perf_autotest, and a non-blocking ring based mempool.
> > >
> > > This patchset makes one API change; a deprecation notice will be
> > > posted in a separate commit.
> > >
> > > This patchset depends on the non-blocking stack patchset[1].
> > >
> > > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> > >
> > > v3:
> > >  - Avoid the ABI break by putting 64-bit head and tail values in the same
> > >    cacheline as struct rte_ring's prod and cons members.
> > >  - Don't attempt to compile rte_atomic128_cmpset without
> > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> libraries.
> > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > > someone tries
> > >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> > >  - Update the ring mempool to use experimental APIs
> > >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > > ARMv8.1-A builds
> > >    can eventually support it with the CASP instruction.
> > >
> > > v2:
> > >  - Merge separate docs commit into patch #5
> > >  - Convert uintptr_t to size_t
> > >  - Add a compile-time check for the size of size_t
> > >  - Fix a space-after-typecast issue
> > >  - Fix an unnecessary-parentheses checkpatch warning
> > >  - Bump librte_ring's library version
> > >
> > > Gage Eads (5):
> > >   ring: add 64-bit headtail structure
> > >   ring: add a non-blocking implementation
> > >   test_ring: add non-blocking ring autotest
> > >   test_ring_perf: add non-blocking ring perf test
> > >   mempool/ring: add non-blocking ring handlers
> > >
> > >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
> > >  drivers/mempool/ring/Makefile                   |   1 +
> > >  drivers/mempool/ring/meson.build                |   2 +
> > >  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
> > >  lib/librte_eventdev/rte_event_ring.h            |   2 +-
> > >  lib/librte_ring/Makefile                        |   3 +-
> > >  lib/librte_ring/rte_ring.c                      |  72 ++-
> > >  lib/librte_ring/rte_ring.h                      | 574 ++++++++++++++++++++++--
> > >  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
> > >  lib/librte_ring/rte_ring_version.map            |   7 +
> > >  test/test/test_ring.c                           |  57 ++-
> > >  test/test/test_ring_perf.c                      |  19 +-
> > >  12 files changed, 874 insertions(+), 75 deletions(-)  create mode
> > > 100644 lib/librte_ring/rte_ring_generic_64.h
> > >
> > > --
> > > 2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-25 17:21             ` Eads, Gage
@ 2019-01-28 10:35               ` Ola Liljedahl
  2019-01-28 18:54                 ` Eads, Gage
  2019-01-28 13:34               ` Jerin Jacob Kollanukkaran
  1 sibling, 1 reply; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-28 10:35 UTC (permalink / raw)
  To: jerinj, mczekaj, gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> 
> > 
> > -----Original Message-----
> > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> > Sent: Wednesday, January 23, 2019 4:16 AM
> > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> > <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> > arybchenko@solarflare.com; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > implementation
> > 
> > On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> > > 
> > > Hi Ola,
> > > 
> > > <snip>
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > > > > rte_ring *r);
> > > > >  #endif
> > > > >  #include "rte_ring_generic_64.h"
> > > > > 
> > > > > +/* @internal 128-bit structure used by the non-blocking ring */
> > > > > +struct nb_ring_entry {
> > > > > +	void *ptr; /**< Data pointer */
> > > > > +	uint64_t cnt; /**< Modification counter */
> > > > Why not make 'cnt' uintptr_t? This way 32-bit architectures will
> > > > also be supported. I think there are some claims that DPDK still
> > > > supports e.g.
> > > > ARMv7a
> > > > and possibly also 32-bit x86?
> > > I chose a 64-bit modification counter because (practically speaking)
> > > the ABA problem will not occur with such a large counter -- definitely
> > > not within my lifetime. See the "Discussion" section of the commit
> > > message for more information.
> > > 
> > > With a 32-bit counter, there is a very (very) low likelihood of it,
> > > but it is possible. Personally, I don't feel comfortable providing
> > > such code, because a) I doubt all users would understand the
> > > implementation well enough to do the risk/reward analysis, and b) such
> > > a bug would be near impossible to reproduce and root-cause if it did
> > > occur.
> > With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g.
> > ARMv7a and
> > probably x86 as well) won't be able to support this as they at best support
> > 64-bit
> > CAS (ARMv7a has LDREXD/STREXD). So you are essentially putting a 64-bit (and
> > 128-bit CAS) requirement on the implementation.
> > 
> Yes, I am. I tried to make that clear in the cover letter.
> 
> > 
> > > 
> > > 
> > > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > +};
> > > > > +
> > > > > +/* The non-blocking ring algorithm is based on the original rte
> > > > > +ring (derived
> > > > > + * from FreeBSD's bufring.h) and inspired by Michael and Scott's
> > > > > +non-blocking
> > > > > + * concurrent queue.
> > > > > + */
> > > > > +
> > > > > +/**
> > > > > + * @internal
> > > > > + *   Enqueue several objects on the non-blocking ring
> > > > > +(single-producer only)
> > > > > + *
> > > > > + * @param r
> > > > > + *   A pointer to the ring structure.
> > > > > + * @param obj_table
> > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > + * @param n
> > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > + * @param behavior
> > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to
> > > > > +the ring
> > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible
> > > > > +to the ring
> > > > > + * @param free_space
> > > > > + *   returns the amount of space after the enqueue operation has
> > > > > +finished
> > > > > + * @return
> > > > > + *   Actual number of objects enqueued.
> > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > + */
> > > > > +static __rte_always_inline unsigned int
> > > > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const
> > > > > *obj_table,
> > > > > +			    unsigned int n,
> > > > > +			    enum rte_ring_queue_behavior behavior,
> > > > > +			    unsigned int *free_space)
> > > > > +{
> > > > > +	uint32_t free_entries;
> > > > > +	size_t head, next;
> > > > > +
> > > > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > > > +					 &head, &next,
> > > > > &free_entries);
> > > > > +	if (n == 0)
> > > > > +		goto end;
> > > > > +
> > > > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > > > +
> > > > > +	r->prod_64.tail += n;
> > > > Don't we need release order when (or smp_wmb between) writing of the
> > > > ring pointers and the update of tail? By updating the tail pointer,
> > > > we are synchronising with a consumer.
> > > > 
> > > > I prefer using __atomic operations even for load and store. You can
> > > > see which parts of the code that synchronise with each other, e.g.
> > > > store-release to some location synchronises with load-acquire from
> > > > the same location. If you don't know how different threads
> > > > synchronise with each other, you are very likely to make mistakes.
> > > > 
> > > You can tell this code was written when I thought x86-64 was the only
> > > viable target :). Yes, you are correct.
> > > 
> > > With regards to using __atomic intrinsics, I'm planning on taking a
> > > similar approach to the functions duplicated in rte_ring_generic.h and
> > > rte_ring_c11_mem.h: one version that uses rte_atomic functions (and
> > > thus stricter memory ordering) and one that uses __atomic intrinsics
> > > (and thus can benefit from more relaxed memory ordering).
From a code point of view, I strongly prefer the atomic operations to be visible
in the top level code, not hidden in subroutines. For correctness, it is vital
that memory accesses are performed with the required ordering and that acquire
and release match up. Hiding e.g. load-acquire and store-release in
subroutines (in a different file!) makes this difficult. There have already been
such bugs found in rte_ring.
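
To illustrate the pairing described above, here is a minimal sketch (not code
from the patch) of a single-producer/single-consumer ring in which every
store-release and its matching load-acquire are visible in the top-level
functions. The names sketch_ring, sketch_sp_enqueue, sketch_sc_dequeue and
SKETCH_RING_SIZE are hypothetical; only the GCC/Clang __atomic builtins are
real.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RING_SIZE 8u /* hypothetical size; must be a power of two */

struct sketch_ring {
	uint64_t head; /* consumer position, written by the consumer only */
	uint64_t tail; /* producer position, written by the producer only */
	void *slots[SKETCH_RING_SIZE];
};

/* Single-producer enqueue of one pointer. Returns 1 on success, 0 if full. */
static int
sketch_sp_enqueue(struct sketch_ring *r, void *obj)
{
	assert(r != NULL);

	uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_RELAXED);
	/* Acquire: pairs with the store-release of head in the dequeue, so the
	 * consumer has finished reading a slot before we overwrite it.
	 */
	uint64_t head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);

	if (tail - head == SKETCH_RING_SIZE)
		return 0;

	r->slots[tail & (SKETCH_RING_SIZE - 1)] = obj;

	/* Release: the slot write above becomes visible before the new tail.
	 * Pairs with the load-acquire of tail in the dequeue.
	 */
	__atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
	return 1;
}

/* Single-consumer dequeue of one pointer. Returns 1 on success, 0 if empty. */
static int
sketch_sc_dequeue(struct sketch_ring *r, void **obj)
{
	assert(r != NULL && obj != NULL);

	uint64_t head = __atomic_load_n(&r->head, __ATOMIC_RELAXED);
	/* Acquire: pairs with the store-release of tail in the enqueue. */
	uint64_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);

	if (head == tail)
		return 0;

	*obj = r->slots[head & (SKETCH_RING_SIZE - 1)];

	/* Release: the slot read above completes before the slot is handed
	 * back to the producer.
	 */
	__atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
	return 1;
}
```

Each release store has exactly one acquire load on the same location in the
other function, which is the property that is hard to check when the accesses
are hidden in helpers in another file.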

> > What's the advantage of having two different implementations? What is the
> > disadvantage?
> > 
> > The existing ring buffer code originally had only the "legacy"
> > implementation
> > which was kept when the __atomic implementation was added. The reason
> > claimed was that some older compilers for x86 do not support GCC __atomic
> > builtins. But I thought there was consensus that new functionality could
> > have
> > only __atomic implementations.
> > 
> When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left disabled
> for thunderx[1] for performance reasons. Assuming that hasn't changed, the
> advantage to having two versions is to best support all of DPDK's platforms.
> The disadvantage is of course duplicated code and the additional maintenance
> burden.
The only way I see that a C11 memory model implementation can be slower than
using smp_wmb/rmb is if you need to order loads before a synchronizing store and
there are also outstanding stores which do not require ordering. smp_rmb()
handles this while store-release will also (unnecessarily) order those
outstanding stores. This situation occurs e.g. in ring buffer dequeue operations
where ring slots are read (and possibly written to thread-private memory) before
the ring slots are released (e.g. using CAS-release or store-release).

I imagine that the LSU/cache subsystem on ThunderX/OCTEON-TX also has something
to do with this problem. If there is a large number of stores pending in the
load/store unit, store-release might have to wait for a long time before the
synchronizing store can complete.

> 
> That said, if the thunderx maintainers are ok with it, I'm certainly open to
> only doing the __atomic version. Note that even in the __atomic version, based
> on Honnapa's findings[2], using a DPDK-defined rte_atomic128_cmpset() (with
> additional arguments to support machines with weak consistency) appears to be
> a better option than __atomic_compare_exchange_16.
__atomic_compare_exchange_16() is not guaranteed to be lock-free. It is not
lock-free on ARM/AArch64 and the support in GCC is formally broken (can't use
cmpxchg16b to implement __atomic_load_16).

So yes, I think DPDK will have to define and implement the 128-bit atomic
compare and exchange operation (whatever it will be called). For compatibility
with ARMv8.0, we can't require the "old" value returned by a failed
compare-exchange operation to be read atomically (LDXP does not guarantee
atomicity by itself). But this is seldom a problem: many designs read the memory
location using two separate 64-bit loads (so not atomic) anyway; it is a
successful atomic compare-exchange operation which provides atomicity.
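
The point about the "old" value can be sketched as follows. This is a
scaled-down illustration, not the proposed DPDK API: the entry is a 64-bit
word holding a 32-bit payload and a 32-bit modification counter, updated with
a 64-bit CAS, so that it builds on any target with the GCC/Clang __atomic
builtins. In the real ring the entry would be 128 bits wide and the initial
read might be torn. The names sketch_pack, sketch_payload, sketch_cnt and
sketch_entry_swap are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scaled-down entry: low 32 bits payload, high 32 bits counter. */
static inline uint64_t
sketch_pack(uint32_t payload, uint32_t cnt)
{
	return ((uint64_t)cnt << 32) | payload;
}

static inline uint32_t
sketch_payload(uint64_t entry)
{
	return (uint32_t)entry;
}

static inline uint32_t
sketch_cnt(uint64_t entry)
{
	return (uint32_t)(entry >> 32);
}

/* Replace the payload and bump the counter; returns the previous payload. */
static uint32_t
sketch_entry_swap(uint64_t *entry, uint32_t new_payload)
{
	assert(entry != NULL);

	/* The expected value is only a hint. In the 128-bit case it could be
	 * read with two separate 64-bit loads; a stale or torn value merely
	 * makes the first CAS fail.
	 */
	uint64_t expected = __atomic_load_n(entry, __ATOMIC_RELAXED);
	uint64_t desired;

	do {
		desired = sketch_pack(new_payload, sketch_cnt(expected) + 1);
		/* On failure the builtin refreshes 'expected' and we retry. */
	} while (!__atomic_compare_exchange_n(entry, &expected, desired,
					      0 /* strong */,
					      __ATOMIC_RELEASE,
					      __ATOMIC_RELAXED));

	/* Only now, after a successful CAS, is 'expected' known to be the
	 * value that was atomically replaced.
	 */
	return sketch_payload(expected);
}
```

The memory-order arguments are passed explicitly, which is the shape argued
for above; the orders chosen here are just one plausible combination.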

> 
> I couldn't find the discussion about new functionality using __atomic going
> forward -- can you send a link?
> 
> [1] https://mails.dpdk.org/archives/dev/2017-December/082853.html
> [2] http://mails.dpdk.org/archives/dev/2019-January/124002.html
> 
> > 
> > Does the non-blocking ring buffer implementation have to support these older
> > compilers? Will the applications that require these older compiler be
> > updated to
> > utilise the non-blocking ring buffer?
> > 
> (See above -- compiler versions wasn't a consideration here.)
> 
> > 
> > > 
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > +
> > > > > +end:
> > > > > +	if (free_space != NULL)
> > > > > +		*free_space = free_entries - n;
> > > > > +	return n;
> > > > > +}
> > > > > +
> > > > > +/**
> > > > > + * @internal
> > > > > + *   Enqueue several objects on the non-blocking ring
> > > > > +(multi-producer
> > > > > +safe)
> > > > > + *
> > > > > + * @param r
> > > > > + *   A pointer to the ring structure.
> > > > > + * @param obj_table
> > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > + * @param n
> > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > + * @param behavior
> > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to
> > > > > +the ring
> > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible
> > > > > +to the ring
> > > > > + * @param free_space
> > > > > + *   returns the amount of space after the enqueue operation has
> > > > > +finished
> > > > > + * @return
> > > > > + *   Actual number of objects enqueued.
> > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > + */
> > > > > +static __rte_always_inline unsigned int
> > > > > +__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const
> > *obj_table,
> > > 
> > > > 
> > > > > 
> > > > > +			    unsigned int n,
> > > > > +			    enum rte_ring_queue_behavior behavior,
> > > > > +			    unsigned int *free_space)
> > > > > +{
> > > > > +#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
> > > > > +	RTE_SET_USED(r);
> > > > > +	RTE_SET_USED(obj_table);
> > > > > +	RTE_SET_USED(n);
> > > > > +	RTE_SET_USED(behavior);
> > > > > +	RTE_SET_USED(free_space);
> > > > > +#ifndef ALLOW_EXPERIMENTAL_API
> > > > > +	printf("[%s()] RING_F_NB requires an experimental API."
> > > > > +	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
> > > > > +	       , __func__);
> > > > > +#endif
> > > > > +	return 0;
> > > > > +#endif
> > > > > +#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
> > > > > +	size_t head, next, tail;
> > > > > +	uint32_t free_entries;
> > > > > +	unsigned int i;
> > > > > +
> > > > > +	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
> > > > > +					 &head, &next,
> > > > > &free_entries);
> > > > > +	if (n == 0)
> > > > > +		goto end;
> > > > > +
> > > > > +	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
> > > > > +		struct nb_ring_entry old_value, new_value;
> > > > > +		struct nb_ring_entry *ring_ptr;
> > > > > +
> > > > > +		/* Enqueue to the tail entry. If another thread wins
> > > > > the
> > > > > race,
> > > > > +		 * retry with the new tail.
> > > > > +		 */
> > > > > +		tail = r->prod_64.tail;
> > > > > +
> > > > > +		ring_ptr = &((struct nb_ring_entry *)&r[1])[tail & r-
> > > > > > 
> > > > > > mask];
> > > > This is an ugly expression and cast. Also I think it is unnecessary.
> > > > What's preventing this from being written without a cast? Perhaps
> > > > the ring array needs to be a union of "void *" and struct
> > > > nb_ring_entry?
> > > The cast is necessary for the correct pointer arithmetic (let
> > > "uintptr_t base == &r[1]"):
> > Yes I know the C language.
> > 
> > > 
> > > - With cast: ring_ptr = base + sizeof(struct nb_ring_entry) * (tail &
> > > r-
> > > > 
> > > > mask);
> > > - W/o cast: ring_ptr = base + sizeof(struct rte_ring) * (tail &
> > > r->mask);
> > > 
> > > FWIW, this is essentially the same as is done with the second argument
> > > (&r[1]) to ENQUEUE_PTRS and DEQUEUE_PTRS, but there it's split across
> > > multiple lines of code. The equivalent here would be:
> > > 
> > > struct nb_ring_entry *ring_base = (struct nb_ring_entry*)&r[1];
> > > ring_ptr = &ring_base[tail & r->mask];
> > > 
> > > Which is more legible, I think.
> > The RTE ring buffer code is not very legible to start with.
> > 
> > > 
> > > 
> > > There is no ring array structure in which to add a union; the ring
> > > array is a contiguous chunk of memory that immediately follows after
> > > the end of a struct rte_ring. We interpret the memory there according
> > > to the ring entry data type (void * for regular rings and struct
> > > nb_ring_entry for
> > non-blocking rings).
> > My worry is that we are abusing the C language and creating a monster of
> > fragile C code that will be more and more difficult to understand and to
> > maintain. At some point you have to think the question "Are we doing the
> > right
> > thing?".
> > 
> I'm not aware of any fragility/maintainability issues in the ring code (though
> perhaps the maintainers have a different view!), and personally I find the
> code fairly legible. If you have a specific suggestion, I'll look into
> incorporating it.
> 
> Thanks,
> Gage
> 
> </snip>
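
For readers following the cast discussion above, a minimal sketch of the
layout in question: a header immediately followed by an array of entries, with
the array base obtained from &r[1]. The types sketch_hdr and sketch_entry and
the helpers sketch_create and sketch_slot are hypothetical stand-ins, not the
rte_ring definitions, and the sketch makes no attempt to provide the 16-byte
alignment that a 128-bit CAS would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Stand-in for struct rte_ring: only the fields the sketch needs. */
struct sketch_hdr {
	uint32_t size;
	uint32_t mask;
};

/* Mirrors the pointer-plus-counter entry from the patch. */
struct sketch_entry {
	void *ptr;
	uint64_t cnt;
};

/* Allocate the header and 'size' entries in one contiguous chunk. */
static struct sketch_hdr *
sketch_create(uint32_t size)
{
	assert(size != 0 && (size & (size - 1)) == 0); /* power of two */

	struct sketch_hdr *r =
		calloc(1, sizeof(*r) + (size_t)size * sizeof(struct sketch_entry));
	if (r != NULL) {
		r->size = size;
		r->mask = size - 1;
	}
	return r;
}

/* &r[1] is the first byte after the header. The cast makes the indexing
 * below step by sizeof(struct sketch_entry) instead of by
 * sizeof(struct sketch_hdr).
 */
static struct sketch_entry *
sketch_slot(struct sketch_hdr *r, uint64_t idx)
{
	struct sketch_entry *base = (struct sketch_entry *)&r[1];

	return &base[idx & r->mask];
}
```

Splitting the cast from the indexing, as in sketch_slot(), keeps the pointer
arithmetic in one place.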
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
  2019-01-25 17:56       ` Eads, Gage
@ 2019-01-28 10:41         ` Ola Liljedahl
  0 siblings, 0 replies; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-28 10:41 UTC (permalink / raw)
  To: Honnappa Nagarahalli, gage.eads, dev
  Cc: nd, bruce.richardson, thomas, konstantin.ananyev,
	Song Zhu (Arm Technology China),
	stephen, olivier.matz, arybchenko,
	Gavin Hu (Arm Technology China)

On Fri, 2019-01-25 at 17:56 +0000, Eads, Gage wrote:
> 
> > 
> > -----Original Message-----
> > From: Eads, Gage
> > Sent: Friday, January 25, 2019 11:43 AM
> > To: 'Honnappa Nagarahalli' <Honnappa.Nagarahalli@arm.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson, Bruce
> > <bruce.richardson@intel.com>; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> > <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> > <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> > <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> > <Song.Zhu@arm.com>; nd <nd@arm.com>
> > Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> > 
> > Hi Honnappa,
> > 
> > Works for me -- I'm in favor of the best performing implementation, whoever
> > provides it.
> > 
> > To allow an apples-to-apples comparison, I suggest Ola's/ARM's
> > implementation
> > be made to fit into the rte_ring API with an associated mempool handler.
> > That'll
> > allow us to use the existing ring and mempool performance tests as well.
> > Feel
> > free to use code from this patchset for the rte_ring integration, if that
> > helps, of
> > course.
> > 
> But also, if Ola/ARM's algorithm is sufficiently similar to this one, it's
> probably better to tweak this patchset's enqueue and dequeue functions with
> any improvements you can identify rather than creating an entirely separate
> implementation.
There are strong similarities. But my implementation is separate from rte_ring
(whose code is a mess) which also freed me from any interoperability with the
rte_ring code and data structure (with two pairs of head+tail which is
unnecessary for the lock-free ring buffer).

My design and implementation is here:
https://github.com/ARM-software/progress64/blob/master/src/p64_lfring.c
I have a DPDK version in flight. Merging the relevant changes into your patch
makes sense. There are some differences we will have to agree on.

> 
> > 
> > I expect to have v4 available within the next week.
> > 
> > Thanks,
> > Gage
> > 
> > > 
> > > -----Original Message-----
> > > From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> > > Sent: Thursday, January 24, 2019 11:21 PM
> > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com; Richardson,
> > > Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>; stephen@networkplumber.org; nd
> > > <nd@arm.com>; thomas@monjalon.net; Ola Liljedahl
> > > <Ola.Liljedahl@arm.com>; Gavin Hu (Arm Technology China)
> > > <Gavin.Hu@arm.com>; Song Zhu (Arm Technology China)
> > > <Song.Zhu@arm.com>; nd <nd@arm.com>
> > > Subject: RE: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> > > 
> > > Hi Gage,
> > > 	Thank you for this patch. Arm (Ola Liljedahl) had worked on a non-
> > > blocking ring algorithm. We were planning to add it to DPDK at some
> > > point this year. I am wondering if you would be open to take a look at
> > > the algorithm and collaborate?
> > > 
> > > I am yet to fully understand both the algorithms. But, Ola has
> > > reviewed your patch and can provide a quick overview of the differences
> > > here.
> > > 
> > > If you agree, we can send a RFC patch. You can review that and do
> > > performance benchmarking on your platforms. I can also benchmark your
> > > patch (may be once you fix the issue identified in
> > > __rte_ring_do_nb_enqueue_mp  function?) on Arm platforms. May be we can
> > end up with a better combined algorithm.
> > > 
> > > 
> > > Hi Thomas/Bruce,
> > > 	Please let me know if this is ok and if there is a better way to do
> > > this.
> > > 
> > > Thank you,
> > > Honnappa
> > > 
> > > > 
> > > > -----Original Message-----
> > > > From: dev <dev-bounces@dpdk.org> On Behalf Of Gage Eads
> > > > Sent: Friday, January 18, 2019 9:23 AM
> > > > To: dev@dpdk.org
> > > > Cc: olivier.matz@6wind.com; arybchenko@solarflare.com;
> > > > bruce.richardson@intel.com; konstantin.ananyev@intel.com;
> > > > stephen@networkplumber.org
> > > > Subject: [dpdk-dev] [PATCH v3 0/5] Add non-blocking ring
> > > > 
> > > > For some users, the rte ring's "non-preemptive" constraint is not
> > > > acceptable; for example, if the application uses a mixture of pinned
> > > > high-priority threads and multiplexed low-priority threads that
> > > > share a
> > > mempool.
> > > > 
> > > > 
> > > > This patchset introduces a non-blocking ring, on top of which a
> > > > mempool can run.
> > > > Crucially, the non-blocking algorithm relies on a 128-bit
> > > > compare-and-swap, so it is currently limited to x86_64 machines.
> > > > This is also an experimental API, so RING_F_NB users must build with
> > > > the
> > > ALLOW_EXPERIMENTAL_API flag.
> > > > 
> > > > 
> > > > The ring uses more compare-and-swap atomic operations than the
> > > > regular rte
> > > > ring:
> > > > With no contention, an enqueue of n pointers uses (1 + 2n) CAS
> > > > operations and a dequeue of n pointers uses 2. This algorithm has
> > > > worse average-case performance than the regular rte ring
> > > > (particularly a highly-contended ring with large bulk accesses),
> > > > however:
> > > > - For applications with preemptible pthreads, the regular rte ring's
> > > > worst-
> > case
> > > 
> > > > 
> > > >   performance (i.e. one thread being preempted in the update_tail()
> > > > critical
> > > >   section) is much worse than the non-blocking ring's.
> > > > - Software caching can mitigate the average case performance for ring-
> > based
> > > 
> > > > 
> > > >   algorithms. For example, a non-blocking ring based mempool (a
> > > > likely use case
> > > >   for this ring) with per-thread caching.
> > > > 
> > > > The non-blocking ring is enabled via a new flag, RING_F_NB. For
> > > > ease-of-use, existing ring enqueue/dequeue functions work with both
> > > > "regular" and non- blocking rings.
> > > > 
> > > > This patchset also adds non-blocking versions of ring_autotest and
> > > > ring_perf_autotest, and a non-blocking ring based mempool.
> > > > 
> > > > This patchset makes one API change; a deprecation notice will be
> > > > posted in a separate commit.
> > > > 
> > > > This patchset depends on the non-blocking stack patchset[1].
> > > > 
> > > > [1] http://mails.dpdk.org/archives/dev/2019-January/123653.html
> > > > 
> > > > v3:
> > > >  - Avoid the ABI break by putting 64-bit head and tail values in the
> > > > same
> > > >    cacheline as struct rte_ring's prod and cons members.
> > > >  - Don't attempt to compile rte_atomic128_cmpset without
> > > >    ALLOW_EXPERIMENTAL_API, as this would break a large number of
> > libraries.
> > > 
> > > > 
> > > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case
> > > > someone tries
> > > >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> > > >  - Update the ring mempool to use experimental APIs
> > > >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > > > ARMv8.1-A builds
> > > >    can eventually support it with the CASP instruction.
> > > > 
> > > > v2:
> > > >  - Merge separate docs commit into patch #5
> > > >  - Convert uintptr_t to size_t
> > > >  - Add a compile-time check for the size of size_t
> > > >  - Fix a space-after-typecast issue
> > > >  - Fix an unnecessary-parentheses checkpatch warning
> > > >  - Bump librte_ring's library version
> > > > 
> > > > Gage Eads (5):
> > > >   ring: add 64-bit headtail structure
> > > >   ring: add a non-blocking implementation
> > > >   test_ring: add non-blocking ring autotest
> > > >   test_ring_perf: add non-blocking ring perf test
> > > >   mempool/ring: add non-blocking ring handlers
> > > > 
> > > >  doc/guides/prog_guide/env_abstraction_layer.rst |   2 +-
> > > >  drivers/mempool/ring/Makefile                   |   1 +
> > > >  drivers/mempool/ring/meson.build                |   2 +
> > > >  drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
> > > >  lib/librte_eventdev/rte_event_ring.h            |   2 +-
> > > >  lib/librte_ring/Makefile                        |   3 +-
> > > >  lib/librte_ring/rte_ring.c                      |  72 ++-
> > > >  lib/librte_ring/rte_ring.h                      | 574
> > > > ++++++++++++++++++++++--
> > > >  lib/librte_ring/rte_ring_generic_64.h           | 152 +++++++
> > > >  lib/librte_ring/rte_ring_version.map            |   7 +
> > > >  test/test/test_ring.c                           |  57 ++-
> > > >  test/test/test_ring_perf.c                      |  19 +-
> > > >  12 files changed, 874 insertions(+), 75 deletions(-)  create mode
> > > > 100644 lib/librte_ring/rte_ring_generic_64.h
> > > > 
> > > > --
> > > > 2.13.6
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [EXT] Re:  [PATCH v3 0/5] Add non-blocking ring
  2019-01-23 16:29         ` Ola Liljedahl
@ 2019-01-28 13:10           ` Jerin Jacob Kollanukkaran
  0 siblings, 0 replies; 123+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-01-28 13:10 UTC (permalink / raw)
  To: Ola.Liljedahl, gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

On Wed, 2019-01-23 at 16:29 +0000, Ola Liljedahl wrote:
> On Wed, 2019-01-23 at 16:02 +0000, Jerin Jacob Kollanukkaran wrote:
> > On Tue, 2019-01-22 at 09:27 +0000, Ola Liljedahl wrote:
> > > On Fri, 2019-01-18 at 09:23 -0600, Gage Eads wrote:
> > > > v3:
> > > >  - Avoid the ABI break by putting 64-bit head and tail values
> > > > in
> > > > the
> > > > same
> > > >    cacheline as struct rte_ring's prod and cons members.
> > > >  - Don't attempt to compile rte_atomic128_cmpset without
> > > >    ALLOW_EXPERIMENTAL_API, as this would break a large number
> > > > of
> > > > libraries.
> > > >  - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in
> > > > case
> > > > someone tries
> > > >    to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
> > > >  - Update the ring mempool to use experimental APIs
> > > >  - Clarify that RINB_F_NB is only limited to x86_64 currently;
> > > > ARMv8.1-A builds
> > > >    can eventually support it with the CASP instruction.
> > > ARMv8.0 should be able to implement a 128-bit atomic compare
> > > exchange
> > > operation using LDXP/STXP.
> > Just wondering what would the performance difference between CASP
> > vs
> > LDXP/STXP on LSE supported machine?
> I think that is up to the microarchitecture. But one the ideas behind

Yes. This is where things get a little messy for generic code, since a lot of
behaviour is micro-architecture dependent / IMPLEMENTATION DEFINED per the Arm
spec. At least, I am dealing with three different micro-architectures now,
with a lot of differences between them. Including the Arm cores and Qualcomm
cores, there could be more than six or so different micro-architectures.


> introducing the LSE atomics was that they should be "better" than the
> equivalent
> code using exclusives. I think non-conditional LDxxx and STxxx
> atomics could be
> better than using exclusives while conditional atomics (CAS, CASP)
> might not be
> so different (the reason has to do with cache coherency, a core can
> speculatively snoop-unique the cache line which is targetted by an
> atomic
> instruction but to what extent that provides a benefit could be
> depend on
> whether the atomic actually performs a store or not).
> 
> > I think, We can not detect the presese of LSE support in compile
> > time.
> > Right?
> Unfortunately, AFAIK GCC doesn't notify the source code that it is
> targetting
> v8.1+ with LSE support. If there were intrinsics for (certain) LSE
> instructions
> (e.g. those not generated by the compiler, e.g. STxxx and CASP), we
> could use
> some corresponding preprocessor define to detect the presence of such
> intrinsics
> (they exist for other intrinsics, e.g. __ARM_FEATURE_QRDMX for
> SQRDMLAH/SQRDMLSH
> instructions and corresponding intrinsics).
> 
> I have tried to interest the Arm GCC developers in this but have not
> yet
> succeeded. Perhaps if we have more use cases were atomics intrinsics
> would be
> useful, we could convince them to add such intrinsics to the ACLE
> (ARM C
> Language Extensions). But we will never get intrinsics for
> exclusives, they are
> deemed unsafe for explicit use from C. Instead need to provide inline
> assembler
> that contains the complete exclusives sequence. But in practice it
> seems to work
> with using inline assembler for LDXR and STXR as I do in the lockfree
> code
> linked below.
> 
> > The dynamic one will be costly like,
> Do you think so? Shouldn't this branch be perfectly predictable? Once

Not just branch prediction, right? There is also the corresponding load
and the extra I-cache footprint, etc.

I think for the generic build we can either have run-time detection
or stick with LDXR/STXR.

We can give a compile-time option for CASP-based code, so that if it is
optimized for a given microarchitecture, the build can make use of it.
(This is something we can easily express in the meson build using the
MIDR value.)
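
As a rough sketch of the run-time detection option: on Linux/AArch64 the
kernel reports LSE support through getauxval(AT_HWCAP) and the
HWCAP_ATOMICS bit. The enum and function names below are hypothetical
and only stand in for real CASP-based and exclusives-based 128-bit
compare-exchange routines. The idea is to query the capability once at
init time rather than on every operation.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval(), AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_ATOMICS */
#endif

/* Hypothetical back-end identifiers; real code would more likely store
 * function pointers to the two 128-bit compare-exchange implementations.
 */
enum cas128_backend {
	CAS128_BACKEND_EXCLUSIVES, /* load/store-exclusive pair loop */
	CAS128_BACKEND_LSE,        /* CASP, requires the LSE extension */
};

/* Returns true if the kernel reports LSE atomics. On any other platform
 * this sketch conservatively returns false.
 */
static bool
cpu_has_lse_atomics(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ATOMICS)
	return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
	return false;
#endif
}

/* Pick the back end once, e.g. from an init function. */
static enum cas128_backend
select_cas128_backend(void)
{
	return cpu_has_lse_atomics() ? CAS128_BACKEND_LSE
				     : CAS128_BACKEND_EXCLUSIVES;
}
```

Whether the resulting indirect call or branch is cheap enough would
still need to be measured per microarchitecture.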


> in a while
> it will fall out of the branch history table but doesn't that mean
> the
> application hasn't been executing this code for some time so not
> really
> performance critical?
> 
> > if (hwcaps & HWCAP_ATOMICS) {
> > 	casp
> > } else {
> > 	ldxp
> > 	stxp
> > }
> > 
> > > From an ARM perspective, I want all atomic operations to take
> > > memory
> > > ordering arguments (e.g. acquire, release). Not all usages of
> > > e.g.
> > +1
> > 
> > > atomic compare exchange require sequential consistency (which I
> > > think
> > > what x86 cmpxchg instruction provides). DPDK functions should not
> > > be
> > > modelled after x86 behaviour.
> > > 
> > > Lock-free 128-bit atomics implementations for ARM/AArch64 and
> > > x86-64
> > > are available here:
> > > https://github.com/ARM-software/progress64/blob/master/src/lockfree.h
> > > 
> -- 
> Ola Liljedahl, Networking System Architect, Arm
> Phone +46706866373, Skype ola.liljedahl
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-25 17:21             ` Eads, Gage
  2019-01-28 10:35               ` Ola Liljedahl
@ 2019-01-28 13:34               ` Jerin Jacob Kollanukkaran
  2019-01-28 13:43                 ` Ola Liljedahl
  2019-01-28 18:59                 ` Eads, Gage
  1 sibling, 2 replies; 123+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-01-28 13:34 UTC (permalink / raw)
  To: Ola.Liljedahl, Maciej Czekaj, gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> > -----Original Message-----
> > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> > Sent: Wednesday, January 23, 2019 4:16 AM
> > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> > <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> > arybchenko@solarflare.com; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > implementation
> > 
> > s.
> > > > 
> > > You can tell this code was written when I thought x86-64 was the
> > > only
> > > viable target :). Yes, you are correct.
> > > 
> > > With regards to using __atomic intrinsics, I'm planning on taking
> > > a
> > > similar approach to the functions duplicated in
> > > rte_ring_generic.h and
> > > rte_ring_c11_mem.h: one version that uses rte_atomic functions
> > > (and
> > > thus stricter memory ordering) and one that uses __atomic
> > > intrinsics
> > > (and thus can benefit from more relaxed memory ordering).
> > What's the advantage of having two different implementations? What
> > is the
> > disadvantage?
> > 
> > The existing ring buffer code originally had only the "legacy"
> > implementation
> > which was kept when the __atomic implementation was added. The
> > reason
> > claimed was that some older compilers for x86 do not support GCC
> > __atomic
> > builtins. But I thought there was consensus that new functionality
> > could have
> > only __atomic implementations.
> > 
> 
> When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left
> disabled for thunderx[1] for performance reasons. Assuming that
> hasn't changed, the advantage to having two versions is to best
> support all of DPDK's platforms. The disadvantage is of course
> duplicated code and the additional maintenance burden.
> 
> That said, if the thunderx maintainers are ok with it, I'm certainly 

The ring code is such a fundamental building block for DPDK, there was a
difference in performance, and there was already legacy code, so
introducing C11_MEM_MODEL was justified IMO.

For the non-blocking implementation, I am happy to test on three ARM64
microarchitectures and share the C11_MEM_MODEL vs. non-C11_MEM_MODEL
performance results. We may need to consider PPC here as well. So IMO,
we can decide the direction of the new code based on the overall
performance results.
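
For reference, the difference between the two styles being compared can
be sketched with the GCC builtins. This is only an illustration, not
code from the patch, and the function names are made up. The first
helper stands in for the legacy style, where every compare-and-set is a
full barrier; the second shows the __atomic style, where the ordering is
chosen per call site.

```c
#include <stdint.h>

/* Legacy style: the __sync builtins always act as a full barrier. */
static inline int
cmpset64_full_barrier(volatile uint64_t *dst, uint64_t exp, uint64_t src)
{
	return __sync_bool_compare_and_swap(dst, exp, src);
}

/* C11 style: the caller picks the ordering; this variant asks for none
 * beyond atomicity. On failure, *exp is refreshed with the value that
 * was actually observed, which saves a reload in retry loops.
 */
static inline int
cmpset64_relaxed(uint64_t *dst, uint64_t *exp, uint64_t src)
{
	return __atomic_compare_exchange_n(dst, exp, src, 0 /* strong */,
					   __ATOMIC_RELAXED,
					   __ATOMIC_RELAXED);
}
```

On x86 both typically compile to the same locked instruction, so any
difference should only show up on weakly ordered machines.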


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-28 13:34               ` Jerin Jacob Kollanukkaran
@ 2019-01-28 13:43                 ` Ola Liljedahl
  2019-01-28 14:04                   ` Jerin Jacob Kollanukkaran
  2019-01-28 18:59                 ` Eads, Gage
  1 sibling, 1 reply; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-28 13:43 UTC (permalink / raw)
  To: jerinj, mczekaj, gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

On Mon, 2019-01-28 at 13:34 +0000, Jerin Jacob Kollanukkaran wrote:
> On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> > 
> > > 
> > > -----Original Message-----
> > > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> > > Sent: Wednesday, January 23, 2019 4:16 AM
> > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> > > <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> > > arybchenko@solarflare.com; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > > implementation
> > > 
> > > s.
> > > > 
> > > > > 
> > > > > 
> > > > You can tell this code was written when I thought x86-64 was the
> > > > only
> > > > viable target :). Yes, you are correct.
> > > > 
> > > > With regards to using __atomic intrinsics, I'm planning on taking
> > > > a
> > > > similar approach to the functions duplicated in
> > > > rte_ring_generic.h and
> > > > rte_ring_c11_mem.h: one version that uses rte_atomic functions
> > > > (and
> > > > thus stricter memory ordering) and one that uses __atomic
> > > > intrinsics
> > > > (and thus can benefit from more relaxed memory ordering).
> > > What's the advantage of having two different implementations? What
> > > is the
> > > disadvantage?
> > > 
> > > The existing ring buffer code originally had only the "legacy"
> > > implementation
> > > which was kept when the __atomic implementation was added. The
> > > reason
> > > claimed was that some older compilers for x86 do not support GCC
> > > __atomic
> > > builtins. But I thought there was consensus that new functionality
> > > could have
> > > only __atomic implementations.
> > > 
> > When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left
> > disabled for thunderx[1] for performance reasons. Assuming that
> > hasn't changed, the advantage to having two versions is to best
> > support all of DPDK's platforms. The disadvantage is of course
> > duplicated code and the additional maintenance burden.
> > 
> > That said, if the thunderx maintainers are ok with it, I'm certainly 
> The ring code was so fundamental building block for DPDK, there was 
> difference in performance and there was already legacy code so
> introducing C11_MEM_MODEL was justified IMO. 
> 
> For the nonblocking implementation, I am happy to test with
> three ARM64 microarchitectures and share the result with C11_MEM_MODEL
> vs non C11_MEM_MODLE performance.
We should ensure the C11 memory model version enforces minimal ordering
requirements:
1) when computing number of available slots, allow for underflow (head and tail
observed in unexpected order) instead of imposing read order with an additional
read barrier.
2) We could cheat a little and use an explicit LoadStore barrier instead of
 store-release/cas-release in dequeue (which only reads the ring). At least see
if this improves performance. See such a patch here:
https://github.com/ARM-software/progress64/commit/84c48e9c84100eb5b2d15e54f0dbf7
8dfa468805

Ideally, C/C++ would have an __ATOMIC_RELEASE_READSONLY memory model to use in
situations where the shared data was only read before being released.
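
Point 1 above could look roughly like the sketch below. The helper name
and the clamping policy are assumptions for illustration, not code from
DPDK or progress64. The head is loaded relaxed and the tail with
acquire, so nothing stops the head load from being satisfied after the
tail load. If the head then appears to be ahead of the tail, the ring is
simply reported as empty instead of adding a read barrier to rule that
case out.

```c
#include <stdint.h>

/* Number of entries a consumer may dequeue, tolerating head/tail being
 * observed out of order. A negative difference is clamped to 0.
 * Assumes the true number of entries always fits in 32 bits.
 */
static inline uint32_t
ring_entries_tolerant(const uint64_t *prod_tail, const uint64_t *cons_head)
{
	uint64_t head = __atomic_load_n(cons_head, __ATOMIC_RELAXED);
	uint64_t tail = __atomic_load_n(prod_tail, __ATOMIC_ACQUIRE);
	int64_t diff = (int64_t)(tail - head);

	return diff > 0 ? (uint32_t)diff : 0;
}
```

A stale view only makes the result conservative; the subsequent CAS on
the head still decides whether the dequeue actually proceeds.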

>  We may need to consider PPC also
> here. So IMO, based on the overall performance result may be can decide
> the new code direction.
Does PPC (64-bit POWER?) have support for double-word (128-bit) CAS?

> 
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-28 13:43                 ` Ola Liljedahl
@ 2019-01-28 14:04                   ` Jerin Jacob Kollanukkaran
  2019-01-28 14:06                     ` Ola Liljedahl
  0 siblings, 1 reply; 123+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-01-28 14:04 UTC (permalink / raw)
  To: Ola.Liljedahl, Maciej Czekaj, gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

On Mon, 2019-01-28 at 13:43 +0000, Ola Liljedahl wrote:
> On Mon, 2019-01-28 at 13:34 +0000, Jerin Jacob Kollanukkaran wrote:
> > On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> > > > -----Original Message-----
> > > > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> > > > Sent: Wednesday, January 23, 2019 4:16 AM
> > > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > > Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> > > > <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> > > > arybchenko@solarflare.com; Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > > > implementation
> > > > 
> > > > s.
> > > > > > 
> > > > > You can tell this code was written when I thought x86-64 was
> > > > > the
> > > > > only
> > > > > viable target :). Yes, you are correct.
> > > > > 
> > > > > With regards to using __atomic intrinsics, I'm planning on
> > > > > taking
> > > > > a
> > > > > similar approach to the functions duplicated in
> > > > > rte_ring_generic.h and
> > > > > rte_ring_c11_mem.h: one version that uses rte_atomic
> > > > > functions
> > > > > (and
> > > > > thus stricter memory ordering) and one that uses __atomic
> > > > > intrinsics
> > > > > (and thus can benefit from more relaxed memory ordering).
> > > > What's the advantage of having two different implementations?
> > > > What
> > > > is the
> > > > disadvantage?
> > > > 
> > > > The existing ring buffer code originally had only the "legacy"
> > > > implementation
> > > > which was kept when the __atomic implementation was added. The
> > > > reason
> > > > claimed was that some older compilers for x86 do not support
> > > > GCC
> > > > __atomic
> > > > builtins. But I thought there was consensus that new
> > > > functionality
> > > > could have
> > > > only __atomic implementations.
> > > > 
> > > When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was
> > > left
> > > disabled for thunderx[1] for performance reasons. Assuming that
> > > hasn't changed, the advantage to having two versions is to best
> > > support all of DPDK's platforms. The disadvantage is of course
> > > duplicated code and the additional maintenance burden.
> > > 
> > > That said, if the thunderx maintainers are ok with it, I'm
> > > certainly 
> > The ring code was so fundamental building block for DPDK, there
> > was 
> > difference in performance and there was already legacy code so
> > introducing C11_MEM_MODEL was justified IMO. 
> > 
> > For the nonblocking implementation, I am happy to test with
> > three ARM64 microarchitectures and share the result with
> > C11_MEM_MODEL
> > vs non C11_MEM_MODLE performance.
> We should ensure the C11 memory model version enforces minimal
> ordering
> requirements:

I agree.

I think we should have enough performance test cases to choose between
algorithms and to quantify the other variables, such as C11 vs. non-C11,
LDXP/STXP vs. CASP, etc.


> 1) when computing number of available slots, allow for underflow
> (head and tail
> observed in unexpected order) instead of imposing read order with an
> additional
> read barrier.
> 2) We could cheat a little and use an explicit LoadStore barrier
> instead of
>  store-release/cas-release in dequeue (which only reads the ring). At
> least see
> if this improves performance. See such a patch here:
> https://github.com/ARM-software/progress64/commit/84c48e9c84100eb5b2d15e54f0dbf7
> 8dfa468805
> 
> Ideally, C/C++ would have an __ATOMIC_RELEASE_READSONLY memory model
> to use in
> situations where the shared data was only read before being released.
> 
> >  We may need to consider PPC also
> > here. So IMO, based on the overall performance result may be can
> > decide
> > the new code direction.
> Does PPC (64-bit POWER?) have support for double-word (128-bit) CAS?

I don't know; I was speaking about the C11 memory model on PPC in general.

> 
> -- 
> Ola Liljedahl, Networking System Architect, Arm
> Phone +46706866373, Skype ola.liljedahl
> 

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-28 14:04                   ` Jerin Jacob Kollanukkaran
@ 2019-01-28 14:06                     ` Ola Liljedahl
  0 siblings, 0 replies; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-28 14:06 UTC (permalink / raw)
  To: jerinj, mczekaj, gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

On Mon, 2019-01-28 at 14:04 +0000, Jerin Jacob Kollanukkaran wrote:
> > Does PPC (64-bit POWER?) have support for double-word (128-bit) CAS?
> 
> I dont know, I was telling wrt in general C11 mem model for PPC.
Sorry, I misunderstood.

-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v4 0/5] Add non-blocking ring
  2019-01-18 15:23   ` [dpdk-dev] [PATCH v3 " Gage Eads
                       ` (6 preceding siblings ...)
  2019-01-25  5:20     ` [dpdk-dev] " Honnappa Nagarahalli
@ 2019-01-28 18:14     ` Gage Eads
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 1/5] ring: add 64-bit headtail structure Gage Eads
                         ` (5 more replies)
  7 siblings, 6 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-28 18:14 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a non-blocking ring, on top of which a mempool can run.
Crucially, the non-blocking algorithm relies on a 128-bit compare-and-swap, so
it is currently limited to x86_64 machines. This is also an experimental API,
so RING_F_NB users must build with the ALLOW_EXPERIMENTAL_API flag.
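
To illustrate why a double-width CAS is needed, here is a scaled-down
sketch. It packs a 32-bit value and a 32-bit modification counter into
one 64-bit word, so an ordinary 64-bit CAS updates both at once. The
names, widths and packing are assumptions made for illustration only; on
x86_64 a pointer paired with a 64-bit counter is 128 bits wide, which is
where the 128-bit operation comes in.

```c
#include <stdint.h>

/* Pack a value and its modification counter into a single word. */
static inline uint64_t
slot_pack(uint32_t val, uint32_t cnt)
{
	return ((uint64_t)cnt << 32) | val;
}

/* Replace the value only if both the value and the counter still match
 * what the caller saw. The counter is bumped on every successful
 * update, so an A -> B -> A sequence is detected by a stale caller.
 */
static inline int
slot_update(uint64_t *slot, uint32_t old_val, uint32_t old_cnt,
	    uint32_t new_val)
{
	uint64_t expected = slot_pack(old_val, old_cnt);
	uint64_t desired = slot_pack(new_val, old_cnt + 1);

	return __atomic_compare_exchange_n(slot, &expected, desired, 0,
					   __ATOMIC_ACQ_REL,
					   __ATOMIC_RELAXED);
}
```

A caller that read (7, 0), was preempted while the slot went to 9 and
back to 7, and then retries with (7, 0) fails the compare because the
counter has moved on.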

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations and a
dequeue of n pointers uses 2. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the non-blocking ring's.
- Software caching can mitigate the average case performance for ring-based
  algorithms. For example, a non-blocking ring based mempool (a likely use case
  for this ring) with per-thread caching.
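
As a worked example of the uncontended costs quoted above (the helper
names are illustrative only): a burst of 32 pointers costs 65 CAS
operations to enqueue and 2 to dequeue, so the per-pointer enqueue cost
approaches 2 CAS operations as the burst grows.

```c
/* CAS operations for an uncontended non-blocking enqueue of n pointers. */
static inline unsigned int
nb_enqueue_cas_ops(unsigned int n)
{
	return 1 + 2 * n;
}

/* CAS operations for an uncontended non-blocking dequeue of n pointers;
 * the count does not depend on n.
 */
static inline unsigned int
nb_dequeue_cas_ops(unsigned int n)
{
	(void)n;
	return 2;
}
```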

The non-blocking ring is enabled via a new flag, RING_F_NB. For ease-of-use,
existing ring enqueue/dequeue functions work with both "regular" and
non-blocking rings.

This patchset also adds non-blocking versions of ring_autotest and
ring_perf_autotest, and a non-blocking ring based mempool.

This patchset makes one API change; a deprecation notice will be posted in a
separate commit.

This patchset depends on the 128-bit compare-and-set patch[1].

[1] http://mails.dpdk.org/archives/dev/2019-January/124159.html

v4:
 - Split out nb_enqueue and nb_dequeue functions in generic and C11 versions,
   with the necessary memory ordering behavior for weakly consistent machines.
 - Convert size_t variables (from v2) to uint64_t and no-longer-applicable
   comment about variably-sized ring indexes.
 - Fix bug in nb_enqueue_mp that breaks the non-blocking guarantee.
 - Split the ring_ptr cast into two lines.
 - Change the dependent patchset from the non-blocking stack patch series
   to one only containing the 128b CAS commit

v3:
 - Avoid the ABI break by putting 64-bit head and tail values in the same
   cacheline as struct rte_ring's prod and cons members.
 - Don't attempt to compile rte_atomic128_cmpset without
   ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
 - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone tries
   to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
 - Update the ring mempool to use experimental APIs
 - Clarify that RING_F_NB is currently limited to x86_64; e.g. ARMv8 has the
   ISA support for 128-bit CAS and can eventually support it.

v2:
 - Merge separate docs commit into patch #5
 - Convert uintptr_t to size_t
 - Add a compile-time check for the size of size_t
 - Fix a space-after-typecast issue
 - Fix an unnecessary-parentheses checkpatch warning
 - Bump librte_ring's library version

Gage Eads (5):
  ring: add 64-bit headtail structure
  ring: add a non-blocking implementation
  test_ring: add non-blocking ring autotest
  test_ring_perf: add non-blocking ring perf test
  mempool/ring: add non-blocking ring handlers

 doc/guides/prog_guide/env_abstraction_layer.rst |   5 +
 drivers/mempool/ring/Makefile                   |   1 +
 drivers/mempool/ring/meson.build                |   2 +
 drivers/mempool/ring/rte_mempool_ring.c         |  58 +++-
 lib/librte_ring/rte_ring.c                      |  72 +++-
 lib/librte_ring/rte_ring.h                      | 336 +++++++++++++++++--
 lib/librte_ring/rte_ring_c11_mem.h              | 427 ++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h              | 408 ++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map            |   7 +
 test/test/test_ring.c                           |  57 ++--
 test/test/test_ring_perf.c                      |  19 +-
 11 files changed, 1319 insertions(+), 73 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v4 1/5] ring: add 64-bit headtail structure
  2019-01-28 18:14     ` [dpdk-dev] [PATCH v4 " Gage Eads
@ 2019-01-28 18:14       ` Gage Eads
  2019-01-29 12:56         ` Ola Liljedahl
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 2/5] ring: add a non-blocking implementation Gage Eads
                         ` (4 subsequent siblings)
  5 siblings, 1 reply; 123+ messages in thread
From: Gage Eads @ 2019-01-28 18:14 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

64-bit head and tail indexes greatly increase the time it takes for
them to wrap-around (with current CPU speeds, it won't happen within the
author's lifetime). This is fundamental to avoiding the ABA problem -- in
which a thread mistakes reading the same tail index in two accesses to mean
that the ring was not modified in the intervening time -- in the upcoming
non-blocking ring implementation. Using a 64-bit index makes the
possibility of this occurring effectively zero.
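
As a back-of-the-envelope check, assume (purely for illustration) that
an index is incremented one billion times per second. A 32-bit index
then wraps after roughly 4.3 seconds, while a 64-bit index needs roughly
584 years. The helper name below is made up.

```c
/* Seconds until an index of the given width wraps around, assuming it
 * is incremented ops_per_sec times per second.
 */
static inline double
index_wrap_seconds(unsigned int index_bits, double ops_per_sec)
{
	double states = 1.0;
	unsigned int i;

	/* 2^index_bits distinct index values */
	for (i = 0; i < index_bits; i++)
		states *= 2.0;

	return states / ops_per_sec;
}
```

Dividing index_wrap_seconds(64, 1e9) by the 31557600 seconds in a
365.25-day year gives the figure in years.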

This commit places the new producer and consumer structures in the same
location in struct rte_ring as their 32-bit counterparts. Since the 32-bit
versions are padded out to a cache line, there is space for the new
structure without affecting the layout of struct rte_ring. Thus, the ABI is
preserved.
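
The layout argument can be checked with compile-time assertions on a
simplified mock-up. The structures below are not the real struct
rte_ring (which has more leading fields), and the 64-byte cache line is
an assumption; they only mirror the prod/cons arrangement touched by
this commit.

```c
#include <stddef.h>
#include <stdint.h>

#define MOCK_CACHE_LINE 64 /* assumed cache line size */
#define MOCK_ALIGNED __attribute__((aligned(MOCK_CACHE_LINE)))

struct mock_headtail {
	volatile uint32_t head, tail;
	uint32_t single;
};

struct mock_headtail_64 {
	volatile uint64_t head, tail;
	uint32_t single;
};

/* Layout before the change: 32-bit structures only. */
struct mock_ring_before {
	struct mock_headtail prod MOCK_ALIGNED;
	char pad1 MOCK_ALIGNED;
	struct mock_headtail cons MOCK_ALIGNED;
	char pad2 MOCK_ALIGNED;
};

/* Layout after the change: each structure shares its cache line with a
 * 64-bit variant through an anonymous union.
 */
struct mock_ring_after {
	union {
		struct mock_headtail prod MOCK_ALIGNED;
		struct mock_headtail_64 prod_64 MOCK_ALIGNED;
	};
	char pad1 MOCK_ALIGNED;
	union {
		struct mock_headtail cons MOCK_ALIGNED;
		struct mock_headtail_64 cons_64 MOCK_ALIGNED;
	};
	char pad2 MOCK_ALIGNED;
};

_Static_assert(sizeof(struct mock_headtail_64) <= MOCK_CACHE_LINE,
	       "64-bit head/tail must fit in the existing padding");
_Static_assert(sizeof(struct mock_ring_before) ==
	       sizeof(struct mock_ring_after),
	       "overall size must not change");
_Static_assert(offsetof(struct mock_ring_before, cons) ==
	       offsetof(struct mock_ring_after, cons),
	       "consumer offset must not change");
_Static_assert(offsetof(struct mock_ring_before, pad2) ==
	       offsetof(struct mock_ring_after, pad2),
	       "trailing padding offset must not change");
```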

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.h         |  23 +++++-
 lib/librte_ring/rte_ring_c11_mem.h | 153 +++++++++++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h | 139 +++++++++++++++++++++++++++++++++
 3 files changed, 312 insertions(+), 3 deletions(-)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index af5444a9f..00dfb5b85 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -70,6 +70,15 @@ struct rte_ring_headtail {
 	uint32_t single;         /**< True if single prod/cons */
 };
 
+/* 64-bit version of rte_ring_headtail, for use by rings that need to avoid
+ * head/tail wrap-around.
+ */
+struct rte_ring_headtail_64 {
+	volatile uint64_t head;  /**< Prod/consumer head. */
+	volatile uint64_t tail;  /**< Prod/consumer tail. */
+	uint32_t single;       /**< True if single prod/cons */
+};
+
 /**
  * An RTE ring structure.
  *
@@ -97,11 +106,19 @@ struct rte_ring {
 	char pad0 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring producer status. */
-	struct rte_ring_headtail prod __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail prod __rte_cache_aligned;
+		struct rte_ring_headtail_64 prod_64 __rte_cache_aligned;
+	};
 	char pad1 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring consumer status. */
-	struct rte_ring_headtail cons __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail cons __rte_cache_aligned;
+		struct rte_ring_headtail_64 cons_64 __rte_cache_aligned;
+	};
 	char pad2 __rte_cache_aligned; /**< empty cache line */
 };
 
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a337..47acd4c7c 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -178,4 +178,157 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     64-bit head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_64(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uint64_t *old_head, uint64_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	uint64_t cons_tail;
+	unsigned int max = n;
+	int success;
+
+	*old_head = __atomic_load_n(&r->prod_64.head, __ATOMIC_RELAXED);
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		cons_tail = __atomic_load_n(&r->cons_64.tail,
+					__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 64-bit values
+		 * (the result is always modulo 64 bits even if we have
+		 * *old_head > cons_tail). So 'free_entries' is always between 0
+		 * and capacity (which is < size).
+		 */
+		*free_entries = (capacity + cons_tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_64.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head is updated */
+			success = __atomic_compare_exchange_n(&r->prod_64.head,
+					old_head, *new_head,
+					0, __ATOMIC_RELAXED,
+					__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     64-bit head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_64(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uint64_t *old_head, uint64_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	uint64_t prod_tail;
+	int success;
+
+	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons_64.head, __ATOMIC_RELAXED);
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		prod_tail = __atomic_load_n(&r->prod_64.tail,
+					__ATOMIC_ACQUIRE);
+
+		/* The subtraction is done between two unsigned 64-bit values
+		 * (the result is always modulo 64 bits even if we have
+		 * cons_head > prod_tail). So 'entries' is always between 0
+		 * and size(ring)-1.
+		 */
+		*entries = (prod_tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_64.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head will be updated */
+			success = __atomic_compare_exchange_n(&r->cons_64.head,
+							old_head, *new_head,
+							0, __ATOMIC_RELAXED,
+							__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_C11_MEM_H_ */
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..2158e092a 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -167,4 +167,143 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     64-bit head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_64(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uint64_t *old_head, uint64_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		*old_head = r->prod_64.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		/*
+		 *  The subtraction is done between two unsigned 64bits value
+		 * (the result is always modulo 64 bits even if we have
+		 * *old_head > cons_tail). So 'free_entries' is always between 0
+		 * and capacity (which is < size).
+		 */
+		*free_entries = (capacity + r->cons_64.tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_64.head = *new_head, success = 1;
+		else
+			success = rte_atomic64_cmpset(&r->prod_64.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     64-bit head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_64(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uint64_t *old_head, uint64_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		*old_head = r->cons_64.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		/* The subtraction is done between two unsigned 64-bit values
+		 * (the result is always modulo 2^64 even if we have
+		 * cons_head > prod_tail). So 'entries' is always between 0
+		 * and size(ring)-1.
+		 */
+		*entries = (r->prod_64.tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_64.head = *new_head, success = 1;
+		else
+			success = rte_atomic64_cmpset(&r->cons_64.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_GENERIC_H_ */
-- 
2.13.6


* [dpdk-dev] [PATCH v4 2/5] ring: add a non-blocking implementation
  2019-01-28 18:14     ` [dpdk-dev] [PATCH v4 " Gage Eads
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 1/5] ring: add 64-bit headtail structure Gage Eads
@ 2019-01-28 18:14       ` Gage Eads
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 3/5] test_ring: add non-blocking ring autotest Gage Eads
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-28 18:14 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

This commit adds support for non-blocking circular ring enqueue and dequeue
functions. The ring uses a 128-bit compare-and-swap instruction, and thus
is currently limited to x86_64.

The algorithm is based on the original rte ring (derived from FreeBSD's
bufring.h) and inspired by Michael and Scott's non-blocking concurrent
queue. Importantly, it adds a modification counter to each ring entry to
ensure only one thread can write to an unused entry.

-----
Algorithm:

Multi-producer non-blocking enqueue:
1. Move the producer head index 'n' locations forward, effectively
   reserving 'n' locations.
2. For each pointer:
 a. Read the producer tail index, then ring[tail]. If ring[tail]'s
    modification counter isn't 'tail', retry.
 b. Construct the new entry: {pointer, tail + ring size}
 c. Compare-and-swap the old entry with the new. If unsuccessful, the
    next loop iteration will try to enqueue this pointer again.
 d. Compare-and-swap the tail index with 'tail + 1', whether or not step 2c
    succeeded. This guarantees threads can make forward progress.
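
Steps 2a-2d for a single pointer can be sketched as the following
single-threaded model. It is not the patch's code: the model_* names are
hypothetical, and the two "cas" helpers are plain, non-atomic stand-ins for
the 128-bit entry compare-and-swap and the 64-bit tail compare-and-swap. It
only illustrates how the modification counter decides whether an entry may
be written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_RING_SIZE 8u /* hypothetical size, must be a power of two */
#define MODEL_RING_MASK (MODEL_RING_SIZE - 1u)

/* Same layout idea as the patch's nb_ring_entry: pointer plus counter. */
struct model_entry {
	void *ptr;
	uint64_t cnt;
};

struct model_ring {
	uint64_t prod_tail;
	struct model_entry ring[MODEL_RING_SIZE];
};

/* Slot i starts with counter i, as rte_ring_init() does for RING_F_NB. */
static void model_init(struct model_ring *r)
{
	assert((MODEL_RING_SIZE & MODEL_RING_MASK) == 0);
	r->prod_tail = 0;
	for (uint64_t i = 0; i < MODEL_RING_SIZE; i++) {
		r->ring[i].ptr = NULL;
		r->ring[i].cnt = i;
	}
}

/* NOT atomic: stand-in for the 128-bit compare-and-swap on one entry. */
static bool model_entry_cas(struct model_entry *slot,
			    struct model_entry expected,
			    struct model_entry desired)
{
	if (slot->ptr != expected.ptr || slot->cnt != expected.cnt)
		return false;
	*slot = desired;
	return true;
}

/* NOT atomic: stand-in for the 64-bit compare-and-swap on the tail. */
static bool model_tail_cas(uint64_t *tail, uint64_t expected,
			   uint64_t desired)
{
	if (*tail != expected)
		return false;
	*tail = desired;
	return true;
}

/* Steps 2a-2d for one pointer. Returns true if the pointer was written;
 * on false the caller retries with the updated tail.
 */
static bool model_enqueue_one(struct model_ring *r, void *obj)
{
	uint64_t tail = r->prod_tail;                            /* 2a */
	struct model_entry *slot = &r->ring[tail & MODEL_RING_MASK];
	struct model_entry old = *slot;
	bool written = false;

	if (old.cnt == tail) {
		struct model_entry new_entry = {                 /* 2b */
			.ptr = obj,
			.cnt = tail + MODEL_RING_SIZE,
		};
		written = model_entry_cas(slot, old, new_entry); /* 2c */
	}
	/* 2d: advance the tail whether or not the write succeeded. */
	model_tail_cas(&r->prod_tail, tail, tail + 1);
	return written;
}
```

In this model, a slot whose counter no longer equals the tail index is left
untouched and only the tail is advanced, which corresponds to the "retry"
in step 2a.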

Multi-consumer non-blocking dequeue:
1. Move the consumer head index 'n' locations forward, effectively
   reserving 'n' pointers to be dequeued.
2. Copy 'n' pointers into the caller's object table (ignoring the
   modification counter), starting from ring[tail], then compare-and-swap
   the tail index with 'tail + n'.  If unsuccessful, repeat step 2.
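
Dequeue step 2 can be sketched in the same single-threaded style. The dq_*
names are hypothetical, the tail compare-and-swap is a non-atomic stand-in,
and the sketch assumes step 1 has already reserved 'n' entries.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DQ_RING_SIZE 8u /* hypothetical size, must be a power of two */
#define DQ_RING_MASK (DQ_RING_SIZE - 1u)

struct dq_entry {
	void *ptr;
	uint64_t cnt; /* modification counter, ignored on dequeue */
};

struct dq_ring {
	uint64_t cons_tail;
	struct dq_entry ring[DQ_RING_SIZE];
};

/* NOT atomic: stand-in for the 64-bit compare-and-swap on the tail. */
static bool dq_tail_cas(uint64_t *tail, uint64_t expected, uint64_t desired)
{
	if (*tail != expected)
		return false;
	*tail = desired;
	return true;
}

/* Step 2: copy n pointers starting at ring[tail], then try to move the
 * tail forward by n. If the tail changed in the meantime, copy again.
 */
static void dq_dequeue_reserved(struct dq_ring *r, void **obj_table,
				unsigned int n)
{
	uint64_t tail;

	assert(n <= DQ_RING_SIZE);
	do {
		tail = r->cons_tail;
		for (unsigned int i = 0; i < n; i++)
			obj_table[i] = r->ring[(tail + i) & DQ_RING_MASK].ptr;
	} while (!dq_tail_cas(&r->cons_tail, tail, tail + n));
}
```

The copy is repeated from the freshly read tail on every attempt, so the
caller's table only holds the pointers matching the tail value that the
compare-and-swap finally accepted.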

-----
Discussion:

There are two cases where the ABA problem is mitigated:
1. Enqueueing a pointer to the ring: without a modification counter
   tied to the tail index, the index could become stale by the time the
   enqueue happens, causing it to overwrite valid data. Tying the
   counter to the tail index gives us an expected value (as opposed to,
   say, a monotonically incrementing counter).

   Since the counter will eventually wrap, there is potential for the ABA
   problem. However, using a 64-bit counter makes this likelihood
   effectively zero.

2. Updating a tail index: the ABA problem can occur if the thread is
   preempted and the tail index wraps around. However, using 64-bit indexes
   makes this likelihood effectively zero.
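
To put "effectively zero" in numbers, the helper below estimates how long a
64-bit index takes to wrap. The function name and the update rate are
assumptions chosen for illustration, not measurements from this patchset.

```c
#include <assert.h>

/* Years until a 64-bit index wraps, given a constant update rate. */
static double years_to_wrap_u64(double updates_per_second)
{
	const double total_updates = 18446744073709551616.0; /* 2^64 */
	const double seconds_per_year = 365.25 * 24.0 * 3600.0;

	assert(updates_per_second > 0.0);
	return total_updates / updates_per_second / seconds_per_year;
}
```

At an assumed one billion index updates per second,
years_to_wrap_u64(1e9) comes to roughly 584 years, so a preempted thread
would have to stay descheduled for that long to observe a wrapped index.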

With no contention, an enqueue of n pointers uses (1 + 2n) CAS operations
and a dequeue of n pointers uses 2. This algorithm has worse average-case
performance than the regular rte ring (particularly a highly-contended ring
with large bulk accesses), however:
- For applications with preemptible pthreads, the regular rte ring's
  worst-case performance (i.e. one thread being preempted in the
  update_tail() critical section) is much worse than the non-blocking
  ring's.
- Software caching can mitigate the worse average-case performance of
  ring-based algorithms. For example, a non-blocking ring based mempool (a
  likely use case for this ring) can use per-thread caching.
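
The uncontended CAS counts quoted above follow from the algorithm steps and
can be written down as two small helpers (hypothetical names, for
illustration only):

```c
#include <assert.h>

/* Uncontended enqueue of n pointers: one CAS to move the producer head,
 * then one entry CAS and one tail CAS per pointer.
 */
static unsigned int nb_enqueue_cas_ops(unsigned int n)
{
	assert(n > 0);
	return 1 + 2 * n;
}

/* Uncontended dequeue of n pointers: one CAS to move the consumer head
 * and one CAS to move the consumer tail, independent of n.
 */
static unsigned int nb_dequeue_cas_ops(unsigned int n)
{
	assert(n > 0);
	return 2;
}
```

For a burst of 32 pointers this gives 65 CAS operations on enqueue versus 2
on dequeue, which is why large bulk enqueues are where the non-blocking
ring is expected to lose the most ground.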

The non-blocking ring is enabled via a new flag, RING_F_NB. Because the
ring's memsize is now a function of its flags (the non-blocking ring
requires 128 bits for each entry), this commit adds a new argument ('flags') to
rte_ring_get_memsize(). An API deprecation notice will be sent in a
separate commit.
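
The flag-dependent size calculation can be modelled outside of DPDK as
below. This is a sketch, not the library function: hdr_size stands in for
sizeof(struct rte_ring), the 64-byte cache line is an assumption, and the
model_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_RING_F_NB 0x0008u /* same value as RING_F_NB in this patch */
#define MODEL_CACHE_LINE 64u    /* assumed cache line size */

/* Bytes needed for a ring header of hdr_size bytes plus count entries. */
static size_t model_ring_memsize(size_t hdr_size, size_t count,
				 unsigned int flags)
{
	/* Non-blocking entries hold a pointer and a 64-bit counter. */
	size_t elt_sz = (flags & MODEL_RING_F_NB) ?
			2 * sizeof(void *) : sizeof(void *);
	size_t sz = hdr_size + count * elt_sz;

	assert(count != 0 && (count & (count - 1)) == 0); /* power of two */

	/* Round up to a cache line multiple, as RTE_ALIGN() does. */
	return (sz + MODEL_CACHE_LINE - 1) & ~(size_t)(MODEL_CACHE_LINE - 1);
}
```

With a header size that is already a cache line multiple, passing
MODEL_RING_F_NB increases the result by exactly count * sizeof(void *),
i.e. the entry table doubles while the header stays the same.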

For ease-of-use, existing ring enqueue and dequeue functions work on both
regular and non-blocking rings. This introduces an additional branch in
the datapath, but this should be a highly predictable branch.
ring_perf_autotest shows a negligible performance impact; it's hard to
distinguish a real difference versus system noise.

                                  | Cycle difference: ring_perf_autotest
             Test                 |   with the branch minus without it
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | -4.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 0.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | 0.00
SC empty dequeue                  | 1.00
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | 0.49
MP/MC bulk enq/dequeue (size 8)   | 0.08
SP/SC bulk enq/dequeue (size 32)  | 0.07
MP/MC bulk enq/dequeue (size 32)  | 0.09

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | 0.19
MP/MC bulk enq/dequeue (size 8)   | -0.37
SP/SC bulk enq/dequeue (size 32)  | 0.09
MP/MC bulk enq/dequeue (size 32)  | -0.05

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | -1.96
MP/MC bulk enq/dequeue (size 8)   | 0.88
SP/SC bulk enq/dequeue (size 32)  | 0.10
MP/MC bulk enq/dequeue (size 32)  | 0.46

Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. Each test was run three
times and the results were averaged.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.c           |  72 ++++++--
 lib/librte_ring/rte_ring.h           | 313 +++++++++++++++++++++++++++++++----
 lib/librte_ring/rte_ring_c11_mem.h   | 282 ++++++++++++++++++++++++++++++-
 lib/librte_ring/rte_ring_generic.h   | 269 ++++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   7 +
 5 files changed, 896 insertions(+), 47 deletions(-)

diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d215acecc..f3378dccd 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -45,9 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags)
 {
-	ssize_t sz;
+	ssize_t sz, elt_sz;
 
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
@@ -57,10 +57,23 @@ rte_ring_get_memsize(unsigned count)
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	elt_sz = (flags & RING_F_NB) ? 2 * sizeof(void *) : sizeof(void *);
+
+	sz = sizeof(struct rte_ring) + count * elt_sz;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
+BIND_DEFAULT_SYMBOL(rte_ring_get_memsize, _v1905, 19.05);
+MAP_STATIC_SYMBOL(ssize_t rte_ring_get_memsize(unsigned int count,
+					       unsigned int flags),
+		  rte_ring_get_memsize_v1905);
+
+ssize_t
+rte_ring_get_memsize_v20(unsigned int count)
+{
+	return rte_ring_get_memsize_v1905(count, 0);
+}
+VERSION_SYMBOL(rte_ring_get_memsize, _v20, 2.0);
 
 int
 rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
@@ -82,8 +95,6 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	if (ret < 0 || ret >= (int)sizeof(r->name))
 		return -ENAMETOOLONG;
 	r->flags = flags;
-	r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
-	r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
 
 	if (flags & RING_F_EXACT_SZ) {
 		r->size = rte_align32pow2(count + 1);
@@ -100,8 +111,30 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 		r->mask = count - 1;
 		r->capacity = r->mask;
 	}
-	r->prod.head = r->cons.head = 0;
-	r->prod.tail = r->cons.tail = 0;
+
+	if (flags & RING_F_NB) {
+		uint64_t i;
+
+		r->prod_64.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons_64.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod_64.head = r->cons_64.head = 0;
+		r->prod_64.tail = r->cons_64.tail = 0;
+
+		for (i = 0; i < r->size; i++) {
+			struct nb_ring_entry *ring_ptr, *base;
+
+			base = ((struct nb_ring_entry *)&r[1]);
+
+			ring_ptr = &base[i & r->mask];
+
+			ring_ptr->cnt = i;
+		}
+	} else {
+		r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod.head = r->cons.head = 0;
+		r->prod.tail = r->cons.tail = 0;
+	}
 
 	return 0;
 }
@@ -123,11 +156,19 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 
 	ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
 
+#if !defined(RTE_ARCH_X86_64)
+	if (flags & RING_F_NB) {
+		printf("RING_F_NB is only supported on x86-64 platforms\n");
+		rte_errno = EINVAL;
+		return NULL;
+	}
+#endif
+
 	/* for an exact size ring, round up from count to a power of two */
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize(count, flags);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -227,10 +268,17 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
 	fprintf(f, "  flags=%x\n", r->flags);
 	fprintf(f, "  size=%"PRIu32"\n", r->size);
 	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
-	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
-	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
-	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
-	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	if (r->flags & RING_F_NB) {
+		fprintf(f, "  ct=%"PRIu64"\n", r->cons_64.tail);
+		fprintf(f, "  ch=%"PRIu64"\n", r->cons_64.head);
+		fprintf(f, "  pt=%"PRIu64"\n", r->prod_64.tail);
+		fprintf(f, "  ph=%"PRIu64"\n", r->prod_64.head);
+	} else {
+		fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
+		fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
+		fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
+		fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	}
 	fprintf(f, "  used=%u\n", rte_ring_count(r));
 	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
 }
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 00dfb5b85..c3d388c95 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -20,12 +20,16 @@
  *
  * - FIFO (First In First Out)
  * - Maximum size is fixed; the pointers are stored in a table.
- * - Lockless implementation.
+ * - Lockless (and optionally, non-blocking) implementation.
  * - Multi- or single-consumer dequeue.
  * - Multi- or single-producer enqueue.
  * - Bulk dequeue.
  * - Bulk enqueue.
  *
+ * The non-blocking ring algorithm is based on the original rte ring (derived
+ * from FreeBSD's bufring.h) and inspired by Michael and Scott's non-blocking
+ * concurrent queue.
+ *
  * Note: the ring implementation is not preemptible. Refer to Programmer's
  * guide/Environment Abstraction Layer/Multiple pthread/Known Issues/rte_ring
  * for more information.
@@ -134,6 +138,18 @@ struct rte_ring {
  */
 #define RING_F_EXACT_SZ 0x0004
 #define RTE_RING_SZ_MASK  (0x7fffffffU) /**< Ring size mask */
+/**
+ * The ring uses non-blocking enqueue and dequeue functions. These functions
+ * do not have the "non-preemptive" constraint of a regular rte ring, and thus
+ * are suited for applications using preemptible pthreads. However, the
+ * non-blocking functions have worse average-case performance than their
+ * regular rte ring counterparts. When used as the handler for a mempool,
+ * per-thread caching can mitigate the performance difference by reducing the
+ * number (and contention) of ring accesses.
+ *
+ * This flag is only supported on x86_64 platforms.
+ */
+#define RING_F_NB 0x0008
 
 /* @internal defines for passing to the enqueue dequeue worker functions */
 #define __IS_SP 1
@@ -151,11 +167,15 @@ struct rte_ring {
  *
  * @param count
  *   The number of elements in the ring (must be a power of 2).
+ * @param flags
+ *   The flags the ring will be created with.
  * @return
  *   - The memory size needed for the ring on success.
  *   - -EINVAL if count is not a power of 2.
  */
-ssize_t rte_ring_get_memsize(unsigned count);
+ssize_t rte_ring_get_memsize(unsigned int count, unsigned int flags);
+ssize_t rte_ring_get_memsize_v20(unsigned int count);
+ssize_t rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags);
 
 /**
  * Initialize a ring structure.
@@ -188,6 +208,10 @@ ssize_t rte_ring_get_memsize(unsigned count);
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
+ *      non-blocking variants of the dequeue and enqueue functions.
  * @return
  *   0 on success, or a negative value on error.
  */
@@ -223,12 +247,17 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_NB: (x86_64 only) If this flag is set, the ring uses
+ *      non-blocking variants of the dequeue and enqueue functions.
  * @return
  *   On success, the pointer to the new allocated ring. NULL on error with
  *    rte_errno set appropriately. Possible errno values include:
  *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
  *    - E_RTE_SECONDARY - function was called from a secondary process instance
- *    - EINVAL - count provided is not a power of 2
+ *    - EINVAL - count provided is not a power of 2, or RING_F_NB is used on an
+ *      unsupported platform
  *    - ENOSPC - the maximum number of memzones has already been allocated
  *    - EEXIST - a memzone with the same name already exists
  *    - ENOMEM - no appropriate memory area found in which to create memzone
@@ -284,6 +313,50 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual enqueue of pointers on the ring.
+ * Used only by the single-producer non-blocking enqueue function, but
+ * out-lined here for code readability.
+ */
+#define ENQUEUE_PTRS_NB(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	size_t idx = prod_head & (r)->mask; \
+	size_t new_cnt = prod_head + size; \
+	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
+			ring[idx].ptr = obj_table[i]; \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx + 1].ptr = obj_table[i + 1]; \
+			ring[idx + 1].cnt = new_cnt + i + 1;  \
+			ring[idx + 2].ptr = obj_table[i + 2]; \
+			ring[idx + 2].cnt = new_cnt + i + 2;  \
+			ring[idx + 3].ptr = obj_table[i + 3]; \
+			ring[idx + 3].cnt = new_cnt + i + 3;  \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx].cnt = new_cnt + i; \
+			ring[idx++].ptr = obj_table[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) { \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+		for (idx = 0; i < n; i++, idx++) {    \
+			ring[idx].cnt = new_cnt + i;  \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+	} \
+} while (0)
+
 /* the actual copy of pointers on the ring to obj_table.
  * Placed here since identical code needed in both
  * single and multi consumer dequeue functions */
@@ -315,6 +388,45 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual copy of pointers on the ring to obj_table.
+ * Placed here since identical code needed in both
+ * single and multi consumer non-blocking dequeue functions.
+ */
+#define DEQUEUE_PTRS_NB(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	size_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	struct nb_ring_entry *ring = (struct nb_ring_entry *)ring_start; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
+			obj_table[i] = ring[idx].ptr; \
+			obj_table[i + 1] = ring[idx + 1].ptr; \
+			obj_table[i + 2] = ring[idx + 2].ptr; \
+			obj_table[i + 3] = ring[idx + 3].ptr; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 2: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 1: \
+			obj_table[i++] = ring[idx++].ptr; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+	} \
+} while (0)
+
+
+/* @internal 128-bit structure used by the non-blocking ring */
+struct nb_ring_entry {
+	void *ptr; /**< Data pointer */
+	uint64_t cnt; /**< Modification counter */
+};
+
 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
  * There are 2 choices for the users
@@ -331,6 +443,70 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #endif
 
 /**
+ * @internal Enqueue several objects on the non-blocking ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue(struct rte_ring *r, void * const *obj_table,
+			 unsigned int n, enum rte_ring_queue_behavior behavior,
+			 unsigned int is_sp, unsigned int *free_space)
+{
+	if (is_sp)
+		return __rte_ring_do_nb_enqueue_sp(r, obj_table, n,
+						   behavior, free_space);
+	else
+		return __rte_ring_do_nb_enqueue_mp(r, obj_table, n,
+						   behavior, free_space);
+}
+
+/**
+ * @internal Dequeue several objects from the non-blocking ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue(struct rte_ring *r, void **obj_table,
+		 unsigned int n, enum rte_ring_queue_behavior behavior,
+		 unsigned int is_sc, unsigned int *available)
+{
+	if (is_sc)
+		return __rte_ring_do_nb_dequeue_sc(r, obj_table, n,
+						   behavior, available);
+	else
+		return __rte_ring_do_nb_dequeue_mc(r, obj_table, n,
+						   behavior, available);
+}
+
+/**
  * @internal Enqueue several objects on the ring
  *
   * @param r
@@ -437,8 +613,14 @@ static __rte_always_inline unsigned int
 rte_ring_mp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MP,
+					     free_space);
 }
 
 /**
@@ -460,8 +642,14 @@ static __rte_always_inline unsigned int
 rte_ring_sp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SP,
+					     free_space);
 }
 
 /**
@@ -487,8 +675,14 @@ static __rte_always_inline unsigned int
 rte_ring_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->prod_64.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -571,8 +765,14 @@ static __rte_always_inline unsigned int
 rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MC,
+					     available);
 }
 
 /**
@@ -595,8 +795,14 @@ static __rte_always_inline unsigned int
 rte_ring_sc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SC,
+					     available);
 }
 
 /**
@@ -622,8 +828,14 @@ static __rte_always_inline unsigned int
 rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
 		unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-				r->cons.single, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->cons_64.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->cons.single, available);
 }
 
 /**
@@ -698,9 +910,13 @@ rte_ring_dequeue(struct rte_ring *r, void **obj_p)
 static inline unsigned
 rte_ring_count(const struct rte_ring *r)
 {
-	uint32_t prod_tail = r->prod.tail;
-	uint32_t cons_tail = r->cons.tail;
-	uint32_t count = (prod_tail - cons_tail) & r->mask;
+	uint32_t count;
+
+	if (r->flags & RING_F_NB)
+		count = (r->prod_64.tail - r->cons_64.tail) & r->mask;
+	else
+		count = (r->prod.tail - r->cons.tail) & r->mask;
+
 	return (count > r->capacity) ? r->capacity : count;
 }
 
@@ -820,8 +1036,14 @@ static __rte_always_inline unsigned
 rte_ring_mp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MP, free_space);
 }
 
 /**
@@ -843,8 +1065,14 @@ static __rte_always_inline unsigned
 rte_ring_sp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SP, free_space);
 }
 
 /**
@@ -870,8 +1098,14 @@ static __rte_always_inline unsigned
 rte_ring_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_VARIABLE,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->prod_64.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -898,8 +1132,14 @@ static __rte_always_inline unsigned
 rte_ring_mc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MC, available);
 }
 
 /**
@@ -923,8 +1163,14 @@ static __rte_always_inline unsigned
 rte_ring_sc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SC, available);
 }
 
 /**
@@ -950,9 +1196,14 @@ static __rte_always_inline unsigned
 rte_ring_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-				RTE_RING_QUEUE_VARIABLE,
-				r->cons.single, available);
+	if (r->flags & RING_F_NB)
+		return __rte_ring_do_nb_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->cons_64.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->cons.single, available);
 }
 
 #ifdef __cplusplus
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 47acd4c7c..7f83a5dc9 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -221,8 +221,8 @@ __rte_ring_move_prod_head_64(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
+		/* load-acquire synchronize with store-release of tail in
+		 * do_nb_dequeue_{sc, mc}.
 		 */
 		cons_tail = __atomic_load_n(&r->cons_64.tail,
 					__ATOMIC_ACQUIRE);
@@ -252,6 +252,7 @@ __rte_ring_move_prod_head_64(struct rte_ring *r, unsigned int is_sp,
 					0, __ATOMIC_RELAXED,
 					__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+
 	return n;
 }
 
@@ -298,8 +299,8 @@ __rte_ring_move_cons_head_64(struct rte_ring *r, unsigned int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
+		/* load-acquire synchronize with store-release of tail in
+		 * do_nb_enqueue_{sp, mp}.
 		 */
 		prod_tail = __atomic_load_n(&r->prod_64.tail,
 					__ATOMIC_ACQUIRE);
@@ -328,6 +329,279 @@ __rte_ring_move_cons_head_64(struct rte_ring *r, unsigned int is_sc,
 							0, __ATOMIC_RELAXED,
 							__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	uint64_t head, next;
+
+	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
+					 &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	__atomic_store_n(&r->prod_64.tail,
+			 r->prod_64.tail + n,
+			 __ATOMIC_RELEASE);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+#ifndef ALLOW_EXPERIMENTAL_API
+	printf("[%s()] RING_F_NB requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
+	       , __func__);
+#endif
+	return 0;
+#endif
+#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
+	uint64_t head, next, tail;
+	uint32_t free_entries;
+	unsigned int i;
+
+	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
+					 &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	tail = __atomic_load_n(&r->prod_64.tail, __ATOMIC_RELAXED);
+
+	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
+		struct nb_ring_entry old_value, new_value;
+		struct nb_ring_entry *base, *ring_ptr;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		base = (struct nb_ring_entry *)&r[1];
+
+		ring_ptr = &base[tail & r->mask];
+
+		old_value = *ring_ptr;
+
+		/* If the tail entry's modification counter doesn't match the
+		 * producer tail index, it's already been updated.
+		 *
+		 * Attempt to update the tail here, so this thread doesn't
+		 * depend on the forward progress of the thread that
+		 * successfully enqueued.
+		 */
+		if (old_value.cnt != tail) {
+			/* Use a release memmodel to ensure the tail entry is
+			 * visible to dequeueing threads before updating the
+			 * tail. (tail is updated on failure.)
+			 */
+			__atomic_compare_exchange_n(&r->prod_64.tail,
+						    &tail, tail + 1,
+						    0, __ATOMIC_RELEASE,
+						    __ATOMIC_RELAXED);
+			continue;
+		}
+
+		/* Prepare the new entry. The cnt field mitigates the ABA
+		 * problem on the ring write.
+		 */
+		new_value.ptr = obj_table[i];
+		new_value.cnt = tail + r->size;
+
+		if (rte_atomic128_cmpset((volatile rte_int128_t *)ring_ptr,
+					 (rte_int128_t *)&old_value,
+					 (rte_int128_t *)&new_value,
+					 0, RTE_ATOMIC_RELAXED,
+					 RTE_ATOMIC_RELAXED))
+			i++;
+
+		/* Use a release memmodel to ensure the tail entry is visible
+		 * to dequeueing threads before updating the tail. Every thread
+		 * attempts the cmpset, so they don't have to wait for the
+		 * thread that successfully enqueued to the ring. Using a
+		 * 64-bit tail mitigates the ABA problem here. (tail is updated
+		 * on failure.)
+		 */
+		__atomic_compare_exchange_n(&r->prod_64.tail,
+					    &tail, tail + 1,
+					    0, __ATOMIC_RELEASE,
+					    __ATOMIC_RELAXED);
+	}
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+#endif
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uint64_t head, next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head_64(r, 1, n, behavior,
+					 &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	__atomic_store_n(&r->cons_64.tail,
+			 r->cons_64.tail + n,
+			 __ATOMIC_RELEASE);
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uint64_t head, tail, next;
+	uint32_t entries;
+	int success;
+
+	n = __rte_ring_move_cons_head_64(r, 0, n, behavior,
+					 &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	/* The acquire-release synchronization on prod_64.tail ensures that
+	 * this thread correctly observes the ring entries up to prod_64.tail.
+	 * However by the time this thread reads cons_64.tail, or if its CAS
+	 * fails, cons_64.tail may have passed the previously read value of
+	 * prod_64.tail. Acquire-release synchronization on cons_64.tail is
+	 * necessary to ensure that dequeue threads always observe the correct
+	 * values of the ring entries.
+	 */
+	tail = __atomic_load_n(&r->cons_64.tail, __ATOMIC_ACQUIRE);
+	do {
+		/* Dequeue from the cons tail onwards. If multiple threads read
+		 * the same pointers, the thread that successfully performs the
+		 * CAS will keep them and the other(s) will retry.
+		 */
+		DEQUEUE_PTRS_NB(r, &r[1], tail, obj_table, n);
+
+		next = tail + n;
+
+		/* There is potential for the ABA problem here, but that is
+		 * mitigated by the large (64-bit) tail. Use a release memmodel
+		 * to ensure the dequeue operations and CAS are properly
+		 * ordered. (tail is updated on failure.)
+		 */
+		success = __atomic_compare_exchange_n(&r->cons_64.tail,
+						      &tail, next,
+						      0, __ATOMIC_RELEASE,
+						      __ATOMIC_ACQUIRE);
+	} while (success == 0);
+
+end:
+	if (available != NULL)
+		*available = entries - n;
 	return n;
 }
 
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 2158e092a..87c9a09ce 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -306,4 +306,273 @@ __rte_ring_move_cons_head_64(struct rte_ring *r, unsigned int is_sc,
 	return n;
 }
 
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	uint64_t head, next;
+
+	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
+					 &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	rte_smp_wmb();
+
+	r->prod_64.tail += n;
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the non-blocking ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(RTE_ARCH_X86_64) || !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+#ifndef ALLOW_EXPERIMENTAL_API
+	printf("[%s()] RING_F_NB requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
+	       , __func__);
+#endif
+	return 0;
+#endif
+#if defined(RTE_ARCH_X86_64) && defined(ALLOW_EXPERIMENTAL_API)
+	uint64_t head, next, tail;
+	uint32_t free_entries;
+	unsigned int i;
+
+	n = __rte_ring_move_prod_head_64(r, 0, n, behavior,
+					 &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	for (i = 0; i < n; /* i incremented if enqueue succeeds */) {
+		struct nb_ring_entry old_value, new_value;
+		struct nb_ring_entry *base, *ring_ptr;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		tail = r->prod_64.tail;
+
+		base = (struct nb_ring_entry *)&r[1];
+
+		ring_ptr = &base[tail & r->mask];
+
+		old_value = *ring_ptr;
+
+		/* If the tail entry's modification counter doesn't match the
+		 * producer tail index, it's already been updated.
+		 *
+		 * Attempt to update the tail here, so this thread doesn't
+		 * depend on the forward progress of the thread that
+		 * successfully enqueued.
+		 */
+		if (old_value.cnt != tail) {
+			/* Ensure the tail entry is visible to dequeueing
+			 * threads before updating the tail.
+			 */
+			rte_smp_wmb();
+
+			rte_atomic64_cmpset(&r->prod_64.tail, tail, tail + 1);
+			continue;
+		}
+
+		/* Prepare the new entry. The cnt field mitigates the ABA
+		 * problem on the ring write.
+		 */
+		new_value.ptr = obj_table[i];
+		new_value.cnt = tail + r->size;
+
+		if (rte_atomic128_cmpset((volatile rte_int128_t *)ring_ptr,
+					 (rte_int128_t *)&old_value,
+					 (rte_int128_t *)&new_value,
+					 0, RTE_ATOMIC_RELAXED,
+					 RTE_ATOMIC_RELAXED))
+			i++;
+
+		/* Ensure the tail entry is visible to dequeueing threads
+		 * before updating the tail.
+		 */
+		rte_smp_wmb();
+
+		/* Every thread attempts the cmpset, so they don't have to wait
+		 * for the thread that successfully enqueued to the ring.
+		 * Using a 64-bit tail mitigates the ABA problem here.
+		 */
+		rte_atomic64_cmpset(&r->prod_64.tail, tail, tail + 1);
+	}
+
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+#endif
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uint64_t head, next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head_64(r, 1, n, behavior,
+					 &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
+
+	rte_smp_rmb();
+
+	r->cons_64.tail += n;
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the non-blocking ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_nb_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uint64_t head, next;
+	uint32_t entries;
+	int success;
+
+	n = __rte_ring_move_cons_head_64(r, 0, n, behavior,
+					 &head, &next, &entries);
+	if (n == 0)
+		goto end;
+
+	do {
+		uint64_t tail = r->cons_64.tail;
+
+		/* Ensure that the correct ring entry values are read by this
+		 * thread.
+		 */
+		rte_smp_rmb();
+
+		/* Dequeue from the cons tail onwards. If multiple threads read
+		 * the same pointers, the thread that successfully performs the
+		 * CAS will keep them and the other(s) will retry.
+		 */
+		DEQUEUE_PTRS_NB(r, &r[1], tail, obj_table, n);
+
+		next = tail + n;
+
+		/* Ensure the dequeue operations and CAS are properly
+		 * ordered.
+		 */
+		rte_smp_rmb();
+
+		/* There is potential for the ABA problem here, but that is
+		 * mitigated by the large (64-bit) tail.
+		 */
+		success = rte_atomic64_cmpset(&r->cons_64.tail, tail, next);
+	} while (success == 0);
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
 #endif /* _RTE_RING_GENERIC_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index d935efd0d..8969467af 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -17,3 +17,10 @@ DPDK_2.2 {
 	rte_ring_free;
 
 } DPDK_2.0;
+
+DPDK_19.05 {
+	global:
+
+	rte_ring_get_memsize;
+
+} DPDK_2.2;
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread
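The multi-producer enqueue in the patch above relies on two details that are easy to miss: a slot is only written while its modification counter still equals the producer tail, and the compare-exchange builtin overwrites its "expected" argument on failure (the "tail is updated on failure" remark in the comments). The following is a minimal single-file model of that loop, not the patch's code. It assumes a 32-bit payload and a 32-bit counter packed into one 64-bit word so that an ordinary 64-bit CAS can stand in for the 128-bit CAS, it omits head reservation entirely, and every model_* name is hypothetical. It uses the GCC/Clang __atomic builtins, as the patch does.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_RING_SIZE 8u /* hypothetical size; must be a power of two */
#define MODEL_RING_MASK (MODEL_RING_SIZE - 1u)

/* Each slot packs a 32-bit payload (low half) and a 32-bit modification
 * counter (high half), so a plain 64-bit CAS can stand in for the 128-bit
 * CAS on {ptr, cnt} used by the patch. This is a simplification.
 */
static uint64_t model_slots[MODEL_RING_SIZE];
static uint64_t model_prod_tail;

static uint64_t model_pack(uint32_t val, uint32_t cnt)
{
	return ((uint64_t)cnt << 32) | val;
}

static uint32_t model_cnt(uint64_t entry) { return (uint32_t)(entry >> 32); }
static uint32_t model_val(uint64_t entry) { return (uint32_t)entry; }

/* Slot i starts with cnt == i, i.e. "writable when the tail equals i". */
static void model_init(void)
{
	uint32_t i;

	assert((MODEL_RING_SIZE & MODEL_RING_MASK) == 0);
	for (i = 0; i < MODEL_RING_SIZE; i++)
		model_slots[i] = model_pack(0, i);
	model_prod_tail = 0;
}

/* Enqueue one value. Head reservation (free-space accounting) is omitted,
 * so the caller must not enqueue more than MODEL_RING_SIZE values.
 */
static void model_enqueue(uint32_t val)
{
	uint64_t tail = __atomic_load_n(&model_prod_tail, __ATOMIC_RELAXED);

	for (;;) {
		uint64_t *slot = &model_slots[tail & MODEL_RING_MASK];
		uint64_t old = __atomic_load_n(slot, __ATOMIC_RELAXED);
		int written = 0;

		/* Only write if the slot still carries the counter that
		 * matches this tail; otherwise another producer filled it.
		 */
		if (model_cnt(old) == (uint32_t)tail) {
			uint64_t desired = model_pack(val,
					(uint32_t)tail + MODEL_RING_SIZE);

			written = __atomic_compare_exchange_n(slot, &old,
					desired, 0, __ATOMIC_RELAXED,
					__ATOMIC_RELAXED);
		}

		/* Every caller tries to advance the tail, whether or not it
		 * wrote the slot. On failure the builtin stores the current
		 * tail into 'tail', so the next pass uses a fresh value.
		 */
		__atomic_compare_exchange_n(&model_prod_tail, &tail, tail + 1,
					    0, __ATOMIC_RELEASE,
					    __ATOMIC_RELAXED);
		if (written)
			return;
	}
}
```

After model_init(), two calls to model_enqueue() leave the tail at 2 and slot 0 tagged with counter MODEL_RING_SIZE, which is the value the tail will have the next time that slot becomes writable. A CAS on the tail with a stale expected value fails and returns the current tail in the expected variable.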

* [dpdk-dev] [PATCH v4 3/5] test_ring: add non-blocking ring autotest
  2019-01-28 18:14     ` [dpdk-dev] [PATCH v4 " Gage Eads
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 1/5] ring: add 64-bit headtail structure Gage Eads
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 2/5] ring: add a non-blocking implementation Gage Eads
@ 2019-01-28 18:14       ` Gage Eads
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 4/5] test_ring_perf: add non-blocking ring perf test Gage Eads
                         ` (2 subsequent siblings)
  5 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-28 18:14 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

ring_nb_autotest re-uses the ring_autotest code: the top-level test function
now takes a 'flags' argument, and thin wrappers call it with 0 or RING_F_NB.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring.c | 57 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 36 insertions(+), 21 deletions(-)

diff --git a/test/test/test_ring.c b/test/test/test_ring.c
index aaf1e70ad..ff410d978 100644
--- a/test/test/test_ring.c
+++ b/test/test/test_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <string.h>
@@ -601,18 +601,20 @@ test_ring_burst_basic(struct rte_ring *r)
  * it will always fail to create ring with a wrong ring size number in this function
  */
 static int
-test_ring_creation_with_wrong_size(void)
+test_ring_creation_with_wrong_size(unsigned int flags)
 {
 	struct rte_ring * rp = NULL;
 
 	/* Test if ring size is not power of 2 */
-	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1,
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
 
 	/* Test if ring size is exceeding the limit */
-	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1), SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1),
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
@@ -623,11 +625,11 @@ test_ring_creation_with_wrong_size(void)
  * it tests if it would always fail to create ring with an used ring name
  */
 static int
-test_ring_creation_with_an_used_name(void)
+test_ring_creation_with_an_used_name(unsigned int flags)
 {
 	struct rte_ring * rp;
 
-	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (NULL != rp)
 		return -1;
 
@@ -639,10 +641,10 @@ test_ring_creation_with_an_used_name(void)
  * function to fail correctly
  */
 static int
-test_create_count_odd(void)
+test_create_count_odd(unsigned int flags)
 {
 	struct rte_ring *r = rte_ring_create("test_ring_count",
-			4097, SOCKET_ID_ANY, 0 );
+			4097, SOCKET_ID_ANY, flags);
 	if(r != NULL){
 		return -1;
 	}
@@ -665,7 +667,7 @@ test_lookup_null(void)
  * it tests some more basic ring operations
  */
 static int
-test_ring_basic_ex(void)
+test_ring_basic_ex(unsigned int flags)
 {
 	int ret = -1;
 	unsigned i;
@@ -679,7 +681,7 @@ test_ring_basic_ex(void)
 	}
 
 	rp = rte_ring_create("test_ring_basic_ex", RING_SIZE, SOCKET_ID_ANY,
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (rp == NULL) {
 		printf("test_ring_basic_ex fail to create ring\n");
 		goto fail_test;
@@ -737,7 +739,7 @@ test_ring_basic_ex(void)
 }
 
 static int
-test_ring_with_exact_size(void)
+test_ring_with_exact_size(unsigned int flags)
 {
 	struct rte_ring *std_ring = NULL, *exact_sz_ring = NULL;
 	void *ptr_array[16];
@@ -746,13 +748,13 @@ test_ring_with_exact_size(void)
 	int ret = -1;
 
 	std_ring = rte_ring_create("std", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (std_ring == NULL) {
 		printf("%s: error, can't create std ring\n", __func__);
 		goto end;
 	}
 	exact_sz_ring = rte_ring_create("exact sz", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ);
+		RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ | flags);
 	if (exact_sz_ring == NULL) {
 		printf("%s: error, can't create exact size ring\n", __func__);
 		goto end;
@@ -808,17 +810,17 @@ test_ring_with_exact_size(void)
 }
 
 static int
-test_ring(void)
+__test_ring(unsigned int flags)
 {
 	struct rte_ring *r = NULL;
 
 	/* some more basic operations */
-	if (test_ring_basic_ex() < 0)
+	if (test_ring_basic_ex(flags) < 0)
 		goto test_fail;
 
 	rte_atomic32_init(&synchro);
 
-	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (r == NULL)
 		goto test_fail;
 
@@ -837,27 +839,27 @@ test_ring(void)
 		goto test_fail;
 
 	/* basic operations */
-	if ( test_create_count_odd() < 0){
+	if (test_create_count_odd(flags) < 0) {
 		printf("Test failed to detect odd count\n");
 		goto test_fail;
 	} else
 		printf("Test detected odd count\n");
 
-	if ( test_lookup_null() < 0){
+	if (test_lookup_null() < 0) {
 		printf("Test failed to detect NULL ring lookup\n");
 		goto test_fail;
 	} else
 		printf("Test detected NULL ring lookup\n");
 
 	/* test of creating ring with wrong size */
-	if (test_ring_creation_with_wrong_size() < 0)
+	if (test_ring_creation_with_wrong_size(flags) < 0)
 		goto test_fail;
 
 	/* test of creation ring with an used name */
-	if (test_ring_creation_with_an_used_name() < 0)
+	if (test_ring_creation_with_an_used_name(flags) < 0)
 		goto test_fail;
 
-	if (test_ring_with_exact_size() < 0)
+	if (test_ring_with_exact_size(flags) < 0)
 		goto test_fail;
 
 	/* dump the ring status */
@@ -873,4 +875,17 @@ test_ring(void)
 	return -1;
 }
 
+static int
+test_ring(void)
+{
+	return __test_ring(0);
+}
+
+static int
+test_nb_ring(void)
+{
+	return __test_ring(RING_F_NB);
+}
+
 REGISTER_TEST_COMMAND(ring_autotest, test_ring);
+REGISTER_TEST_COMMAND(ring_nb_autotest, test_nb_ring);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v4 4/5] test_ring_perf: add non-blocking ring perf test
  2019-01-28 18:14     ` [dpdk-dev] [PATCH v4 " Gage Eads
                         ` (2 preceding siblings ...)
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 3/5] test_ring: add non-blocking ring autotest Gage Eads
@ 2019-01-28 18:14       ` Gage Eads
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 5/5] mempool/ring: add non-blocking ring handlers Gage Eads
  2019-03-05 17:40       ` [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
  5 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-28 18:14 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

ring_nb_perf_autotest re-uses the ring_perf_autotest code: the top-level test
function now takes a 'flags' argument, and thin wrappers call it with 0 or
RING_F_NB.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring_perf.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/test/test/test_ring_perf.c b/test/test/test_ring_perf.c
index ebb3939f5..380c4b4a1 100644
--- a/test/test/test_ring_perf.c
+++ b/test/test/test_ring_perf.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 
@@ -363,12 +363,12 @@ test_bulk_enqueue_dequeue(struct rte_ring *r)
 }
 
 static int
-test_ring_perf(void)
+__test_ring_perf(unsigned int flags)
 {
 	struct lcore_pair cores;
 	struct rte_ring *r = NULL;
 
-	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), 0);
+	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), flags);
 	if (r == NULL)
 		return -1;
 
@@ -398,4 +398,17 @@ test_ring_perf(void)
 	return 0;
 }
 
+static int
+test_ring_perf(void)
+{
+	return __test_ring_perf(0);
+}
+
+static int
+test_nb_ring_perf(void)
+{
+	return __test_ring_perf(RING_F_NB);
+}
+
 REGISTER_TEST_COMMAND(ring_perf_autotest, test_ring_perf);
+REGISTER_TEST_COMMAND(ring_nb_perf_autotest, test_nb_ring_perf);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v4 5/5] mempool/ring: add non-blocking ring handlers
  2019-01-28 18:14     ` [dpdk-dev] [PATCH v4 " Gage Eads
                         ` (3 preceding siblings ...)
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 4/5] test_ring_perf: add non-blocking ring perf test Gage Eads
@ 2019-01-28 18:14       ` Gage Eads
  2019-03-05 17:40       ` [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
  5 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-01-28 18:14 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

These handlers (ring_mp_mc_nb, ring_sp_sc_nb, ring_mp_sc_nb and ring_sp_mc_nb)
allow an application to create a mempool based on the non-blocking ring, with
any combination of single/multi producer/consumer.

Also, add a note to the "Known Issues" section of the programmer's guide.

Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
---
 doc/guides/prog_guide/env_abstraction_layer.rst |  5 +++
 drivers/mempool/ring/Makefile                   |  1 +
 drivers/mempool/ring/meson.build                |  2 +
 drivers/mempool/ring/rte_mempool_ring.c         | 58 +++++++++++++++++++++++--
 4 files changed, 63 insertions(+), 3 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 929d76dba..fcafd1cff 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,6 +541,11 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
+  Alternatively, x86_64 applications can use the non-blocking ring mempool handler. When considering it, note that:
+
+  - it is currently limited to the x86_64 platform, because it uses a function (16-byte compare-and-swap) that is not yet available on other platforms.
+  - it has worse average-case performance than the non-preemptive rte_ring, but software caching (e.g. the mempool cache) can mitigate this by reducing the number of handler operations.
+
 + rte_timer
 
   Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
diff --git a/drivers/mempool/ring/Makefile b/drivers/mempool/ring/Makefile
index ddab522fe..012ba6966 100644
--- a/drivers/mempool/ring/Makefile
+++ b/drivers/mempool/ring/Makefile
@@ -10,6 +10,7 @@ LIB = librte_mempool_ring.a
 
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal -lrte_mempool -lrte_ring
 
 EXPORT_MAP := rte_mempool_ring_version.map
diff --git a/drivers/mempool/ring/meson.build b/drivers/mempool/ring/meson.build
index a021e908c..b1cb673cc 100644
--- a/drivers/mempool/ring/meson.build
+++ b/drivers/mempool/ring/meson.build
@@ -1,4 +1,6 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
+allow_experimental_apis = true
+
 sources = files('rte_mempool_ring.c')
diff --git a/drivers/mempool/ring/rte_mempool_ring.c b/drivers/mempool/ring/rte_mempool_ring.c
index bc123fc52..013dac3bc 100644
--- a/drivers/mempool/ring/rte_mempool_ring.c
+++ b/drivers/mempool/ring/rte_mempool_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2016 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <stdio.h>
@@ -47,11 +47,11 @@ common_ring_get_count(const struct rte_mempool *mp)
 
 
 static int
-common_ring_alloc(struct rte_mempool *mp)
+__common_ring_alloc(struct rte_mempool *mp, int rg_flags)
 {
-	int rg_flags = 0, ret;
 	char rg_name[RTE_RING_NAMESIZE];
 	struct rte_ring *r;
+	int ret;
 
 	ret = snprintf(rg_name, sizeof(rg_name),
 		RTE_MEMPOOL_MZ_FORMAT, mp->name);
@@ -82,6 +82,18 @@ common_ring_alloc(struct rte_mempool *mp)
 	return 0;
 }
 
+static int
+common_ring_alloc(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, 0);
+}
+
+static int
+common_ring_alloc_nb(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, RING_F_NB);
+}
+
 static void
 common_ring_free(struct rte_mempool *mp)
 {
@@ -130,7 +142,47 @@ static const struct rte_mempool_ops ops_sp_mc = {
 	.get_count = common_ring_get_count,
 };
 
+static const struct rte_mempool_ops ops_mp_mc_nb = {
+	.name = "ring_mp_mc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_sc_nb = {
+	.name = "ring_sp_sc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_mp_sc_nb = {
+	.name = "ring_mp_sc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_mc_nb = {
+	.name = "ring_sp_mc_nb",
+	.alloc = common_ring_alloc_nb,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
 MEMPOOL_REGISTER_OPS(ops_mp_mc);
 MEMPOOL_REGISTER_OPS(ops_sp_sc);
 MEMPOOL_REGISTER_OPS(ops_mp_sc);
 MEMPOOL_REGISTER_OPS(ops_sp_mc);
+MEMPOOL_REGISTER_OPS(ops_mp_mc_nb);
+MEMPOOL_REGISTER_OPS(ops_sp_sc_nb);
+MEMPOOL_REGISTER_OPS(ops_mp_sc_nb);
+MEMPOOL_REGISTER_OPS(ops_sp_mc_nb);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread
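Both the patch comments and the review thread lean on the claim that a 64-bit counter makes ABA a non-issue in practice while a 32-bit counter does not. A rough worked calculation makes the gap concrete. This is only a sketch: the update rate of 10^9 per second is a deliberately pessimistic assumption rather than a measured figure, and seconds_to_wrap() and SECONDS_PER_YEAR are hypothetical names.

```c
#include <assert.h>

#define SECONDS_PER_YEAR (365.25 * 24.0 * 3600.0)

/* Seconds until a counter of the given width, incremented at a fixed rate,
 * returns to a previously used value: 2^bits / rate.
 */
static double seconds_to_wrap(unsigned int bits, double updates_per_sec)
{
	double span = 1.0;
	unsigned int i;

	assert(updates_per_sec > 0.0);
	for (i = 0; i < bits; i++)
		span *= 2.0;
	return span / updates_per_sec;
}
```

At the assumed 10^9 updates per second, seconds_to_wrap(32, 1e9) is about 4.3 seconds, whereas seconds_to_wrap(64, 1e9) / SECONDS_PER_YEAR is roughly 584 years. A thread would have to stall for a full wrap and resume on exactly the same value for the stale CAS to succeed, so the 32-bit case is unlikely rather than impossible, which matches the risk described in the thread.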

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-28 10:35               ` Ola Liljedahl
@ 2019-01-28 18:54                 ` Eads, Gage
  2019-01-28 22:31                   ` Ola Liljedahl
  0 siblings, 1 reply; 123+ messages in thread
From: Eads, Gage @ 2019-01-28 18:54 UTC (permalink / raw)
  To: Ola Liljedahl, jerinj, mczekaj, dev
  Cc: olivier.matz, stephen, nd, Richardson, Bruce, arybchenko,
	Ananyev, Konstantin



> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> Sent: Monday, January 28, 2019 4:36 AM
> To: jerinj@marvell.com; mczekaj@marvell.com; Eads, Gage
> <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> arybchenko@solarflare.com; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
> 
> On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> >
> > >
> > > -----Original Message-----
> > > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> > > Sent: Wednesday, January 23, 2019 4:16 AM
> > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> > > <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> > > arybchenko@solarflare.com; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > > implementation
> > >
> > > On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> > > >
> > > > Hi Ola,
> > > >
> > > > <snip>
> > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > > > > > rte_ring *r);
> > > > > >  #endif
> > > > > >  #include "rte_ring_generic_64.h"
> > > > > >
> > > > > > +/* @internal 128-bit structure used by the non-blocking ring
> > > > > > +*/ struct nb_ring_entry {
> > > > > > +	void *ptr; /**< Data pointer */
> > > > > > +	uint64_t cnt; /**< Modification counter */
> > > > > > Why not make 'cnt' uintptr_t? This way 32-bit architectures will
> > > > > > also be supported. I think there are some claims that DPDK still
> > > > > > supports e.g. ARMv7a and possibly also 32-bit x86?
> > > > I chose a 64-bit modification counter because (practically
> > > > speaking) the ABA problem will not occur with such a large counter
> > > > -- definitely not within my lifetime. See the "Discussion" section
> > > > of the commit message for more information.
> > > >
> > > > With a 32-bit counter, there is a very (very) low likelihood of
> > > > it, but it is possible. Personally, I don't feel comfortable
> > > > providing such code, because a) I doubt all users would understand
> > > > the implementation well enough to do the risk/reward analysis, and
> > > > b) such a bug would be near impossible to reproduce and root-cause
> > > > if it did occur.
> > > > With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g.
> > > > ARMv7a and probably x86 as well) won't be able to support this as they
> > > > at best support 64-bit CAS (ARMv7a has LDREXD/STREXD). So you are
> > > > essentially putting a 64-bit (and 128-bit CAS) requirement on the
> > > > implementation.
> > >
> > Yes, I am. I tried to make that clear in the cover letter.
> >
> > >
> > > >
> > > >
> > > > >
> > > > >
> > > > >
> > > > > >
> > > > > >
> > > > > > +};
> > > > > > +
> > > > > > +/* The non-blocking ring algorithm is based on the original
> > > > > > +rte ring (derived
> > > > > > + * from FreeBSD's bufring.h) and inspired by Michael and
> > > > > > +Scott's non-blocking
> > > > > > + * concurrent queue.
> > > > > > + */
> > > > > > +
> > > > > > +/**
> > > > > > + * @internal
> > > > > > + *   Enqueue several objects on the non-blocking ring
> > > > > > +(single-producer only)
> > > > > > + *
> > > > > > + * @param r
> > > > > > + *   A pointer to the ring structure.
> > > > > > + * @param obj_table
> > > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > > + * @param n
> > > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > > + * @param behavior
> > > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items
> > > > > > +to the ring
> > > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as
> > > > > > +possible to the ring
> > > > > > + * @param free_space
> > > > > > + *   returns the amount of space after the enqueue operation
> > > > > > +has finished
> > > > > > + * @return
> > > > > > + *   Actual number of objects enqueued.
> > > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > > + */
> > > > > > +static __rte_always_inline unsigned int
> > > > > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const
> > > > > > *obj_table,
> > > > > > +			    unsigned int n,
> > > > > > +			    enum rte_ring_queue_behavior behavior,
> > > > > > +			    unsigned int *free_space) {
> > > > > > +	uint32_t free_entries;
> > > > > > +	size_t head, next;
> > > > > > +
> > > > > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > > > > +					 &head, &next,
> > > > > > &free_entries);
> > > > > > +	if (n == 0)
> > > > > > +		goto end;
> > > > > > +
> > > > > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > > > > +
> > > > > > +	r->prod_64.tail += n;
> > > > > Don't we need release order when (or smp_wmb between) writing of
> > > > > the ring pointers and the update of tail? By updating the tail
> > > > > pointer, we are synchronising with a consumer.
> > > > >
> > > > > I prefer using __atomic operations even for load and store. You
> > > > > can see which parts of the code that synchronise with each other, e.g.
> > > > > store-release to some location synchronises with load-acquire
> > > > > from the same location. If you don't know how different threads
> > > > > synchronise with each other, you are very likely to make mistakes.
> > > > >
> > > > You can tell this code was written when I thought x86-64 was the
> > > > only viable target :). Yes, you are correct.
> > > >
> > > > With regards to using __atomic intrinsics, I'm planning on taking
> > > > a similar approach to the functions duplicated in
> > > > rte_ring_generic.h and
> > > > rte_ring_c11_mem.h: one version that uses rte_atomic functions
> > > > (and thus stricter memory ordering) and one that uses __atomic
> > > > intrinsics (and thus can benefit from more relaxed memory ordering).
> From a code point of view, I strongly prefer the atomic operations to be visible
> in the top level code, not hidden in subroutines. For correctness, it is vital that
> memory accesses are performed with the required ordering and that acquire and
> release matches up. Hiding e.g. load-acquire and store-release in subroutines (in
> a different file!) make this difficult. There have already been such bugs found in
> rte_ring.
> 

After working on the acq/rel ordering this weekend, I agree. This'll be easier/cleaner if we end up only using the C11 version.
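
For reference, here is a minimal sketch of what "atomics visible at the top level" could look like for a single-producer/single-consumer ring. It is not the patch code: demo_ring, demo_enqueue_sp and demo_dequeue_sc are hypothetical names, the ring size is an arbitrary power of two, and it assumes the GCC __atomic builtins. The point is only that each store-release sits next to a comment naming the load-acquire it pairs with.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u /* hypothetical; must be a power of two */

struct demo_ring {
	uint64_t prod_tail; /* written by the producer only */
	uint64_t cons_tail; /* written by the consumer only */
	void *slots[DEMO_RING_SIZE];
};

/* Single-producer enqueue of one pointer; returns 1 on success, 0 if full. */
static int
demo_enqueue_sp(struct demo_ring *r, void *obj)
{
	uint64_t tail = __atomic_load_n(&r->prod_tail, __ATOMIC_RELAXED);
	/* load-acquire: pairs with the store-release of cons_tail below */
	uint64_t cons = __atomic_load_n(&r->cons_tail, __ATOMIC_ACQUIRE);

	if (tail - cons == DEMO_RING_SIZE)
		return 0;

	r->slots[tail & (DEMO_RING_SIZE - 1)] = obj;

	/* store-release: the slot write is visible before the new tail */
	__atomic_store_n(&r->prod_tail, tail + 1, __ATOMIC_RELEASE);
	return 1;
}

/* Single-consumer dequeue of one pointer; returns 1 on success, 0 if empty. */
static int
demo_dequeue_sc(struct demo_ring *r, void **obj)
{
	uint64_t head = __atomic_load_n(&r->cons_tail, __ATOMIC_RELAXED);
	/* load-acquire: pairs with the store-release of prod_tail above */
	uint64_t prod = __atomic_load_n(&r->prod_tail, __ATOMIC_ACQUIRE);

	if (prod == head)
		return 0;

	*obj = r->slots[head & (DEMO_RING_SIZE - 1)];

	/* store-release: the slot read completes before the slot is freed */
	__atomic_store_n(&r->cons_tail, head + 1, __ATOMIC_RELEASE);
	return 1;
}
```

With everything in one place, a reviewer can check each acquire/release pair without opening a second header.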

> > > What's the advantage of having two different implementations? What
> > > is the disadvantage?
> > >
> > > The existing ring buffer code originally had only the "legacy"
> > > implementation
> > > which was kept when the __atomic implementation was added. The
> > > reason claimed was that some older compilers for x86 do not support
> > > GCC __atomic builtins. But I thought there was consensus that new
> > > functionality could have only __atomic implementations.
> > >
> > When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left
> > disabled for thunderx[1] for performance reasons. Assuming that hasn't
> > changed, the advantage to having two versions is to best support all of DPDK's
> platforms.
> > The disadvantage is of course duplicated code and the additional
> > maintenance burden.
> The only way I see that a C11 memory model implementation can be slower
> than using smp_wmb/rmb is if you need to order loads before a synchronizing
> store and there are also outstanding stores which do not require ordering.
> smp_rmb() handles this while store-release will also (unnecessarily) order those
> outstanding stores. This situation occurs e.g. in ring buffer dequeue operations
> where ring slots are read (and possibly written to thread-private memory) before
> the ring slots are release (e.g. using CAS-release or store-release).
> 
> I imagine that the LSU/cache subsystem on ThunderX/OCTEON-TX also have
> something to do with this problem. If there are a large amounts of stores
> pending in the load/store unit, store-release might have to wait for a long time
> before the synchronizing store can complete.
> 
> >
> > That said, if the thunderx maintainers are ok with it, I'm certainly
> > open to only doing the __atomic version. Note that even in the
> > __atomic version, based on Honnapa's findings[2], using a DPDK-defined
> > rte_atomic128_cmpset() (with additional arguments to support machines
> > with weak consistency) appears to be a better option than
> __atomic_compare_exchange_16.
> __atomic_compare_exchange_16() is not guaranteed to be lock-free. It is not
> lock-free on ARM/AArch64 and the support in GCC is formally broken (can't use
> cmpxchg16b to implement __atomic_load_16).
> 
> So yes, I think DPDK will have to define and implement the 128-bit atomic
> compare and exchange operation (whatever it will be called). For compatibility
> with ARMv8.0, we can't require the "old" value returned by a failed
> compare-exchange operation to be read atomically (LDXP does not guarantee atomicity
> by itself). But this is seldom a problem, many designs read the memory location
> using two separate 64-bit loads (so not atomic) anyway, it is a successful atomic
> compare exchange operation which provides atomicity.
> 

Ok. I agree, I don't expect that to be a problem. The 128-bit CAS patch I just submitted[1] (which was developed before reading this) will have to be changed.

[1] http://mails.dpdk.org/archives/dev/2019-January/124159.html

Thanks,
Gage

</snip>

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-28 13:34               ` Jerin Jacob Kollanukkaran
  2019-01-28 13:43                 ` Ola Liljedahl
@ 2019-01-28 18:59                 ` Eads, Gage
  1 sibling, 0 replies; 123+ messages in thread
From: Eads, Gage @ 2019-01-28 18:59 UTC (permalink / raw)
  To: Jerin Jacob Kollanukkaran, Ola.Liljedahl, Maciej Czekaj, dev
  Cc: olivier.matz, stephen, nd, Richardson, Bruce, arybchenko,
	Ananyev, Konstantin



> -----Original Message-----
> From: Jerin Jacob Kollanukkaran [mailto:jerinj@marvell.com]
> Sent: Monday, January 28, 2019 7:34 AM
> To: Ola.Liljedahl@arm.com; Maciej Czekaj <mczekaj@marvell.com>; Eads, Gage
> <gage.eads@intel.com>; dev@dpdk.org
> Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd@arm.com;
> Richardson, Bruce <bruce.richardson@intel.com>; arybchenko@solarflare.com;
> Ananyev, Konstantin <konstantin.ananyev@intel.com>
> Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
> 
> On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> > > -----Original Message-----
> > > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> > > Sent: Wednesday, January 23, 2019 4:16 AM
> > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> > > <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> > > arybchenko@solarflare.com; Ananyev, Konstantin
> > > <konstantin.ananyev@intel.com>
> > > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > > implementation
> > >
> > > s.
> > > > >
> > > > You can tell this code was written when I thought x86-64 was the
> > > > only viable target :). Yes, you are correct.
> > > >
> > > > With regards to using __atomic intrinsics, I'm planning on taking
> > > > a similar approach to the functions duplicated in
> > > > rte_ring_generic.h and
> > > > rte_ring_c11_mem.h: one version that uses rte_atomic functions
> > > > (and thus stricter memory ordering) and one that uses __atomic
> > > > intrinsics (and thus can benefit from more relaxed memory
> > > > ordering).
> > > What's the advantage of having two different implementations? What
> > > is the disadvantage?
> > >
> > > The existing ring buffer code originally had only the "legacy"
> > > implementation
> > > which was kept when the __atomic implementation was added. The
> > > reason claimed was that some older compilers for x86 do not support
> > > GCC __atomic builtins. But I thought there was consensus that new
> > > functionality could have only __atomic implementations.
> > >
> >
> > When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left
> > disabled for thunderx[1] for performance reasons. Assuming that hasn't
> > changed, the advantage to having two versions is to best support all
> > of DPDK's platforms. The disadvantage is of course duplicated code and
> > the additional maintenance burden.
> >
> > That said, if the thunderx maintainers are ok with it, I'm certainly
> 
> The ring code was so fundamental building block for DPDK, there was difference
> in performance and there was already legacy code so introducing
> C11_MEM_MODEL was justified IMO.
> 
> For the nonblocking implementation, I am happy to test with three ARM64
> microarchitectures and share the result with C11_MEM_MODEL vs non
> C11_MEM_MODEL performance. We may need to consider PPC also here. So
> IMO, based on the overall performance result may be can decide the new code
> direction.

Appreciate the help. Please hold off any testing until we've had a chance to incorporate ideas from lfring, which will definitely affect performance.

^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking implementation
  2019-01-28 18:54                 ` Eads, Gage
@ 2019-01-28 22:31                   ` Ola Liljedahl
  0 siblings, 0 replies; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-28 22:31 UTC (permalink / raw)
  To: jerinj, mczekaj, gage.eads, dev
  Cc: olivier.matz, stephen, nd, bruce.richardson, arybchenko,
	konstantin.ananyev

On Mon, 2019-01-28 at 18:54 +0000, Eads, Gage wrote:
> 
> > 
> > -----Original Message-----
> > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> > Sent: Monday, January 28, 2019 4:36 AM
> > To: jerinj@marvell.com; mczekaj@marvell.com; Eads, Gage
> > <gage.eads@intel.com>; dev@dpdk.org
> > Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> > <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> > arybchenko@solarflare.com; Ananyev, Konstantin
> > <konstantin.ananyev@intel.com>
> > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > implementation
> > 
> > On Fri, 2019-01-25 at 17:21 +0000, Eads, Gage wrote:
> > > 
> > > 
> > > > 
> > > > 
> > > > -----Original Message-----
> > > > From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> > > > Sent: Wednesday, January 23, 2019 4:16 AM
> > > > To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> > > > Cc: olivier.matz@6wind.com; stephen@networkplumber.org; nd
> > > > <nd@arm.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> > > > arybchenko@solarflare.com; Ananyev, Konstantin
> > > > <konstantin.ananyev@intel.com>
> > > > Subject: Re: [dpdk-dev] [PATCH v3 2/5] ring: add a non-blocking
> > > > implementation
> > > > 
> > > > On Tue, 2019-01-22 at 21:31 +0000, Eads, Gage wrote:
> > > > > 
> > > > > 
> > > > > Hi Ola,
> > > > > 
> > > > > <snip>
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > @@ -331,6 +433,319 @@ void rte_ring_dump(FILE *f, const struct
> > > > > > > rte_ring *r);
> > > > > > >  #endif
> > > > > > >  #include "rte_ring_generic_64.h"
> > > > > > > 
> > > > > > > +/* @internal 128-bit structure used by the non-blocking ring
> > > > > > > +*/ struct nb_ring_entry {
> > > > > > > +	void *ptr; /**< Data pointer */
> > > > > > > +	uint64_t cnt; /**< Modification counter */
> > > > > > Why not make 'cnt' uintptr_t? This way 32-bit architectures will
> > > > > > also be supported. I think there are some claims that DPDK still
> > > > > > supports e.g.
> > > > > > ARMv7a
> > > > > > and possibly also 32-bit x86?
> > > > > I chose a 64-bit modification counter because (practically
> > > > > speaking) the ABA problem will not occur with such a large counter
> > > > > -- definitely not within my lifetime. See the "Discussion" section
> > > > > of the commit message for more information.
> > > > > 
> > > > > With a 32-bit counter, there is a very (very) low likelihood of
> > > > > it, but it is possible. Personally, I don't feel comfortable
> > > > > providing such code, because a) I doubt all users would understand
> > > > > the implementation well enough to do the risk/reward analysis, and
> > > > > b) such a bug would be near impossible to reproduce and root-cause
> > > > > if it did occur.
> > > > With a 64-bit counter (and 32-bit pointer), 32-bit architectures (e.g.
> > > > ARMv7a and
> > > > probably x86 as well) won't be able to support this as they at best
> > > > support 64-bit CAS (ARMv7a has LDREXD/STREXD). So you are
> > > > essentially putting a 64-bit (and 128-bit CAS) requirement on the
> > > > implementation.
> > > > 
> > > Yes, I am. I tried to make that clear in the cover letter.
> > > 
> > > > 
> > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > > > +};
> > > > > > > +
> > > > > > > +/* The non-blocking ring algorithm is based on the original
> > > > > > > +rte ring (derived
> > > > > > > + * from FreeBSD's bufring.h) and inspired by Michael and
> > > > > > > +Scott's non-blocking
> > > > > > > + * concurrent queue.
> > > > > > > + */
> > > > > > > +
> > > > > > > +/**
> > > > > > > + * @internal
> > > > > > > + *   Enqueue several objects on the non-blocking ring
> > > > > > > +(single-producer only)
> > > > > > > + *
> > > > > > > + * @param r
> > > > > > > + *   A pointer to the ring structure.
> > > > > > > + * @param obj_table
> > > > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > > > + * @param n
> > > > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > > > + * @param behavior
> > > > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items
> > > > > > > +to the ring
> > > > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as
> > > > > > > +possible to the ring
> > > > > > > + * @param free_space
> > > > > > > + *   returns the amount of space after the enqueue operation
> > > > > > > +has finished
> > > > > > > + * @return
> > > > > > > + *   Actual number of objects enqueued.
> > > > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n
> > > > > > > only.
> > > > > > > + */
> > > > > > > +static __rte_always_inline unsigned int
> > > > > > > +__rte_ring_do_nb_enqueue_sp(struct rte_ring *r, void * const
> > > > > > > *obj_table,
> > > > > > > +			    unsigned int n,
> > > > > > > +			    enum rte_ring_queue_behavior
> > > > > > > behavior,
> > > > > > > +			    unsigned int *free_space) {
> > > > > > > +	uint32_t free_entries;
> > > > > > > +	size_t head, next;
> > > > > > > +
> > > > > > > +	n = __rte_ring_move_prod_head_64(r, 1, n, behavior,
> > > > > > > +					 &head, &next,
> > > > > > > &free_entries);
> > > > > > > +	if (n == 0)
> > > > > > > +		goto end;
> > > > > > > +
> > > > > > > +	ENQUEUE_PTRS_NB(r, &r[1], head, obj_table, n);
> > > > > > > +
> > > > > > > +	r->prod_64.tail += n;
> > > > > > Don't we need release order when (or smp_wmb between) writing of
> > > > > > the ring pointers and the update of tail? By updating the tail
> > > > > > pointer, we are synchronising with a consumer.
> > > > > > 
> > > > > > I prefer using __atomic operations even for load and store. You
> > > > > > can see which parts of the code that synchronise with each other,
> > > > > > e.g.
> > > > > > store-release to some location synchronises with load-acquire
> > > > > > from the same location. If you don't know how different threads
> > > > > > synchronise with each other, you are very likely to make mistakes.
> > > > > > 
> > > > > You can tell this code was written when I thought x86-64 was the
> > > > > only viable target :). Yes, you are correct.
> > > > > 
> > > > > With regards to using __atomic intrinsics, I'm planning on taking
> > > > > a similar approach to the functions duplicated in
> > > > > rte_ring_generic.h and
> > > > > rte_ring_c11_mem.h: one version that uses rte_atomic functions
> > > > > (and thus stricter memory ordering) and one that uses __atomic
> > > > > intrinsics (and thus can benefit from more relaxed memory ordering).
> > From a code point of view, I strongly prefer the atomic operations to be
> > visible
> > in the top level code, not hidden in subroutines. For correctness, it is
> > vital that
> > memory accesses are performed with the required ordering and that acquire
> > and
> > release matches up. Hiding e.g. load-acquire and store-release in
> > subroutines (in
> > a different file!) make this difficult. There have already been such bugs
> > found in
> > rte_ring.
> > 
> After working on the acq/rel ordering this weekend, I agree. This'll be
> easier/cleaner if we end up only using the C11 version.
Fabulous!

As I wrote in a response to Jerin, with a small cheat (LoadStore fence+store-
relaxed instead of store-release in the dequeue function where we only read
shared data in the critical section), C11 should provide the same ordering and
thus the same performance as the explicit barrier version. Benchmarking will
show.
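
To make the idea concrete, here is a rough sketch of a single-consumer dequeue that hands the slots back with a fence plus a relaxed store. The names (demo_cons_ring, demo_dequeue_burst_sc) are made up for illustration and come from neither the patch nor lfring. I am assuming that an acquire fence is how one would express the LoadStore ordering with the GCC builtins (as far as I know it is emitted as "dmb ishld" on AArch64). Strictly speaking this is outside what the C11 model promises for a release, hence "cheat".

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SLOTS 8u /* hypothetical; must be a power of two */

struct demo_cons_ring {
	uint64_t prod_tail; /* written by the producer */
	uint64_t cons_tail; /* written by the consumer */
	void *slots[DEMO_SLOTS];
};

/* Dequeue up to n pointers (single consumer); returns the number dequeued. */
static unsigned int
demo_dequeue_burst_sc(struct demo_cons_ring *r, void **objs, unsigned int n)
{
	uint64_t head = __atomic_load_n(&r->cons_tail, __ATOMIC_RELAXED);
	/* load-acquire: pairs with the producer's store-release of prod_tail */
	uint64_t prod = __atomic_load_n(&r->prod_tail, __ATOMIC_ACQUIRE);
	uint64_t avail = prod - head;
	unsigned int i;

	if (n > avail)
		n = (unsigned int)avail;

	/* The critical section only reads shared data (the ring slots). */
	for (i = 0; i < n; i++)
		objs[i] = r->slots[(head + i) & (DEMO_SLOTS - 1)];

	/* Assumed "cheat": order the slot loads before the tail store
	 * without also ordering unrelated earlier stores, instead of
	 * using a store-release on cons_tail.
	 */
	__atomic_thread_fence(__ATOMIC_ACQUIRE);
	__atomic_store_n(&r->cons_tail, head + n, __ATOMIC_RELAXED);
	return n;
}
```

Whether this actually matches the explicit-barrier version in generated code and in performance is exactly what the benchmarking has to confirm.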

> 
> > 
> > > 
> > > > 
> > > > What's the advantage of having two different implementations? What
> > > > is the disadvantage?
> > > > 
> > > > The existing ring buffer code originally had only the "legacy"
> > > > implementation
> > > > which was kept when the __atomic implementation was added. The
> > > > reason claimed was that some older compilers for x86 do not support
> > > > GCC __atomic builtins. But I thought there was consensus that new
> > > > functionality could have only __atomic implementations.
> > > > 
> > > When CONFIG_RTE_RING_USE_C11_MEM_MODEL was introduced, it was left
> > > disabled for thunderx[1] for performance reasons. Assuming that hasn't
> > > changed, the advantage to having two versions is to best support all of
> > > DPDK's
> > platforms.
> > > 
> > > The disadvantage is of course duplicated code and the additional
> > > maintenance burden.
> > The only way I see that a C11 memory model implementation can be slower
> > than using smp_wmb/rmb is if you need to order loads before a synchronizing
> > store and there are also outstanding stores which do not require ordering.
> > smp_rmb() handles this while store-release will also (unnecessarily) order
> > those
> > outstanding stores. This situation occurs e.g. in ring buffer dequeue
> > operations
> > where ring slots are read (and possibly written to thread-private memory)
> > before
> > the ring slots are release (e.g. using CAS-release or store-release).
> > 
> > I imagine that the LSU/cache subsystem on ThunderX/OCTEON-TX also have
> > something to do with this problem. If there are a large amounts of stores
> > pending in the load/store unit, store-release might have to wait for a long
> > time
> > before the synchronizing store can complete.
> > 
> > > 
> > > 
> > > That said, if the thunderx maintainers are ok with it, I'm certainly
> > > open to only doing the __atomic version. Note that even in the
> > > __atomic version, based on Honnapa's findings[2], using a DPDK-defined
> > > rte_atomic128_cmpset() (with additional arguments to support machines
> > > with weak consistency) appears to be a better option than
> > __atomic_compare_exchange_16.
> > __atomic_compare_exchange_16() is not guaranteed to be lock-free. It is not
> > lock-free on ARM/AArch64 and the support in GCC is formally broken (can't
> > use
> > cmpxchg16b to implement __atomic_load_16).
> > 
> > So yes, I think DPDK will have to define and implement the 128-bit atomic
> > compare and exchange operation (whatever it will be called). For
> > compatibility
> > with ARMv8.0, we can't require the "old" value returned by a failed
> > compare-exchange operation to be read atomically (LDXP does not guarantee atomicity
> > by itself). But this is seldom a problem, many designs read the memory
> > location
> > using two separate 64-bit loads (so not atomic) anyway, it is a successful
> > atomic
> > compare exchange operation which provides atomicity.
> > 
> Ok. I agree, I don't expect that to be a problem. The 128-bit CAS patch I just
> submitted[1] (which was developed before reading this) will have to be
> changed.
> 
> [1] http://mails.dpdk.org/archives/dev/2019-January/124159.html
I will take a look and comment on this.

> 
> Thanks,
> Gage
> 
> </snip>
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl


^ permalink raw reply	[flat|nested] 123+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/5] ring: add 64-bit headtail structure
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 1/5] ring: add 64-bit headtail structure Gage Eads
@ 2019-01-29 12:56         ` Ola Liljedahl
  2019-01-30  4:26           ` Eads, Gage
  0 siblings, 1 reply; 123+ messages in thread
From: Ola Liljedahl @ 2019-01-29 12:56 UTC (permalink / raw)
  To: gage.eads, dev
  Cc: jerinj, mczekaj, nd, bruce.richardson, konstantin.ananyev,
	stephen, olivier.matz, arybchenko

On Mon, 2019-01-28 at 12:14 -0600, Gage Eads wrote:
> 64-bit head and tail index widths greatly increases the time it takes for
> them to wrap-around (with current CPU speeds, it won't happen within the
> author's lifetime). This is fundamental to avoiding the ABA problem -- in
> which a thread mistakes reading the same tail index in two accesses to mean
> that the ring was not modified in the intervening time -- in the upcoming
> non-blocking ring implementation. Using a 64-bit index makes the
> possibility of this occurring effectively zero.
Just an observation.
The following invariant holds (using ring_size instead of mask):
∀ index: ring[index % ring_size].index % ring_size == index % ring_size
i.e. the N (N = log2 of the ring size) least significant bits of ring[].index
will always be the same for a specific slot, so they carry no information.

This means we don't have to store the whole index in each slot, it is enough to
store "index / ring_size" (which I call the lap counter). This could be useful
for an implementation for 32-bit platforms which support 64-bit CAS (to write
the slot ptr & index (lap counter) atomically) and uses 64-bit head & tail
indexes (to avoid the quick wrap around you would have with 32-bit ring
indexes).

So
ring[index % ring_size].lap = index / ring_size;

An implementation could of course use bitwise AND instead of modulo and a bitwise
right shift instead of division. The base-2 logarithm of ring_size should also be
precalculated and stored in the ring buffer metadata.
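
A small sketch of the arithmetic, assuming a power-of-two ring size. The struct and function names (demo_lap_ring, demo_slot, demo_lap, demo_index) are hypothetical and only illustrate the split of a 64-bit index into slot number and lap counter:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ring metadata: size is a power of two, log2 precomputed. */
struct demo_lap_ring {
	uint32_t mask;      /* ring_size - 1 */
	uint32_t log2_size; /* log2(ring_size) */
};

/* Slot that a given head/tail index maps to (index % ring_size). */
static inline uint32_t
demo_slot(const struct demo_lap_ring *r, uint64_t index)
{
	return (uint32_t)(index & r->mask);
}

/* Lap counter stored in the slot (index / ring_size), truncated to
 * 32 bits so that pointer + lap fit in one 64-bit CAS on a 32-bit
 * platform.
 */
static inline uint32_t
demo_lap(const struct demo_lap_ring *r, uint64_t index)
{
	return (uint32_t)(index >> r->log2_size);
}

/* Rebuild the index from slot and lap (valid while lap has not wrapped). */
static inline uint64_t
demo_index(const struct demo_lap_ring *r, uint32_t slot, uint32_t lap)
{
	return ((uint64_t)lap << r->log2_size) | slot;
}
```

For example, with ring_size = 1024, index 5000 maps to slot 904 on lap 4. Note that a 32-bit lap counter still wraps after 2^32 laps, which is ring_size times longer than a 32-bit index would last.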

-- Ola

> 
> This commit places the new producer and consumer structures in the same
> location in struct rte_ring as their 32-bit counterparts. Since the 32-bit
> versions are padded out to a cache line, there is space for the new
> structure without affecting the layout of struct rte_ring. Thus, the ABI is
> preserved.
> 
> Signed-off-by: Gage Eads <gage.eads@intel.com>
> ---
>  lib/librte_ring/rte_ring.h         |  23 +++++-
>  lib/librte_ring/rte_ring_c11_mem.h | 153 +++++++++++++++++++++++++++++++++++++
>  lib/librte_ring/rte_ring_generic.h | 139 +++++++++++++++++++++++++++++++++
>  3 files changed, 312 insertions(+), 3 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index af5444a9f..00dfb5b85 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -1,6 +1,6 @@
>  /* SPDX-License-Identifier: BSD-3-Clause
>   *
> - * Copyright (c) 2010-2017 Intel Corporation
> + * Copyright (c) 2010-2019 Intel Corporation
>   * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
>   * All rights reserved.
>   * Derived from FreeBSD's bufring.h
> @@ -70,6 +70,15 @@ struct rte_ring_headtail {
>  	uint32_t single;         /**< True if single prod/cons */
>  };
>  
> +/* 64-bit version of rte_ring_headtail, for use by rings that need to avoid
> + * head/tail wrap-around.
> + */
> +struct rte_ring_headtail_64 {
> +	volatile uint64_t head;  /**< Prod/consumer head. */
> +	volatile uint64_t tail;  /**< Prod/consumer tail. */
> +	uint32_t single;       /**< True if single prod/cons */
> +};
> +
>  /**
>   * An RTE ring structure.
>   *
> @@ -97,11 +106,19 @@ struct rte_ring {
>  	char pad0 __rte_cache_aligned; /**< empty cache line */
>  
>  	/** Ring producer status. */
> -	struct rte_ring_headtail prod __rte_cache_aligned;
> +	RTE_STD_C11
> +	union {
> +		struct rte_ring_headtail prod __rte_cache_aligned;
> +		struct rte_ring_headtail_64 prod_64 __rte_cache_aligned;
> +	};
>  	char pad1 __rte_cache_aligned; /**< empty cache line */
>  
>  	/** Ring consumer status. */
> -	struct rte_ring_headtail cons __rte_cache_aligned;
> +	RTE_STD_C11
> +	union {
> +		struct rte_ring_headtail cons __rte_cache_aligned;
> +		struct rte_ring_headtail_64 cons_64 __rte_cache_aligned;
> +	};
>  	char pad2 __rte_cache_aligned; /**< empty cache line */
>  };
>  
> diff --git a/lib/librte_ring/rte_ring_c11_mem.h
> b/lib/librte_ring/rte_ring_c11_mem.h
> index 0fb73a337..47acd4c7c 100644
> --- a/lib/librte_ring/rte_ring_c11_mem.h
> +++ b/lib/librte_ring/rte_ring_c11_mem.h
> @@ -178,4 +178,157 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
>  	return n;
>  }
>  
> +/**
> + * @internal This function updates the producer head for enqueue using
> + *	     64-bit head/tail values.
> + *
> + * @param r
> + *   A pointer to the ring structure
> + * @param is_sp
> + *   Indicates whether multi-producer path is needed or not
> + * @param n
> + *   The number of elements we will want to enqueue, i.e. how far should the
> + *   head be moved
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
> + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to ring
> + * @param old_head
> + *   Returns head value as it was before the move, i.e. where enqueue starts
> + * @param new_head
> + *   Returns the current/new head value i.e. where enqueue finishes
> + * @param free_entries
> + *   Returns the amount of free space in the ring BEFORE head was moved
> + * @return
> + *   Actual number of objects enqueued.
> + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_move_prod_head_64(struct rte_ring *r, unsigned int is_sp,
> +		unsigned int n, enum rte_ring_queue_behavior behavior,
> +		uint64_t *old_head, uint64_t *new_head,
> +		uint32_t *free_entries)
> +{
> +	const uint32_t capacity = r->capacity;
> +	uint64_t cons_tail;
> +	unsigned int max = n;
> +	int success;
> +
> +	*old_head = __atomic_load_n(&r->prod_64.head, __ATOMIC_RELAXED);
> +	do {
> +		/* Reset n to the initial burst count */
> +		n = max;
> +
> +		/* Ensure the head is read before tail */
> +		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> +
> +		/* load-acquire synchronize with store-release of ht->tail
> +		 * in update_tail.
> +		 */
> +		cons_tail = __atomic_load_n(&r->cons_64.tail,
> +					__ATOMIC_ACQUIRE);
> +
> +		/* The subtraction is done between two unsigned 64-bit values
> +		 * (the result is always modulo 64 bits even if we have
> +		 * *old_head > cons_tail). So 'free_entries' is always
> +		 * between 0 and capacity (which is < size).
> +		 */
> +		*free_entries = (capacity + cons_tail - *old_head);
> +
> +		/* check that we have enough room in ring */
> +		if (unlikely(n > *free_entries))
> +			n = (behavior == RTE_RING_QUEUE_FIXED) ?
> +					0 : *free_entries;
> +
> +		if (n == 0)
> +			return 0;
> +
> +		*new_head = *old_head + n;
> +		if (is_sp)
> +			r->prod_64.head = *new_head, success = 1;
> +		else
> +			/* on failure, *old_head is updated */
> +			success = __atomic_compare_exchange_n(&r->prod_64.head,
> +					old_head, *new_head,
> +					0, __ATOMIC_RELAXED,
> +					__ATOMIC_RELAXED);
> +	} while (unlikely(success == 0));
> +	return n;
> +}
> +
> +/**
> + * @internal This function updates the consumer head for dequeue using
> + *	     64-bit head/tail values.
> + *
> + * @param r
> + *   A pointer to the ring structure
> + * @param is_sc
> + *   Indicates whether multi-consumer path is needed or not
> + * @param n
> + *   The number of elements we will want to dequeue, i.e. how far should the
> + *   head be moved
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
> + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
> + * @param old_head
> + *   Returns head value as it was before the move, i.e. where dequeue starts
> + * @param new_head
> + *   Returns the current/new head value i.e. where dequeue finishes
> + * @param entries
> + *   Returns the number of entries in the ring BEFORE head was moved
> + * @return
> + *   - Actual number of objects dequeued.
> + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_move_cons_head_64(struct rte_ring *r, unsigned int is_sc,
> +		unsigned int n, enum rte_ring_queue_behavior behavior,
> +		uint64_t *old_head, uint64_t *new_head,
> +		uint32_t *entries)
> +{
> +	unsigned int max = n;
> +	uint64_t prod_tail;
> +	int success;
> +
> +	/* move cons.head atomically */
> +	*old_head = __atomic_load_n(&r->cons_64.head, __ATOMIC_RELAXED);
> +	do {
> +		/* Restore n as it may change every loop */
> +		n = max;
> +
> +		/* Ensure the head is read before tail */
> +		__atomic_thread_fence(__ATOMIC_ACQUIRE);
> +
> +		/* this load-acquire synchronize with store-release of
> +		 * ht->tail in update_tail.
> +		 */
> +		prod_tail = __atomic_load_n(&r->prod_64.tail,
> +					__ATOMIC_ACQUIRE);
> +
> +		/* The subtraction is done between two unsigned 64-bit values
> +		 * (the result is always modulo 64 bits even if we have
> +		 * cons_head > prod_tail). So 'entries' is always between 0
> +		 * and size(ring)-1.
> +		 */
> +		*entries = (prod_tail - *old_head);
> +
> +		/* Set the actual entries for dequeue */
> +		if (n > *entries)
> +			n = (behavior == RTE_RING_QUEUE_FIXED) ?
> +					0 : *entries;
> +
> +		if (unlikely(n == 0))
> +			return 0;
> +
> +		*new_head = *old_head + n;
> +		if (is_sc)
> +			r->cons_64.head = *new_head, success = 1;
> +		else
> +			/* on failure, *old_head will be updated */
> +			success = __atomic_compare_exchange_n(&r->cons_64.head,
> +							old_head, *new_head,
> +							0, __ATOMIC_RELAXED,
> +							__ATOMIC_RELAXED);
> +	} while (unlikely(success == 0));
> +	return n;
> +}
> +
>  #endif /* _RTE_RING_C11_MEM_H_ */
> diff --git a/lib/librte_ring/rte_ring_generic.h
> b/lib/librte_ring/rte_ring_generic.h
> index ea7dbe5b9..2158e092a 100644
> --- a/lib/librte_ring/rte_ring_generic.h
> +++ b/lib/librte_ring/rte_ring_generic.h
> @@ -167,4 +167,143 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
>  	return n;
>  }
>  
> +/**
> + * @internal This function updates the producer head for enqueue using
> + *	     64-bit head/tail values.
> + *
> + * @param r
> + *   A pointer to the ring structure
> + * @param is_sp
> + *   Indicates whether multi-producer path is needed or not
> + * @param n
> + *   The number of elements we will want to enqueue, i.e. how far should the
> + *   head be moved
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
> + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to a ring
> + * @param old_head
> + *   Returns head value as it was before the move, i.e. where enqueue starts
> + * @param new_head
> + *   Returns the current/new head value i.e. where enqueue finishes
> + * @param free_entries
> + *   Returns the amount of free space in the ring BEFORE head was moved
> + * @return
> + *   Actual number of objects enqueued.
> + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_move_prod_head_64(struct rte_ring *r, unsigned int is_sp,
> +		unsigned int n, enum rte_ring_queue_behavior behavior,
> +		uint64_t *old_head, uint64_t *new_head,
> +		uint32_t *free_entries)
> +{
> +	const uint32_t capacity = r->capacity;
> +	unsigned int max = n;
> +	int success;
> +
> +	do {
> +		/* Reset n to the initial burst count */
> +		n = max;
> +
> +		*old_head = r->prod_64.head;
> +
> +		/* add rmb barrier to avoid load/load reorder in weak
> +		 * memory model. It is noop on x86
> +		 */
> +		rte_smp_rmb();
> +
> +		/*
> +		 *  The subtraction is done between two unsigned 64bits value
> +		 * (the result is always modulo 64 bits even if we have
> +		 * *old_head > cons_tail). So 'free_entries' is always between 0
> +		 * and capacity (which is < size).
> +		 */
> +		*free_entries = (capacity + r->cons_64.tail - *old_head);
> +
> +		/* check that we have enough room in ring */
> +		if (unlikely(n > *free_entries))
> +			n = (behavior == RTE_RING_QUEUE_FIXED) ?
> +					0 : *free_entries;
> +
> +		if (n == 0)
> +			return 0;
> +
> +		*new_head = *old_head + n;
> +		if (is_sp)
> +			r->prod_64.head = *new_head, success = 1;
> +		else
> +			success = rte_atomic64_cmpset(&r->prod_64.head,
> +					*old_head, *new_head);
> +	} while (unlikely(success == 0));
> +	return n;
> +}
> +
> +/**
> + * @internal This function updates the consumer head for dequeue using
> + *	     64-bit head/tail values.
> + *
> + * @param r
> + *   A pointer to the ring structure
> + * @param is_sc
> + *   Indicates whether multi-consumer path is needed or not
> + * @param n
> + *   The number of elements we will want to dequeue, i.e. how far should the
> + *   head be moved
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
> + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
> + * @param old_head
> + *   Returns head value as it was before the move, i.e. where dequeue starts
> + * @param new_head
> + *   Returns the current/new head value i.e. where dequeue finishes
> + * @param entries
> + *   Returns the number of entries in the ring BEFORE head was moved
> + * @return
> + *   - Actual number of objects dequeued.
> + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_move_cons_head_64(struct rte_ring *r, unsigned int is_sc,
> +		unsigned int n, enum rte_ring_queue_behavior behavior,
> +		uint64_t *old_head, uint64_t *new_head,
> +		uint32_t *entries)
> +{
> +	unsigned int max = n;
> +	int success;
> +
> +	do {
> +		/* Restore n as it may change every loop */
> +		n = max;
> +
> +		*old_head = r->cons_64.head;
> +
> +		/* add rmb barrier to avoid load/load reorder in weak
> +		 * memory model. It is noop on x86
> +		 */
> +		rte_smp_rmb();
> +
> +		/* The subtraction is done between two unsigned 64bits value
> +		 * (the result is always modulo 64 bits even if we have
> +		 * cons_head > prod_tail). So 'entries' is always between 0
> +		 * and size(ring)-1.
> +		 */
> +		*entries = (r->prod_64.tail - *old_head);
> +
> +		/* Set the actual entries for dequeue */
> +		if (n > *entries)
> +			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
> +
> +		if (unlikely(n == 0))
> +			return 0;
> +
> +		*new_head = *old_head + n;
> +		if (is_sc)
> +			r->cons_64.head = *new_head, success = 1;
> +		else
> +			success = rte_atomic64_cmpset(&r->cons_64.head,
> +					*old_head, *new_head);
> +	} while (unlikely(success == 0));
> +	return n;
> +}
> +
>  #endif /* _RTE_RING_GENERIC_H_ */
-- 
Ola Liljedahl, Networking System Architect, Arm
Phone +46706866373, Skype ola.liljedahl



* Re: [dpdk-dev] [PATCH v4 1/5] ring: add 64-bit headtail structure
  2019-01-29 12:56         ` Ola Liljedahl
@ 2019-01-30  4:26           ` Eads, Gage
  0 siblings, 0 replies; 123+ messages in thread
From: Eads, Gage @ 2019-01-30  4:26 UTC (permalink / raw)
  To: Ola Liljedahl, dev
  Cc: jerinj, mczekaj, nd, Richardson, Bruce, Ananyev, Konstantin,
	stephen, olivier.matz, arybchenko



> -----Original Message-----
> From: Ola Liljedahl [mailto:Ola.Liljedahl@arm.com]
> Sent: Tuesday, January 29, 2019 6:57 AM
> To: Eads, Gage <gage.eads@intel.com>; dev@dpdk.org
> Cc: jerinj@marvell.com; mczekaj@marvell.com; nd <nd@arm.com>;
> Richardson, Bruce <bruce.richardson@intel.com>; Ananyev, Konstantin
> <konstantin.ananyev@intel.com>; stephen@networkplumber.org;
> olivier.matz@6wind.com; arybchenko@solarflare.com
> Subject: Re: [PATCH v4 1/5] ring: add 64-bit headtail structure
> 
> On Mon, 2019-01-28 at 12:14 -0600, Gage Eads wrote:
> > 64-bit head and tail index widths greatly increase the time it takes
> > for them to wrap-around (with current CPU speeds, it won't happen
> > within the author's lifetime). This is fundamental to avoiding the ABA
> > problem -- in which a thread mistakes reading the same tail index in
> > two accesses to mean that the ring was not modified in the intervening
> > time -- in the upcoming non-blocking ring implementation. Using a
> > 64-bit index makes the possibility of this occurring effectively zero.
> Just an observation.
> The following invariant holds (using ring_size instead of mask):
> ∀ index: ring[index % ring_size].index % ring_size == index % ring_size i.e. the N
> (N=log2 ring size) lsb of ring[].index will always be the same (for a specific slot)
> so serve no purpose.
> 
> This means we don't have to store the whole index in each slot, it is enough to
> store "index / ring_size" (which I call the lap counter). This could be useful for an
> implementation for 32-bit platforms which support 64-bit CAS (to write the slot
> ptr & index (lap counter) atomically) and uses 64-bit head & tail indexes (to avoid
> the quick wrap around you would have with 32-bit ring indexes).
> 
> So
> ring[index % ring_size].lap = index / ring_size;
> 
> An implementation could of course use bitwise-and instead of modulo and
> bitwise right shift instead of division. The base-2 logarithm of ring_size should also be
> precalculated and stored in the ring buffer metadata.
> 
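A minimal sketch of the lap-counter mapping described in the quoted text, assuming a power-of-two ring size. The helper names (slot_of, lap_of, index_of) are hypothetical and do not come from the patchset; the lap counter is truncated to 32 bits here to match the 32-bit-platform case under discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers; names are illustrative, not from the patch. */

/* Slot that a 64-bit ring index maps to (ring size is 2^log2_size). */
static inline uint32_t
slot_of(uint64_t index, unsigned int log2_size)
{
	uint64_t mask = (UINT64_C(1) << log2_size) - 1;

	return (uint32_t)(index & mask);	/* index % ring_size */
}

/* Lap counter stored in the slot, truncated to 32 bits. */
static inline uint32_t
lap_of(uint64_t index, unsigned int log2_size)
{
	return (uint32_t)(index >> log2_size);	/* index / ring_size */
}

/* Rebuild the low (32 + log2_size) bits of the index from slot and lap. */
static inline uint64_t
index_of(uint32_t slot, uint32_t lap, unsigned int log2_size)
{
	return ((uint64_t)lap << log2_size) | slot;
}
```

With a 1024-entry ring (log2_size = 10), index 1027 lands in slot 3 on lap 1, and index_of(3, 1, 10) gives back 1027. The slot number never needs to be stored because it is implied by the slot's position.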

That's a pretty interesting idea. The question is, with such a design, what should DPDK's minimum NB ring size be on 32-bit platforms?

If a ring entry is written on average every M cycles, then a lap occurs every M*N cycles and each counter repeats every M*N*2^32 cycles. If M=100 on a 2GHz system, then the counter repeats every

N=1: 3.33 minutes
...
N=256: 14.22 hours
N=512: 28.44 hours
N=1024: 2.37 days
...
N=16384: 37.92 days

I think a minimum size of 1024 strikes a good balance between not too burdensome and sufficiently low odds of ABA occurring.
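The arithmetic above can be checked with a short helper. This is only a sketch of the M*N*2^32 formula from this message; the function name is hypothetical. The figures listed above correspond to rounding 2^32 down to 4e9; with the exact value the intervals come out somewhat longer (about 3.58 minutes for N=1 and about 2.55 days for N=1024), which does not change the conclusion.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Seconds until a 32-bit lap counter repeats, assuming one ring entry
 * is written every m cycles, the ring holds n entries, and the core
 * runs at hz cycles per second.
 */
static double
lap_counter_repeat_seconds(double m, double n, double hz)
{
	const double laps = 4294967296.0;	/* 2^32 */

	return m * n * laps / hz;
}
```

For example, lap_counter_repeat_seconds(100, 1024, 2e9) / 86400.0 gives the interval in days for a 1024-entry ring.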

Thanks,
Gage

[snip]


* [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler
  2019-01-28 18:14     ` [dpdk-dev] [PATCH v4 " Gage Eads
                         ` (4 preceding siblings ...)
  2019-01-28 18:14       ` [dpdk-dev] [PATCH v4 5/5] mempool/ring: add non-blocking ring handlers Gage Eads
@ 2019-03-05 17:40       ` Gage Eads
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 1/6] ring: add a pointer-width headtail structure Gage Eads
                           ` (6 more replies)
  5 siblings, 7 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-05 17:40 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a lock-free ring and a mempool based on it. The
lock-free algorithm relies on a double-pointer compare-and-swap, so for 64-bit
architectures it is currently limited to x86_64.

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + n) CAS operations and a
dequeue of n pointers uses 1. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the lock-free ring's.
- Software caching can mitigate the average-case performance penalty of
  ring-based algorithms; for example, a lock-free ring based mempool (a likely
  use case for this ring) with per-thread caching.
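To make the CAS accounting concrete, here is a simplified sketch of a single CAS-based index update, modeled on the C11 __rte_ring_move_prod_head_ptr() added in patch 1/6 but written with portable <stdatomic.h> calls. struct sketch_ring and sketch_move_prod_head() are hypothetical stand-ins, not the patchset's types, and the lock-free ring's per-entry CAS operations are not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical, simplified ring indexes; not the patch's struct rte_ring. */
struct sketch_ring {
	_Atomic uintptr_t prod_head;
	_Atomic uintptr_t cons_tail;
	uint32_t capacity;
};

/*
 * Reserve up to n slots for a multi-producer enqueue.
 * Returns the number reserved and stores the first reserved index in
 * *old_head. If 'fixed' is non-zero, reserves either n slots or none.
 */
static unsigned int
sketch_move_prod_head(struct sketch_ring *r, unsigned int n, int fixed,
		      uintptr_t *old_head)
{
	const unsigned int max = n;
	uintptr_t cons_tail;
	uint32_t free_entries;

	*old_head = atomic_load_explicit(&r->prod_head, memory_order_relaxed);
	do {
		n = max;	/* n may have been clamped on a previous pass */

		/* Read the head before the tail. */
		atomic_thread_fence(memory_order_acquire);
		cons_tail = atomic_load_explicit(&r->cons_tail,
						 memory_order_acquire);

		/* Unsigned arithmetic stays correct across index wrap-around. */
		free_entries = (uint32_t)(r->capacity + cons_tail - *old_head);

		if (n > free_entries)
			n = fixed ? 0 : free_entries;
		if (n == 0)
			return 0;

		/* On failure the CAS reloads *old_head and the loop retries. */
	} while (!atomic_compare_exchange_strong_explicit(&r->prod_head,
			old_head, *old_head + n,
			memory_order_relaxed, memory_order_relaxed));

	return n;
}
```

With no contention the loop body runs once, so the index update costs exactly one CAS regardless of n.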

The lock-free ring is enabled via a new flag, RING_F_LF. For ease-of-use,
existing ring enqueue/dequeue functions work with both standard and lock-free
rings. This is also an experimental API, so RING_F_LF users must build with the
ALLOW_EXPERIMENTAL_API flag.

This patchset also adds lock-free versions of ring_autotest and
ring_perf_autotest, and a lock-free ring based mempool.

This patchset makes one API change; a deprecation notice was posted in a
separate commit[1].

This patchset depends on the 128-bit compare-and-set patch[2].

[1] http://mails.dpdk.org/archives/dev/2019-February/124321.html
[2] http://mails.dpdk.org/archives/dev/2019-March/125751.html

v5:
 - Incorporated lfring's enqueue and dequeue logic from
   http://mails.dpdk.org/archives/dev/2019-January/124242.html
 - Renamed non-blocking -> lock-free and NB -> LF to align with a similar
   change in the lock-free stack patchset:
   http://mails.dpdk.org/archives/dev/2019-March/125797.html
 - Added support for 32-bit architectures by using the full 32b of the
   modification counter and requiring LF rings on these architectures to be at
   least 1024 entries.
 - Updated to the latest rte_atomic128_cmp_exchange() interface.
 - Added ring start marker to struct rte_ring

v4:
 - Split out nb_enqueue and nb_dequeue functions in generic and C11 versions,
   with the necessary memory ordering behavior for weakly consistent machines.
 - Convert size_t variables (from v2) to uint64_t and remove the
   no-longer-applicable comment about variably-sized ring indexes.
 - Fix bug in nb_enqueue_mp that breaks the non-blocking guarantee.
 - Split the ring_ptr cast into two lines.
 - Change the dependent patchset from the non-blocking stack patch series
   to one only containing the 128b CAS commit

v3:
 - Avoid the ABI break by putting 64-bit head and tail values in the same
   cacheline as struct rte_ring's prod and cons members.
 - Don't attempt to compile rte_atomic128_cmpset without
   ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
 - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone tries
   to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
 - Update the ring mempool to use experimental APIs
 - Clarify that RING_F_NB is currently limited to x86_64 only; e.g. ARMv8 has
   the ISA support for 128-bit CAS to eventually support it.

v2:
 - Merge separate docs commit into patch #5
 - Convert uintptr_t to size_t
 - Add a compile-time check for the size of size_t
 - Fix a space-after-typecast issue
 - Fix an unnecessary-parentheses checkpatch warning
 - Bump librte_ring's library version

Gage Eads (6):
  ring: add a pointer-width headtail structure
  ring: add a ring start marker
  ring: add a lock-free implementation
  test_ring: add lock-free ring autotest
  test_ring_perf: add lock-free ring perf test
  mempool/ring: add lock-free ring handlers

 doc/guides/prog_guide/env_abstraction_layer.rst |  10 +
 drivers/mempool/ring/Makefile                   |   1 +
 drivers/mempool/ring/meson.build                |   2 +
 drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
 lib/librte_ring/rte_ring.c                      |  92 ++++-
 lib/librte_ring/rte_ring.h                      | 334 ++++++++++++++--
 lib/librte_ring/rte_ring_c11_mem.h              | 501 ++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h              | 484 +++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map            |   7 +
 test/test/test_ring.c                           |  61 +--
 test/test/test_ring_perf.c                      |  19 +-
 11 files changed, 1492 insertions(+), 77 deletions(-)

-- 
2.13.6


* [dpdk-dev] [PATCH v5 1/6] ring: add a pointer-width headtail structure
  2019-03-05 17:40       ` [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
@ 2019-03-05 17:40         ` Gage Eads
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 2/6] ring: add a ring start marker Gage Eads
                           ` (5 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-05 17:40 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

For 64-bit systems, at current CPU speeds, 64-bit head and tail indexes
will not wrap-around within the author's lifetime. This is important to
avoiding the ABA problem -- in which a thread mistakes reading the same
tail index in two accesses to mean that the ring was not modified in the
intervening time -- in the upcoming lock-free ring implementation. Using a
64-bit index makes the possibility of this occurring effectively zero. This
commit uses pointer-width indexes so the lock-free ring can support 32-bit
systems as well.

This commit places the new producer and consumer structures in the same
location in struct rte_ring as their 32-bit counterparts. Since the 32-bit
versions are padded out to a cache line, there is space for the new
structure without affecting the layout of struct rte_ring. Thus, the ABI is
preserved.
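The layout argument can be sketched with compile-time checks. The structures below are hypothetical, cut-down stand-ins for struct rte_ring and its headtail members, and a 64-byte cache line is assumed; the point is only that the wider member shares the cache line already reserved for the 32-bit one, so whatever follows the union keeps its offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE 64	/* assumed cache-line size */

struct sketch_headtail {		/* 32-bit indexes */
	volatile uint32_t head;
	volatile uint32_t tail;
	uint32_t single;
};

struct sketch_headtail_ptr {		/* pointer-width indexes */
	volatile uintptr_t head;
	volatile uintptr_t tail;
	uint32_t single;
};

struct sketch_ring {
	union {
		_Alignas(SKETCH_CACHE_LINE) struct sketch_headtail prod;
		_Alignas(SKETCH_CACHE_LINE) struct sketch_headtail_ptr prod_ptr;
	};
	_Alignas(SKETCH_CACHE_LINE) char pad1;
};

/* The wider structure still fits in the cache line the old one occupied... */
_Static_assert(sizeof(struct sketch_headtail_ptr) <= SKETCH_CACHE_LINE,
	       "pointer-width headtail must fit in one cache line");
/* ...so the member that follows the union keeps its offset. */
_Static_assert(offsetof(struct sketch_ring, pad1) == SKETCH_CACHE_LINE,
	       "layout after the union is unchanged");
```

If either assertion failed at build time, the union would have grown past its cache line and the ABI claim would no longer hold.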

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.h         |  21 +++++-
 lib/librte_ring/rte_ring_c11_mem.h | 143 +++++++++++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h | 130 +++++++++++++++++++++++++++++++++
 3 files changed, 291 insertions(+), 3 deletions(-)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index af5444a9f..c78db6916 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -70,6 +70,13 @@ struct rte_ring_headtail {
 	uint32_t single;         /**< True if single prod/cons */
 };
 
+/* Structure to hold a pair of pointer-sized head/tail values and metadata */
+struct rte_ring_headtail_ptr {
+	volatile uintptr_t head; /**< Prod/consumer head. */
+	volatile uintptr_t tail; /**< Prod/consumer tail. */
+	uint32_t single;         /**< True if single prod/cons */
+};
+
 /**
  * An RTE ring structure.
  *
@@ -97,11 +104,19 @@ struct rte_ring {
 	char pad0 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring producer status. */
-	struct rte_ring_headtail prod __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail prod __rte_cache_aligned;
+		struct rte_ring_headtail_ptr prod_ptr __rte_cache_aligned;
+	};
 	char pad1 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring consumer status. */
-	struct rte_ring_headtail cons __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail cons __rte_cache_aligned;
+		struct rte_ring_headtail_ptr cons_ptr __rte_cache_aligned;
+	};
 	char pad2 __rte_cache_aligned; /**< empty cache line */
 };
 
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a337..545caf257 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -178,4 +178,147 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to a ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	uintptr_t cons_tail;
+	unsigned int max = n;
+	int success;
+
+	*old_head = __atomic_load_n(&r->prod_ptr.head, __ATOMIC_RELAXED);
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		cons_tail = __atomic_load_n(&r->cons_ptr.tail,
+					__ATOMIC_ACQUIRE);
+
+		*free_entries = (capacity + cons_tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_ptr.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head is updated */
+			success = __atomic_compare_exchange_n(&r->prod_ptr.head,
+					old_head, *new_head,
+					0, __ATOMIC_RELAXED,
+					__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	uintptr_t prod_tail;
+	int success;
+
+	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons_ptr.head, __ATOMIC_RELAXED);
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
+					__ATOMIC_ACQUIRE);
+
+		*entries = (prod_tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_ptr.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head will be updated */
+			success = __atomic_compare_exchange_n(&r->cons_ptr.head,
+							old_head, *new_head,
+							0, __ATOMIC_RELAXED,
+							__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_C11_MEM_H_ */
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..6a0e1bbfb 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -167,4 +167,134 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to a ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		*old_head = r->prod_ptr.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		*free_entries = (capacity + r->cons_ptr.tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_ptr.head = *new_head, success = 1;
+		else
+			success = __sync_bool_compare_and_swap(
+					&r->prod_ptr.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		*old_head = r->cons_ptr.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		*entries = (r->prod_ptr.tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_ptr.head = *new_head, success = 1;
+		else
+			success = __sync_bool_compare_and_swap(
+					&r->cons_ptr.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_GENERIC_H_ */
-- 
2.13.6


* [dpdk-dev] [PATCH v5 2/6] ring: add a ring start marker
  2019-03-05 17:40       ` [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 1/6] ring: add a pointer-width headtail structure Gage Eads
@ 2019-03-05 17:40         ` Gage Eads
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 3/6] ring: add a lock-free implementation Gage Eads
                           ` (4 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-05 17:40 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

This marker allows us to replace "&r[1]" with "&r->ring" to locate the
start of the ring.
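As a rough illustration of why the two expressions are interchangeable, the sketch below uses a hypothetical, cut-down header (not the real struct rte_ring) with an assumed 64-byte cache line. A cache-aligned flexible array member occupies no space and sits at the first byte after the header, which is the same address "&r[1]" computes.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_CACHE_LINE 64	/* assumed cache-line size */

struct sketch_ring {
	unsigned int size;
	_Alignas(SKETCH_CACHE_LINE) char pad;
	/* Zero-size marker: names the first byte after the header. */
	_Alignas(SKETCH_CACHE_LINE) void *ring[];
};

/* The marker starts exactly where the header ends, i.e. at &r[1]. */
_Static_assert(offsetof(struct sketch_ring, ring) ==
	       sizeof(struct sketch_ring),
	       "marker must coincide with the end of the header");

/* Locate the ring storage by name instead of by pointer arithmetic. */
static inline void **
sketch_ring_start(struct sketch_ring *r)
{
	return r->ring;
}
```

The named member documents intent at each call site, while the static assertion guards the assumption that no padding separates the marker from the end of the header.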

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index c78db6916..f16d77b8a 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -118,6 +118,7 @@ struct rte_ring {
 		struct rte_ring_headtail_ptr cons_ptr __rte_cache_aligned;
 	};
 	char pad2 __rte_cache_aligned; /**< empty cache line */
+	void *ring[] __rte_cache_aligned; /**< empty marker for ring start */
 };
 
 #define RING_F_SP_ENQ 0x0001 /**< The default enqueue is "single-producer". */
@@ -361,7 +362,7 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
 	if (n == 0)
 		goto end;
 
-	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
+	ENQUEUE_PTRS(r, &r->ring, prod_head, obj_table, n, void *);
 
 	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
 end:
@@ -403,7 +404,7 @@ __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
 	if (n == 0)
 		goto end;
 
-	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);
+	DEQUEUE_PTRS(r, &r->ring, cons_head, obj_table, n, void *);
 
 	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
 
-- 
2.13.6


* [dpdk-dev] [PATCH v5 3/6] ring: add a lock-free implementation
  2019-03-05 17:40       ` [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 1/6] ring: add a pointer-width headtail structure Gage Eads
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 2/6] ring: add a ring start marker Gage Eads
@ 2019-03-05 17:40         ` Gage Eads
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 4/6] test_ring: add lock-free ring autotest Gage Eads
                           ` (3 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-05 17:40 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

This commit adds support for lock-free circular ring enqueue and dequeue
functions. The ring is supported on 32- and 64-bit architectures, however
it uses a 128-bit compare-and-swap instruction when run on a 64-bit
architecture, and thus is currently limited to x86_64.

The algorithm is based on Ola Liljedahl's lfring, modified to fit within
the rte ring API. With no contention, an enqueue of n pointers uses (1 + n)
CAS operations and a dequeue of n pointers uses 1. This algorithm has worse
average-case performance than the regular rte ring (particularly a
highly-contended ring with large bulk accesses), however:
- For applications with preemptible pthreads, the regular rte ring's
  worst-case performance (i.e. one thread being preempted in the
  update_tail() critical section) is much worse than the lock-free ring's.
- Software caching can mitigate the average-case performance penalty of
  ring-based algorithms; for example, a lock-free ring based mempool (a
  likely use case for this ring) with per-thread caching.

To avoid the ABA problem, each ring entry contains a modification counter.
On a 64-bit architecture, the chance of ABA occurring is effectively zero;
a 64-bit counter will take many years to wrap at current CPU frequencies.
On 32-bit architectures, a lock-free ring must be at least 1024 entries
deep; assuming 100 cycles per ring entry access, this guarantees the ring's
modification counters will take on the order of days to wrap.
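The role of the modification counter can be sketched in portable C11. This is not the patch's code: the patch swaps a pointer and its counter with a double-pointer-width CAS, whereas the hypothetical sketch_entry_t below packs a 32-bit value and a 32-bit counter into one 64-bit atomic word so the example runs anywhere.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical ring entry: a 32-bit value and a 32-bit modification
 * counter packed into one word so both are swapped by a single CAS.
 */
typedef _Atomic uint64_t sketch_entry_t;

static inline uint64_t
sketch_pack(uint32_t val, uint32_t cnt)
{
	return ((uint64_t)cnt << 32) | val;
}

/* Replace the entry's value only if the entry still equals 'snapshot';
 * every successful write bumps the counter.
 */
static int
sketch_entry_write(sketch_entry_t *e, uint64_t snapshot, uint32_t new_val)
{
	uint32_t cnt = (uint32_t)(snapshot >> 32);

	return atomic_compare_exchange_strong(e, &snapshot,
					      sketch_pack(new_val, cnt + 1));
}
```

If a thread takes a snapshot, and other threads then change the value from A to B and back to A, the value half matches again but the counter has advanced by two, so a write based on the stale snapshot fails instead of succeeding by mistake. ABA only becomes possible once the counter itself wraps, which is what the minimum ring size bounds.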

The lock-free ring is enabled via a new flag, RING_F_LF. Because the ring's
memsize is now a function of its flags (the lock-free ring requires 128b
for each entry), this commit adds a new argument ('flags') to
rte_ring_get_memsize(). An API deprecation notice will be sent in a
separate commit.

For ease-of-use, existing ring enqueue and dequeue functions work on both
regular and lock-free rings. This introduces an additional branch in the
datapath, but this should be a highly predictable branch.
ring_perf_autotest shows a negligible performance impact; it's hard to
distinguish a real difference from system noise.

                                  | ring_perf_autotest cycles with branch -
             Test                 |   ring_perf_autotest cycles without
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | -4.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 0.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | 0.00
SC empty dequeue                  | 1.00
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | 0.49
MP/MC bulk enq/dequeue (size 8)   | 0.08
SP/SC bulk enq/dequeue (size 32)  | 0.07
MP/MC bulk enq/dequeue (size 32)  | 0.09

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | 0.19
MP/MC bulk enq/dequeue (size 8)   | -0.37
SP/SC bulk enq/dequeue (size 32)  | 0.09
MP/MC bulk enq/dequeue (size 32)  | -0.05

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | -1.96
MP/MC bulk enq/dequeue (size 8)   | 0.88
SP/SC bulk enq/dequeue (size 32)  | 0.10
MP/MC bulk enq/dequeue (size 32)  | 0.46

Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. Each test was run three
times and the results were averaged.
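
For reference, the extra datapath branch measured above has roughly the
following shape. This is a simplified model with hypothetical sketch_* names
and stub workers, not the patch's inline functions; it only shows that the
flag test selects the path and is constant for a given ring.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RING_F_LF 0x0008u /* same value as RING_F_LF in this patch */

struct sketch_ring {
	unsigned int flags;
	unsigned int lf_calls;      /* times the lock-free path was taken */
	unsigned int regular_calls; /* times the regular path was taken */
};

/* Stub workers: they only record which path ran and report n enqueued. */
static unsigned int
sketch_do_lf_enqueue(struct sketch_ring *r, unsigned int n)
{
	r->lf_calls++;
	return n;
}

static unsigned int
sketch_do_enqueue(struct sketch_ring *r, unsigned int n)
{
	r->regular_calls++;
	return n;
}

/* One public entry point; the ring's flags never change after creation, so
 * this branch always goes the same way for a given ring.
 */
static unsigned int
sketch_enqueue_bulk(struct sketch_ring *r, unsigned int n)
{
	assert(r != NULL);

	if (r->flags & SKETCH_RING_F_LF)
		return sketch_do_lf_enqueue(r, n);

	return sketch_do_enqueue(r, n);
}
```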

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.c           |  92 +++++++--
 lib/librte_ring/rte_ring.h           | 308 ++++++++++++++++++++++++++---
 lib/librte_ring/rte_ring_c11_mem.h   | 366 ++++++++++++++++++++++++++++++++++-
 lib/librte_ring/rte_ring_generic.h   | 354 +++++++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   7 +
 5 files changed, 1080 insertions(+), 47 deletions(-)

diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d215acecc..d4a176f57 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -45,9 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags)
 {
-	ssize_t sz;
+	ssize_t sz, elt_sz;
 
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
@@ -57,10 +57,23 @@ rte_ring_get_memsize(unsigned count)
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	elt_sz = (flags & RING_F_LF) ? 2 * sizeof(void *) : sizeof(void *);
+
+	sz = sizeof(struct rte_ring) + count * elt_sz;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
+BIND_DEFAULT_SYMBOL(rte_ring_get_memsize, _v1905, 19.05);
+MAP_STATIC_SYMBOL(ssize_t rte_ring_get_memsize(unsigned int count,
+					       unsigned int flags),
+		  rte_ring_get_memsize_v1905);
+
+ssize_t
+rte_ring_get_memsize_v20(unsigned int count)
+{
+	return rte_ring_get_memsize_v1905(count, 0);
+}
+VERSION_SYMBOL(rte_ring_get_memsize, _v20, 2.0);
 
 int
 rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
@@ -75,6 +88,8 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 			  RTE_CACHE_LINE_MASK) != 0);
 	RTE_BUILD_BUG_ON((offsetof(struct rte_ring, prod) &
 			  RTE_CACHE_LINE_MASK) != 0);
+	RTE_BUILD_BUG_ON(sizeof(struct rte_ring_lf_entry) !=
+			 2 * sizeof(void *));
 
 	/* init the ring structure */
 	memset(r, 0, sizeof(*r));
@@ -82,8 +97,6 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	if (ret < 0 || ret >= (int)sizeof(r->name))
 		return -ENAMETOOLONG;
 	r->flags = flags;
-	r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
-	r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
 
 	if (flags & RING_F_EXACT_SZ) {
 		r->size = rte_align32pow2(count + 1);
@@ -100,12 +113,46 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 		r->mask = count - 1;
 		r->capacity = r->mask;
 	}
-	r->prod.head = r->cons.head = 0;
-	r->prod.tail = r->cons.tail = 0;
+
+	r->log2_size = rte_log2_u64(r->size);
+
+	if (flags & RING_F_LF) {
+		uint32_t i;
+
+		r->prod_ptr.single =
+			(flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons_ptr.single =
+			(flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod_ptr.head = r->cons_ptr.head = 0;
+		r->prod_ptr.tail = r->cons_ptr.tail = 0;
+
+		for (i = 0; i < r->size; i++) {
+			struct rte_ring_lf_entry *ring_ptr, *base;
+
+			base = (struct rte_ring_lf_entry *)&r->ring;
+
+			ring_ptr = &base[i & r->mask];
+
+			ring_ptr->cnt = 0;
+		}
+	} else {
+		r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod.head = r->cons.head = 0;
+		r->prod.tail = r->cons.tail = 0;
+	}
 
 	return 0;
 }
 
+/* If a ring entry is written on average every M cycles, then a ring entry is
+ * reused every M*count cycles, and a ring entry's counter repeats every
+ * M*count*2^32 cycles. If M=100 on a 2GHz system, then a 1024-entry ring's
+ * counters would repeat every ~2.5 days. The likelihood of ABA occurring is
+ * considered sufficiently low for 1024-entry and larger rings.
+ */
+#define MIN_32_BIT_LF_RING_SIZE 1024
+
 /* create the ring */
 struct rte_ring *
 rte_ring_create(const char *name, unsigned count, int socket_id,
@@ -123,11 +170,25 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 
 	ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
 
+#ifdef RTE_ARCH_64
+#if !defined(RTE_ARCH_X86_64)
+	printf("This platform does not support the atomic operation required for RING_F_LF\n");
+	rte_errno = EINVAL;
+	return NULL;
+#endif
+#else
+	if ((flags & RING_F_LF) && count < MIN_32_BIT_LF_RING_SIZE) {
+		printf("RING_F_LF is only supported on 32-bit platforms for rings with at least 1024 entries.\n");
+		rte_errno = EINVAL;
+		return NULL;
+	}
+#endif
+
 	/* for an exact size ring, round up from count to a power of two */
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize(count, flags);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -227,10 +288,17 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
 	fprintf(f, "  flags=%x\n", r->flags);
 	fprintf(f, "  size=%"PRIu32"\n", r->size);
 	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
-	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
-	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
-	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
-	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	if (r->flags & RING_F_LF) {
+		fprintf(f, "  ct=%"PRIuPTR"\n", r->cons_ptr.tail);
+		fprintf(f, "  ch=%"PRIuPTR"\n", r->cons_ptr.head);
+		fprintf(f, "  pt=%"PRIuPTR"\n", r->prod_ptr.tail);
+		fprintf(f, "  ph=%"PRIuPTR"\n", r->prod_ptr.head);
+	} else {
+		fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
+		fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
+		fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
+		fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	}
 	fprintf(f, "  used=%u\n", rte_ring_count(r));
 	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
 }
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index f16d77b8a..200d7b2a0 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -20,7 +20,7 @@
  *
  * - FIFO (First In First Out)
  * - Maximum size is fixed; the pointers are stored in a table.
- * - Lockless implementation.
+ * - Lockless (and optionally, non-blocking/lock-free) implementation.
  * - Multi- or single-consumer dequeue.
  * - Multi- or single-producer enqueue.
  * - Bulk dequeue.
@@ -98,6 +98,7 @@ struct rte_ring {
 	const struct rte_memzone *memzone;
 			/**< Memzone, if any, containing the rte_ring */
 	uint32_t size;           /**< Size of ring. */
+	uint32_t log2_size;      /**< log2(size of ring) */
 	uint32_t mask;           /**< Mask (size-1) of ring. */
 	uint32_t capacity;       /**< Usable size of ring */
 
@@ -133,6 +134,18 @@ struct rte_ring {
  */
 #define RING_F_EXACT_SZ 0x0004
 #define RTE_RING_SZ_MASK  (0x7fffffffU) /**< Ring size mask */
+/**
+ * The ring uses lock-free enqueue and dequeue functions. These functions
+ * do not have the "non-preemptive" constraint of a regular rte ring, and thus
+ * are suited for applications using preemptible pthreads. However, the
+ * lock-free functions have worse average-case performance than their regular
+ * rte ring counterparts. When used as the handler for a mempool, per-thread
+ * caching can mitigate the performance difference by reducing the number (and
+ * contention) of ring accesses.
+ *
+ * This flag is only supported on 32-bit and x86_64 platforms.
+ */
+#define RING_F_LF 0x0008
 
 /* @internal defines for passing to the enqueue dequeue worker functions */
 #define __IS_SP 1
@@ -150,11 +163,15 @@ struct rte_ring {
  *
  * @param count
  *   The number of elements in the ring (must be a power of 2).
+ * @param flags
+ *   The flags the ring will be created with.
  * @return
  *   - The memory size needed for the ring on success.
  *   - -EINVAL if count is not a power of 2.
  */
-ssize_t rte_ring_get_memsize(unsigned count);
+ssize_t rte_ring_get_memsize(unsigned int count, unsigned int flags);
+ssize_t rte_ring_get_memsize_v20(unsigned int count);
+ssize_t rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags);
 
 /**
  * Initialize a ring structure.
@@ -187,6 +204,10 @@ ssize_t rte_ring_get_memsize(unsigned count);
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_LF: If this flag is set, the ring uses lock-free variants of the
+ *      dequeue and enqueue functions.
  * @return
  *   0 on success, or a negative value on error.
  */
@@ -222,12 +243,17 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_LF: If this flag is set, the ring uses lock-free variants of the
+ *      dequeue and enqueue functions.
  * @return
  *   On success, the pointer to the new allocated ring. NULL on error with
  *    rte_errno set appropriately. Possible errno values include:
  *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
  *    - E_RTE_SECONDARY - function was called from a secondary process instance
- *    - EINVAL - count provided is not a power of 2
+ *    - EINVAL - count provided is not a power of 2, or RING_F_LF is used on an
+ *      unsupported platform
  *    - ENOSPC - the maximum number of memzones has already been allocated
  *    - EEXIST - a memzone with the same name already exists
  *    - ENOMEM - no appropriate memory area found in which to create memzone
@@ -283,6 +309,50 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual enqueue of pointers on the lock-free ring, used by the
+ * single-producer lock-free enqueue function.
+ */
+#define ENQUEUE_PTRS_LF(r, base, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	size_t idx = prod_head & (r)->mask; \
+	size_t new_cnt = prod_head + size; \
+	struct rte_ring_lf_entry *ring = (struct rte_ring_lf_entry *)base; \
+	unsigned int mask = ~0x3; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & mask); i += 4, idx += 4) { \
+			ring[idx].ptr = obj_table[i]; \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx + 1].ptr = obj_table[i + 1]; \
+			ring[idx + 1].cnt = (new_cnt + i + 1) >> r->log2_size; \
+			ring[idx + 2].ptr = obj_table[i + 2]; \
+			ring[idx + 2].cnt = (new_cnt + i + 2) >> r->log2_size; \
+			ring[idx + 3].ptr = obj_table[i + 3]; \
+			ring[idx + 3].cnt = (new_cnt + i + 3) >> r->log2_size; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx++].ptr = obj_table[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) { \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+		for (idx = 0; i < n; i++, idx++) {    \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+	} \
+} while (0)
+
 /* the actual copy of pointers on the ring to obj_table.
  * Placed here since identical code needed in both
  * single and multi consumer dequeue functions */
@@ -314,6 +384,43 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual copy of pointers on the lock-free ring to obj_table. */
+#define DEQUEUE_PTRS_LF(r, base, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	size_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	struct rte_ring_lf_entry *ring = (struct rte_ring_lf_entry *)base; \
+	unsigned int mask = ~0x3; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & mask); i += 4, idx += 4) {\
+			obj_table[i] = ring[idx].ptr; \
+			obj_table[i + 1] = ring[idx + 1].ptr; \
+			obj_table[i + 2] = ring[idx + 2].ptr; \
+			obj_table[i + 3] = ring[idx + 3].ptr; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 2: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 1: \
+			obj_table[i++] = ring[idx++].ptr; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+	} \
+} while (0)
+
+
+/* @internal 128-bit structure used by the lock-free ring */
+struct rte_ring_lf_entry {
+	void *ptr; /**< Data pointer */
+	uintptr_t cnt; /**< Modification counter */
+};
+
 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
  * There are 2 choices for the users
@@ -330,6 +437,70 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #endif
 
 /**
+ * @internal Enqueue several objects on the lock-free ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue(struct rte_ring *r, void * const *obj_table,
+			 unsigned int n, enum rte_ring_queue_behavior behavior,
+			 unsigned int is_sp, unsigned int *free_space)
+{
+	if (is_sp)
+		return __rte_ring_do_lf_enqueue_sp(r, obj_table, n,
+						   behavior, free_space);
+	else
+		return __rte_ring_do_lf_enqueue_mp(r, obj_table, n,
+						   behavior, free_space);
+}
+
+/**
+ * @internal Dequeue several objects from the lock-free ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue(struct rte_ring *r, void **obj_table,
+		 unsigned int n, enum rte_ring_queue_behavior behavior,
+		 unsigned int is_sc, unsigned int *available)
+{
+	if (is_sc)
+		return __rte_ring_do_lf_dequeue_sc(r, obj_table, n,
+						   behavior, available);
+	else
+		return __rte_ring_do_lf_dequeue_mc(r, obj_table, n,
+						   behavior, available);
+}
+
+/**
  * @internal Enqueue several objects on the ring
  *
   * @param r
@@ -436,8 +607,14 @@ static __rte_always_inline unsigned int
 rte_ring_mp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MP,
+					     free_space);
 }
 
 /**
@@ -459,8 +636,14 @@ static __rte_always_inline unsigned int
 rte_ring_sp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SP,
+					     free_space);
 }
 
 /**
@@ -486,8 +669,14 @@ static __rte_always_inline unsigned int
 rte_ring_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->prod_ptr.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -570,8 +759,14 @@ static __rte_always_inline unsigned int
 rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MC,
+					     available);
 }
 
 /**
@@ -594,8 +789,14 @@ static __rte_always_inline unsigned int
 rte_ring_sc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SC,
+					     available);
 }
 
 /**
@@ -621,8 +822,14 @@ static __rte_always_inline unsigned int
 rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
 		unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-				r->cons.single, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->cons_ptr.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->cons.single, available);
 }
 
 /**
@@ -697,9 +904,13 @@ rte_ring_dequeue(struct rte_ring *r, void **obj_p)
 static inline unsigned
 rte_ring_count(const struct rte_ring *r)
 {
-	uint32_t prod_tail = r->prod.tail;
-	uint32_t cons_tail = r->cons.tail;
-	uint32_t count = (prod_tail - cons_tail) & r->mask;
+	uint32_t count;
+
+	if (r->flags & RING_F_LF)
+		count = (r->prod_ptr.tail - r->cons_ptr.tail) & r->mask;
+	else
+		count = (r->prod.tail - r->cons.tail) & r->mask;
+
 	return (count > r->capacity) ? r->capacity : count;
 }
 
@@ -819,8 +1030,14 @@ static __rte_always_inline unsigned
 rte_ring_mp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MP, free_space);
 }
 
 /**
@@ -842,8 +1059,14 @@ static __rte_always_inline unsigned
 rte_ring_sp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SP, free_space);
 }
 
 /**
@@ -869,8 +1092,14 @@ static __rte_always_inline unsigned
 rte_ring_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_VARIABLE,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->prod_ptr.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -897,8 +1126,14 @@ static __rte_always_inline unsigned
 rte_ring_mc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MC, available);
 }
 
 /**
@@ -922,8 +1157,14 @@ static __rte_always_inline unsigned
 rte_ring_sc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SC, available);
 }
 
 /**
@@ -949,9 +1190,14 @@ static __rte_always_inline unsigned
 rte_ring_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-				RTE_RING_QUEUE_VARIABLE,
-				r->cons.single, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->cons_ptr.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->cons.single, available);
 }
 
 #ifdef __cplusplus
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 545caf257..a672d161e 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -221,8 +221,8 @@ __rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
+		/* load-acquire synchronize with store-release of tail in
+		 * __rte_ring_do_lf_dequeue_{sc, mc}.
 		 */
 		cons_tail = __atomic_load_n(&r->cons_ptr.tail,
 					__ATOMIC_ACQUIRE);
@@ -247,6 +247,7 @@ __rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
 					0, __ATOMIC_RELAXED,
 					__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+
 	return n;
 }
 
@@ -293,8 +294,8 @@ __rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
+		/* load-acquire synchronize with store-release of tail in
+		 * __rte_ring_do_lf_enqueue_{sp, mp}.
 		 */
 		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
 					__ATOMIC_ACQUIRE);
@@ -318,6 +319,363 @@ __rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
 							0, __ATOMIC_RELAXED,
 							__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	uintptr_t head, next;
+
+	n = __rte_ring_move_prod_head_ptr(r, 1, n, behavior,
+					  &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_LF(r, &r->ring, head, obj_table, n);
+
+	__atomic_store_n(&r->prod_ptr.tail,
+			 r->prod_ptr.tail + n,
+			 __ATOMIC_RELEASE);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/* This macro defines the number of times an enqueueing thread can fail to find
+ * a free ring slot before reloading its producer tail index.
+ */
+#define ENQ_RETRY_LIMIT 32
+
+/**
+ * @internal
+ *   Get the next producer tail index.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param idx
+ *   The local tail index
+ * @return
+ *   If the ring's tail is ahead of the local tail, return the shared tail.
+ *   Else, return idx + 1.
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_reload_tail(struct rte_ring *r, uintptr_t idx)
+{
+	uintptr_t fresh = __atomic_load_n(&r->prod_ptr.tail, __ATOMIC_RELAXED);
+
+	if ((intptr_t)(idx - fresh) < 0)
+		idx = fresh; /* fresh is after idx, use it instead */
+	else
+		idx++; /* Continue with next slot */
+
+	return idx;
+}
+
+/**
+ * @internal
+ *   Update the ring's producer tail index. If another thread already updated
+ *   the index beyond the caller's tail value, do nothing.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param val
+ *   The local tail index
+ * @return
+ *   If the shared tail is ahead of the local tail, return the shared tail.
+ *   Else, store val as the new shared tail and return val.
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_lf_update_tail(struct rte_ring *r, uintptr_t val)
+{
+	volatile uintptr_t *loc = &r->prod_ptr.tail;
+	uintptr_t old = __atomic_load_n(loc, __ATOMIC_RELAXED);
+
+	do {
+		/* Check if the tail has already been updated. */
+		if ((intptr_t)(val - old) < 0)
+			return old;
+
+		/* Else val >= old, need to update *loc */
+	} while (!__atomic_compare_exchange_n(loc, &old, val,
+					      1, __ATOMIC_RELEASE,
+					      __ATOMIC_RELAXED));
+
+	return val;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+	printf("[%s()] RING_F_LF requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
+	       , __func__);
+	return 0;
+#else
+	struct rte_ring_lf_entry *base;
+	uintptr_t head, next, tail;
+	unsigned int i;
+	uint32_t avail;
+
+	/* Atomically update the prod head to reserve n slots. The prod tail
+	 * is modified at the end of the function.
+	 */
+	n = __rte_ring_move_prod_head_ptr(r, 0, n, behavior,
+					  &head, &next, &avail);
+
+	tail = __atomic_load_n(&r->prod_ptr.tail, __ATOMIC_RELAXED);
+	head = __atomic_load_n(&r->cons_ptr.tail, __ATOMIC_ACQUIRE);
+
+	if (unlikely(n == 0))
+		goto end;
+
+	base = (struct rte_ring_lf_entry *)&r->ring;
+
+	for (i = 0; i < n; i++) {
+		unsigned int retries = 0;
+		int success = 0;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		do {
+			struct rte_ring_lf_entry old_value, new_value;
+			struct rte_ring_lf_entry *ring_ptr;
+
+			ring_ptr = &base[tail & r->mask];
+
+			old_value = *ring_ptr;
+
+			if (old_value.cnt != (tail >> r->log2_size)) {
+				/* This slot has already been used. Depending
+				 * on how far behind this thread is, either go
+				 * to the next slot or reload the tail.
+				 */
+				uintptr_t prev_tail;
+
+				prev_tail = (tail + r->size) >> r->log2_size;
+
+				if (old_value.cnt != prev_tail ||
+				    ++retries == ENQ_RETRY_LIMIT) {
+					/* This thread either fell 2+ laps
+					 * behind or hit the retry limit, so
+					 * reload the tail index.
+					 */
+					tail = __rte_ring_reload_tail(r, tail);
+					retries = 0;
+				} else {
+					/* Slot already used, try the next. */
+					tail++;
+
+				}
+
+				continue;
+			}
+
+			/* Found a free slot, try to enqueue next element. */
+			new_value.ptr = obj_table[i];
+			new_value.cnt = (tail + r->size) >> r->log2_size;
+
+#ifdef RTE_ARCH_64
+			success = rte_atomic128_cmp_exchange(
+					(rte_int128_t *)ring_ptr,
+					(rte_int128_t *)&old_value,
+					(rte_int128_t *)&new_value,
+					1, __ATOMIC_RELEASE,
+					__ATOMIC_RELAXED);
+#else
+			success = __atomic_compare_exchange(
+					(uint64_t *)ring_ptr,
+					&old_value,
+					&new_value,
+					1, __ATOMIC_RELEASE,
+					__ATOMIC_RELAXED);
+#endif
+		} while (success == 0);
+
+		/* Only increment tail if the CAS succeeds, since it can
+		 * spuriously fail on some architectures.
+		 */
+		tail++;
+	}
+
+end:
+	tail = __rte_ring_lf_update_tail(r, tail);
+
+	if (free_space != NULL)
+		*free_space = avail - n;
+	return n;
+#endif
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = __atomic_load_n(&r->cons_ptr.tail, __ATOMIC_RELAXED);
+	prod_tail = __atomic_load_n(&r->prod_ptr.tail, __ATOMIC_ACQUIRE);
+
+	avail = prod_tail - cons_tail;
+
+	/* Set the actual entries for dequeue */
+	if (unlikely(avail < n))
+		n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+	if (unlikely(n == 0))
+		goto end;
+
+	DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+	/* Use a read barrier and store-relaxed so we don't unnecessarily order
+	 * writes.
+	 */
+	rte_smp_rmb();
+
+	__atomic_store_n(&r->cons_ptr.tail, cons_tail + n, __ATOMIC_RELAXED);
+end:
+	if (available != NULL)
+		*available = avail - n;
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = __atomic_load_n(&r->cons_ptr.tail, __ATOMIC_RELAXED);
+
+	do {
+		/* Load tail on every iteration to avoid spurious queue empty
+		 * situations.
+		 */
+		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
+					    __ATOMIC_ACQUIRE);
+
+		avail = prod_tail - cons_tail;
+
+		/* Set the actual entries for dequeue */
+		if (unlikely(avail < n))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+		if (unlikely(n == 0))
+			goto end;
+
+		DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+		/* Use a read barrier and store-relaxed so we don't
+		 * unnecessarily order writes.
+		 */
+		rte_smp_rmb();
+
+	} while (!__atomic_compare_exchange_n(&r->cons_ptr.tail,
+					      &cons_tail, cons_tail + n,
+					      0, __ATOMIC_RELAXED,
+					      __ATOMIC_RELAXED));
+
+end:
+	if (available != NULL)
+		*available = avail - n;
+
 	return n;
 }
 
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 6a0e1bbfb..944b353f4 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -297,4 +297,358 @@ __rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
 	return n;
 }
 
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	uintptr_t head, next;
+
+	n = __rte_ring_move_prod_head_ptr(r, 1, n, behavior,
+					  &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_LF(r, &r->ring, head, obj_table, n);
+
+	rte_smp_wmb();
+
+	r->prod_ptr.tail += n;
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/* This macro defines the number of times an enqueueing thread can fail to find
+ * a free ring slot before reloading its producer tail index.
+ */
+#define ENQ_RETRY_LIMIT 32
+
+/**
+ * @internal
+ *   Get the next producer tail index.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param idx
+ *   The local tail index
+ * @return
+ *   If the ring's tail is ahead of the local tail, return the shared tail.
+ *   Else, return tail + 1.
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_reload_tail(struct rte_ring *r, uintptr_t idx)
+{
+	uintptr_t fresh = r->prod_ptr.tail;
+
+	if ((intptr_t)(idx - fresh) < 0)
+		/* fresh is after idx, use it instead */
+		idx = fresh;
+	else
+		/* Continue with next slot */
+		idx++;
+
+	return idx;
+}
+
+/**
+ * @internal
+ *   Update the ring's producer tail index. If another thread already updated
+ *   the index beyond the caller's tail value, do nothing.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param val
+ *   The local tail index
+ * @return
+ *   If the shared tail is ahead of the local tail, return the shared tail.
+ *   Else, return val (the new shared tail).
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_lf_update_tail(struct rte_ring *r, uintptr_t val)
+{
+	volatile uintptr_t *loc = &r->prod_ptr.tail;
+	uintptr_t old;
+
+	do {
+		/* Reload the tail; stop if it was already updated. */
+		old = *loc;
+		if ((intptr_t)(val - old) < 0)
+			return old;
+		/* Else val >= old, need to update *loc */
+	} while (!__sync_bool_compare_and_swap(loc, old, val));
+
+	return val;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+	printf("[%s()] RING_F_LF requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
+	       , __func__);
+	return 0;
+#else
+	struct rte_ring_lf_entry *base;
+	uintptr_t head, next, tail;
+	unsigned int i;
+	uint32_t avail;
+
+	/* Atomically update the prod head to reserve n slots. The prod tail
+	 * is modified at the end of the function.
+	 */
+	n = __rte_ring_move_prod_head_ptr(r, 0, n, behavior,
+					  &head, &next, &avail);
+
+	tail = r->prod_ptr.tail;
+
+	rte_smp_rmb();
+
+	head = r->cons_ptr.tail;
+
+	if (unlikely(n == 0))
+		goto end;
+
+	base = (struct rte_ring_lf_entry *)&r->ring;
+
+	for (i = 0; i < n; i++) {
+		unsigned int retries = 0;
+		int success = 0;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		do {
+			struct rte_ring_lf_entry old_value, new_value;
+			struct rte_ring_lf_entry *ring_ptr;
+
+			ring_ptr = &base[tail & r->mask];
+
+			old_value = *ring_ptr;
+
+			if (old_value.cnt != (tail >> r->log2_size)) {
+				/* This slot has already been used. Depending
+				 * on how far behind this thread is, either go
+				 * to the next slot or reload the tail.
+				 */
+				uintptr_t prev_tail;
+
+				prev_tail = (tail + r->size) >> r->log2_size;
+
+				if (old_value.cnt != prev_tail ||
+				    ++retries == ENQ_RETRY_LIMIT) {
+					/* This thread either fell 2+ laps
+					 * behind or hit the retry limit, so
+					 * reload the tail index.
+					 */
+					tail = __rte_ring_reload_tail(r, tail);
+					retries = 0;
+				} else {
+					/* Slot already used, try the next. */
+					tail++;
+
+				}
+
+				continue;
+			}
+
+			/* Found a free slot, try to enqueue next element. */
+			new_value.ptr = obj_table[i];
+			new_value.cnt = (tail + r->size) >> r->log2_size;
+
+#ifdef RTE_ARCH_64
+			success = rte_atomic128_cmp_exchange(
+					(rte_int128_t *)ring_ptr,
+					(rte_int128_t *)&old_value,
+					(rte_int128_t *)&new_value,
+					1, __ATOMIC_RELEASE,
+					__ATOMIC_RELAXED);
+#else
+			uint64_t *old_ptr = (uint64_t *)&old_value;
+			uint64_t *new_ptr = (uint64_t *)&new_value;
+
+			success = rte_atomic64_cmpset(
+					(volatile uint64_t *)ring_ptr,
+					*old_ptr, *new_ptr);
+#endif
+		} while (success == 0);
+
+		/* Only increment tail if the CAS succeeds, since it can
+		 * spuriously fail on some architectures.
+		 */
+		tail++;
+	}
+
+end:
+
+	tail = __rte_ring_lf_update_tail(r, tail);
+
+	if (free_space != NULL)
+		*free_space = avail - n;
+	return n;
+#endif
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = r->cons_ptr.tail;
+
+	rte_smp_rmb();
+
+	prod_tail = r->prod_ptr.tail;
+
+	avail = prod_tail - cons_tail;
+
+	/* Set the actual entries for dequeue */
+	if (unlikely(avail < n))
+		n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+	if (unlikely(n == 0))
+		goto end;
+
+	DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+	rte_smp_rmb();
+
+	r->cons_ptr.tail += n;
+end:
+	if (available != NULL)
+		*available = avail - n;
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = r->cons_ptr.tail;
+
+	do {
+		rte_smp_rmb();
+
+		/* Load tail on every iteration to avoid spurious queue empty
+		 * situations.
+		 */
+		prod_tail = r->prod_ptr.tail;
+
+		avail = prod_tail - cons_tail;
+
+		/* Set the actual entries for dequeue */
+		if (unlikely(avail < n))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+		if (unlikely(n == 0))
+			goto end;
+
+		DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+	} while (!__sync_bool_compare_and_swap(&r->cons_ptr.tail,
+					       cons_tail, cons_tail + n));
+
+end:
+	if (available != NULL)
+		*available = avail - n;
+
+	return n;
+}
+
 #endif /* _RTE_RING_GENERIC_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index d935efd0d..8969467af 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -17,3 +17,10 @@ DPDK_2.2 {
 	rte_ring_free;
 
 } DPDK_2.0;
+
+DPDK_19.05 {
+	global:
+
+	rte_ring_get_memsize;
+
+} DPDK_2.2;
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v5 4/6] test_ring: add lock-free ring autotest
  2019-03-05 17:40       ` [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
                           ` (2 preceding siblings ...)
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 3/6] ring: add a lock-free implementation Gage Eads
@ 2019-03-05 17:40         ` Gage Eads
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 5/6] test_ring_perf: add lock-free ring perf test Gage Eads
                           ` (2 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-05 17:40 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

ring_lf_autotest re-uses the ring_autotest code by wrapping its top-level
function with one that takes a 'flags' argument.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring.c | 61 ++++++++++++++++++++++++++++++++-------------------
 1 file changed, 38 insertions(+), 23 deletions(-)

diff --git a/test/test/test_ring.c b/test/test/test_ring.c
index aaf1e70ad..400b1bffd 100644
--- a/test/test/test_ring.c
+++ b/test/test/test_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <string.h>
@@ -601,18 +601,20 @@ test_ring_burst_basic(struct rte_ring *r)
  * it will always fail to create ring with a wrong ring size number in this function
  */
 static int
-test_ring_creation_with_wrong_size(void)
+test_ring_creation_with_wrong_size(unsigned int flags)
 {
 	struct rte_ring * rp = NULL;
 
 	/* Test if ring size is not power of 2 */
-	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1,
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
 
 	/* Test if ring size is exceeding the limit */
-	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1), SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1),
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
@@ -623,11 +625,11 @@ test_ring_creation_with_wrong_size(void)
  * it tests if it would always fail to create ring with an used ring name
  */
 static int
-test_ring_creation_with_an_used_name(void)
+test_ring_creation_with_an_used_name(unsigned int flags)
 {
 	struct rte_ring * rp;
 
-	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (NULL != rp)
 		return -1;
 
@@ -639,10 +641,10 @@ test_ring_creation_with_an_used_name(void)
  * function to fail correctly
  */
 static int
-test_create_count_odd(void)
+test_create_count_odd(unsigned int flags)
 {
 	struct rte_ring *r = rte_ring_create("test_ring_count",
-			4097, SOCKET_ID_ANY, 0 );
+			4097, SOCKET_ID_ANY, flags);
 	if(r != NULL){
 		return -1;
 	}
@@ -665,7 +667,7 @@ test_lookup_null(void)
  * it tests some more basic ring operations
  */
 static int
-test_ring_basic_ex(void)
+test_ring_basic_ex(unsigned int flags)
 {
 	int ret = -1;
 	unsigned i;
@@ -679,7 +681,7 @@ test_ring_basic_ex(void)
 	}
 
 	rp = rte_ring_create("test_ring_basic_ex", RING_SIZE, SOCKET_ID_ANY,
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (rp == NULL) {
 		printf("test_ring_basic_ex fail to create ring\n");
 		goto fail_test;
@@ -737,22 +739,22 @@ test_ring_basic_ex(void)
 }
 
 static int
-test_ring_with_exact_size(void)
+test_ring_with_exact_size(unsigned int flags)
 {
 	struct rte_ring *std_ring = NULL, *exact_sz_ring = NULL;
-	void *ptr_array[16];
+	void *ptr_array[1024];
 	static const unsigned int ring_sz = RTE_DIM(ptr_array);
 	unsigned int i;
 	int ret = -1;
 
 	std_ring = rte_ring_create("std", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (std_ring == NULL) {
 		printf("%s: error, can't create std ring\n", __func__);
 		goto end;
 	}
 	exact_sz_ring = rte_ring_create("exact sz", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ);
+		RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ | flags);
 	if (exact_sz_ring == NULL) {
 		printf("%s: error, can't create exact size ring\n", __func__);
 		goto end;
@@ -770,7 +772,7 @@ test_ring_with_exact_size(void)
 	}
 	/*
 	 * check that the exact_sz_ring can hold one more element than the
-	 * standard ring. (16 vs 15 elements)
+	 * standard ring. (1024 vs 1023 elements)
 	 */
 	for (i = 0; i < ring_sz - 1; i++) {
 		rte_ring_enqueue(std_ring, NULL);
@@ -808,17 +810,17 @@ test_ring_with_exact_size(void)
 }
 
 static int
-test_ring(void)
+__test_ring(unsigned int flags)
 {
 	struct rte_ring *r = NULL;
 
 	/* some more basic operations */
-	if (test_ring_basic_ex() < 0)
+	if (test_ring_basic_ex(flags) < 0)
 		goto test_fail;
 
 	rte_atomic32_init(&synchro);
 
-	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (r == NULL)
 		goto test_fail;
 
@@ -837,27 +839,27 @@ test_ring(void)
 		goto test_fail;
 
 	/* basic operations */
-	if ( test_create_count_odd() < 0){
+	if (test_create_count_odd(flags) < 0) {
 		printf("Test failed to detect odd count\n");
 		goto test_fail;
 	} else
 		printf("Test detected odd count\n");
 
-	if ( test_lookup_null() < 0){
+	if (test_lookup_null() < 0) {
 		printf("Test failed to detect NULL ring lookup\n");
 		goto test_fail;
 	} else
 		printf("Test detected NULL ring lookup\n");
 
 	/* test of creating ring with wrong size */
-	if (test_ring_creation_with_wrong_size() < 0)
+	if (test_ring_creation_with_wrong_size(flags) < 0)
 		goto test_fail;
 
 	/* test of creation ring with an used name */
-	if (test_ring_creation_with_an_used_name() < 0)
+	if (test_ring_creation_with_an_used_name(flags) < 0)
 		goto test_fail;
 
-	if (test_ring_with_exact_size() < 0)
+	if (test_ring_with_exact_size(flags) < 0)
 		goto test_fail;
 
 	/* dump the ring status */
@@ -873,4 +875,17 @@ test_ring(void)
 	return -1;
 }
 
+static int
+test_ring(void)
+{
+	return __test_ring(0);
+}
+
+static int
+test_lf_ring(void)
+{
+	return __test_ring(RING_F_LF);
+}
+
 REGISTER_TEST_COMMAND(ring_autotest, test_ring);
+REGISTER_TEST_COMMAND(ring_lf_autotest, test_lf_ring);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v5 5/6] test_ring_perf: add lock-free ring perf test
  2019-03-05 17:40       ` [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
                           ` (3 preceding siblings ...)
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 4/6] test_ring: add lock-free ring autotest Gage Eads
@ 2019-03-05 17:40         ` Gage Eads
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 6/6] mempool/ring: add lock-free ring handlers Gage Eads
  2019-03-06 15:03         ` [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-05 17:40 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

ring_lf_perf_autotest re-uses the ring_perf_autotest code by wrapping its
top-level function with one that takes a 'flags' argument.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 test/test/test_ring_perf.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/test/test/test_ring_perf.c b/test/test/test_ring_perf.c
index ebb3939f5..be465c758 100644
--- a/test/test/test_ring_perf.c
+++ b/test/test/test_ring_perf.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 
@@ -363,12 +363,12 @@ test_bulk_enqueue_dequeue(struct rte_ring *r)
 }
 
 static int
-test_ring_perf(void)
+__test_ring_perf(unsigned int flags)
 {
 	struct lcore_pair cores;
 	struct rte_ring *r = NULL;
 
-	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), 0);
+	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), flags);
 	if (r == NULL)
 		return -1;
 
@@ -398,4 +398,17 @@ test_ring_perf(void)
 	return 0;
 }
 
+static int
+test_ring_perf(void)
+{
+	return __test_ring_perf(0);
+}
+
+static int
+test_lf_ring_perf(void)
+{
+	return __test_ring_perf(RING_F_LF);
+}
+
 REGISTER_TEST_COMMAND(ring_perf_autotest, test_ring_perf);
+REGISTER_TEST_COMMAND(ring_lf_perf_autotest, test_lf_ring_perf);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v5 6/6] mempool/ring: add lock-free ring handlers
  2019-03-05 17:40       ` [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
                           ` (4 preceding siblings ...)
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 5/6] test_ring_perf: add lock-free ring perf test Gage Eads
@ 2019-03-05 17:40         ` Gage Eads
  2019-03-06 15:03         ` [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-05 17:40 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

These handlers allow an application to create a mempool based on the
lock-free ring, with any combination of single/multi producer/consumer.

Also, add a note to the programmer's guide's "known issues" section.

Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
---
 doc/guides/prog_guide/env_abstraction_layer.rst | 10 +++++
 drivers/mempool/ring/Makefile                   |  1 +
 drivers/mempool/ring/meson.build                |  2 +
 drivers/mempool/ring/rte_mempool_ring.c         | 58 +++++++++++++++++++++++--
 4 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 929d76dba..2e2516465 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,6 +541,16 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
+  Alternatively, 32-bit and x86_64 applications can use the lock-free ring
+  mempool handler. When considering it, note that:
+
+  - Among 64-bit architectures it is currently limited to the x86_64 platform,
+    because it uses a function (16-byte compare-and-swap) that is not yet
+    available on other platforms.
+  - It has worse average-case performance than the non-preemptive rte_ring, but
+    software caching (e.g. the mempool cache) can mitigate this by reducing the
+    number of handler operations.
+
 + rte_timer
 
   Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
diff --git a/drivers/mempool/ring/Makefile b/drivers/mempool/ring/Makefile
index ddab522fe..012ba6966 100644
--- a/drivers/mempool/ring/Makefile
+++ b/drivers/mempool/ring/Makefile
@@ -10,6 +10,7 @@ LIB = librte_mempool_ring.a
 
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal -lrte_mempool -lrte_ring
 
 EXPORT_MAP := rte_mempool_ring_version.map
diff --git a/drivers/mempool/ring/meson.build b/drivers/mempool/ring/meson.build
index a021e908c..b1cb673cc 100644
--- a/drivers/mempool/ring/meson.build
+++ b/drivers/mempool/ring/meson.build
@@ -1,4 +1,6 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
+allow_experimental_apis = true
+
 sources = files('rte_mempool_ring.c')
diff --git a/drivers/mempool/ring/rte_mempool_ring.c b/drivers/mempool/ring/rte_mempool_ring.c
index bc123fc52..48041ae69 100644
--- a/drivers/mempool/ring/rte_mempool_ring.c
+++ b/drivers/mempool/ring/rte_mempool_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2016 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <stdio.h>
@@ -47,11 +47,11 @@ common_ring_get_count(const struct rte_mempool *mp)
 
 
 static int
-common_ring_alloc(struct rte_mempool *mp)
+__common_ring_alloc(struct rte_mempool *mp, int rg_flags)
 {
-	int rg_flags = 0, ret;
 	char rg_name[RTE_RING_NAMESIZE];
 	struct rte_ring *r;
+	int ret;
 
 	ret = snprintf(rg_name, sizeof(rg_name),
 		RTE_MEMPOOL_MZ_FORMAT, mp->name);
@@ -82,6 +82,18 @@ common_ring_alloc(struct rte_mempool *mp)
 	return 0;
 }
 
+static int
+common_ring_alloc(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, 0);
+}
+
+static int
+common_ring_alloc_lf(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, RING_F_LF);
+}
+
 static void
 common_ring_free(struct rte_mempool *mp)
 {
@@ -130,7 +142,47 @@ static const struct rte_mempool_ops ops_sp_mc = {
 	.get_count = common_ring_get_count,
 };
 
+static const struct rte_mempool_ops ops_mp_mc_lf = {
+	.name = "ring_mp_mc_lf",
+	.alloc = common_ring_alloc_lf,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_sc_lf = {
+	.name = "ring_sp_sc_lf",
+	.alloc = common_ring_alloc_lf,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_mp_sc_lf = {
+	.name = "ring_mp_sc_lf",
+	.alloc = common_ring_alloc_lf,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_mc_lf = {
+	.name = "ring_sp_mc_lf",
+	.alloc = common_ring_alloc_lf,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
 MEMPOOL_REGISTER_OPS(ops_mp_mc);
 MEMPOOL_REGISTER_OPS(ops_sp_sc);
 MEMPOOL_REGISTER_OPS(ops_mp_sc);
 MEMPOOL_REGISTER_OPS(ops_sp_mc);
+MEMPOOL_REGISTER_OPS(ops_mp_mc_lf);
+MEMPOOL_REGISTER_OPS(ops_sp_sc_lf);
+MEMPOOL_REGISTER_OPS(ops_mp_sc_lf);
+MEMPOOL_REGISTER_OPS(ops_sp_mc_lf);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler
  2019-03-05 17:40       ` [dpdk-dev] [PATCH v5 0/6] Add lock-free ring and mempool handler Gage Eads
                           ` (5 preceding siblings ...)
  2019-03-05 17:40         ` [dpdk-dev] [PATCH v5 6/6] mempool/ring: add lock-free ring handlers Gage Eads
@ 2019-03-06 15:03         ` Gage Eads
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 1/6] ring: add a pointer-width headtail structure Gage Eads
                             ` (6 more replies)
  6 siblings, 7 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-06 15:03 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a lock-free ring and a mempool based on it. The
lock-free algorithm relies on a double-pointer compare-and-swap, so for 64-bit
architectures it is currently limited to x86_64.

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + n) CAS operations and a
dequeue of n pointers uses 1. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the lock-free ring's.
- Software caching can mitigate the average-case performance penalty for
  ring-based algorithms; for example, a lock-free ring based mempool (a likely
  use case for this ring) with per-thread caching.
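
As a rough illustration of the two ideas above (a per-slot CAS on enqueue, and a
producer tail update that never waits on a preempted thread), here is a minimal
sketch. It is not the patch's code: the 8-slot ring, the packed 64-bit slot
(32-bit payload plus 32-bit lap counter, loosely mirroring the 32-bit variant),
and the names slots, prod_tail, try_enqueue_at and advance_tail are all
assumptions made for this example. Head reservation and the consumer side are
omitted.

```c
#include <stdatomic.h>
#include <stdint.h>

#define LOG2_SIZE 3u                 /* hypothetical 8-slot ring */
#define SIZE      (1u << LOG2_SIZE)
#define MASK      (SIZE - 1u)

/* Each slot packs a 32-bit payload (low half) and a 32-bit lap counter
 * (high half) so that one 64-bit CAS covers both fields.
 */
static _Atomic uint64_t slots[SIZE]; /* zeroed: every slot is free for lap 0 */
static _Atomic uint32_t prod_tail;   /* shared producer tail index */

static uint32_t slot_val(uint64_t e) { return (uint32_t)e; }
static uint32_t slot_cnt(uint64_t e) { return (uint32_t)(e >> 32); }

/* Try to publish val at ring index tail. Returns 1 on success, or 0 if the
 * slot was already filled for this lap or another producer won the CAS.
 */
static int try_enqueue_at(uint32_t tail, uint32_t val)
{
	_Atomic uint64_t *slot = &slots[tail & MASK];
	uint64_t old = atomic_load_explicit(slot, memory_order_relaxed);
	uint64_t desired;

	/* The slot is free for this index only if its counter equals the lap. */
	if (slot_cnt(old) != (tail >> LOG2_SIZE))
		return 0;

	/* Stamp the slot with the next lap so stale producers skip it. */
	desired = ((uint64_t)((tail + SIZE) >> LOG2_SIZE) << 32) | val;

	return atomic_compare_exchange_strong_explicit(slot, &old, desired,
			memory_order_release, memory_order_relaxed);
}

/* Move the shared tail forward to val unless another producer already moved
 * it past val. Returns the resulting shared tail. No thread ever waits for
 * another thread to finish its update.
 */
static uint32_t advance_tail(uint32_t val)
{
	uint32_t old = atomic_load_explicit(&prod_tail, memory_order_relaxed);

	/* The signed difference stays correct when the index wraps around. */
	while ((int32_t)(val - old) >= 0) {
		/* On failure the CAS reloads old with the current shared tail. */
		if (atomic_compare_exchange_weak_explicit(&prod_tail, &old, val,
				memory_order_release, memory_order_relaxed))
			return val;
	}

	return old;
}
```

Starting from a zeroed ring, try_enqueue_at(0, 100) succeeds; a second
try_enqueue_at(0, 999) fails because the slot now carries lap 1; and
try_enqueue_at(SIZE, 200) succeeds because index SIZE belongs to lap 1.
Likewise, advance_tail(1) after advance_tail(3) leaves the shared tail at 3.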

The lock-free ring is enabled via a new flag, RING_F_LF. For ease-of-use,
existing ring enqueue/dequeue functions work with both standard and lock-free
rings. This is also an experimental API, so RING_F_LF users must build with the
ALLOW_EXPERIMENTAL_API flag.

This patchset also adds lock-free versions of ring_autotest and
ring_perf_autotest, and a lock-free ring based mempool.

This patchset makes one API change; a deprecation notice was posted in a
separate commit[1].

This patchset depends on the 128-bit compare-and-set patch[2].

[1] http://mails.dpdk.org/archives/dev/2019-February/124321.html
[2] http://mails.dpdk.org/archives/dev/2019-March/125751.html

v6:
- Rebase patchset onto master (test/test/ -> app/test/)

v5:
 - Incorporated lfring's enqueue and dequeue logic from
   http://mails.dpdk.org/archives/dev/2019-January/124242.html
 - Renamed non-blocking -> lock-free and NB -> LF to align with a similar
   change in the lock-free stack patchset:
   http://mails.dpdk.org/archives/dev/2019-March/125797.html
 - Added support for 32-bit architectures by using the full 32b of the
   modification counter and requiring LF rings on these architectures to be at
   least 1024 entries.
 - Updated to the latest rte_atomic128_cmp_exchange() interface.
 - Added ring start marker to struct rte_ring

v4:
 - Split out nb_enqueue and nb_dequeue functions in generic and C11 versions,
   with the necessary memory ordering behavior for weakly consistent machines.
 - Convert size_t variables (from v2) to uint64_t and remove the
   no-longer-applicable comment about variably-sized ring indexes.
 - Fix bug in nb_enqueue_mp that breaks the non-blocking guarantee.
 - Split the ring_ptr cast into two lines.
 - Change the dependent patchset from the non-blocking stack patch series
   to one only containing the 128b CAS commit

v3:
 - Avoid the ABI break by putting 64-bit head and tail values in the same
   cacheline as struct rte_ring's prod and cons members.
 - Don't attempt to compile rte_atomic128_cmpset without
   ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
 - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone tries
   to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
 - Update the ring mempool to use experimental APIs
 - Clarify that RING_F_NB is only limited to x86_64 currently; e.g. ARMv8 has the
   ISA support for 128-bit CAS to eventually support it.

v2:
 - Merge separate docs commit into patch #5
 - Convert uintptr_t to size_t
 - Add a compile-time check for the size of size_t
 - Fix a space-after-typecast issue
 - Fix an unnecessary-parentheses checkpatch warning
 - Bump librte_ring's library version

Gage Eads (6):
  ring: add a pointer-width headtail structure
  ring: add a ring start marker
  ring: add a lock-free implementation
  test_ring: add lock-free ring autotest
  test_ring_perf: add lock-free ring perf test
  mempool/ring: add lock-free ring handlers

 app/test/test_ring.c                            |  61 +--
 app/test/test_ring_perf.c                       |  19 +-
 doc/guides/prog_guide/env_abstraction_layer.rst |  10 +
 drivers/mempool/ring/Makefile                   |   1 +
 drivers/mempool/ring/meson.build                |   2 +
 drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
 lib/librte_ring/rte_ring.c                      |  92 ++++-
 lib/librte_ring/rte_ring.h                      | 334 ++++++++++++++--
 lib/librte_ring/rte_ring_c11_mem.h              | 501 ++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h              | 484 +++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map            |   7 +
 11 files changed, 1492 insertions(+), 77 deletions(-)

-- 
2.13.6


* [dpdk-dev] [PATCH v6 1/6] ring: add a pointer-width headtail structure
  2019-03-06 15:03         ` [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
@ 2019-03-06 15:03           ` Gage Eads
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 2/6] ring: add a ring start marker Gage Eads
                             ` (5 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-06 15:03 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

For 64-bit systems, at current CPU speeds, 64-bit head and tail indexes
will not wrap around within the author's lifetime. This is important for
avoiding the ABA problem -- in which a thread mistakes reading the same
tail index in two accesses to mean that the ring was not modified in the
intervening time -- in the upcoming lock-free ring implementation. Using a
64-bit index makes the possibility of this occurring effectively zero. This
commit uses pointer-width indexes so the lock-free ring can support 32-bit
systems as well.
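
To make the ABA hazard concrete, here is a minimal sketch that is not part of
this patch. It models an index of a given width and checks whether a stale
snapshot would compare equal after the index has advanced. The function name
and the starting index are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper, for illustration only: model a head/tail index that
 * is 'bits' wide (1..64) and report whether a snapshot taken earlier would
 * compare equal to the index after it has advanced 'advances' times.
 * A return value of 1 with advances != 0 is the ABA false match.
 */
static int
index_looks_unchanged(unsigned int bits, uint64_t advances)
{
	const uint64_t mask = (bits >= 64) ? UINT64_MAX : ((1ULL << bits) - 1);
	const uint64_t snapshot = 42 & mask;	/* arbitrary starting index */
	const uint64_t current = (snapshot + advances) & mask;

	return current == snapshot;
}
```

With a 32-bit index, 2^32 intervening updates reproduce the snapshot value,
so index_looks_unchanged(32, 1ULL << 32) reports a match; with a 64-bit index
the same number of updates does not.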

This commit places the new producer and consumer structures in the same
location in struct rte_ring as their 32-bit counterparts. Since the 32-bit
versions are padded out to a cache line, there is space for the new
structure without affecting the layout of struct rte_ring. Thus, the ABI is
preserved.
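
The layout argument can be checked with a small stand-alone sketch. The
demo_* types below are simplified stand-ins for struct rte_ring and its
head/tail structures, not the real definitions, and the 64-byte cache line
and the GCC/clang aligned attribute are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64	/* assumed cache line size */
#define demo_cache_aligned __attribute__((aligned(DEMO_CACHE_LINE)))

struct demo_headtail {		/* stand-in for the 32-bit head/tail pair */
	volatile uint32_t head;
	volatile uint32_t tail;
	uint32_t single;
};

struct demo_headtail_ptr {	/* stand-in for the pointer-width pair */
	volatile uintptr_t head;
	volatile uintptr_t tail;
	uint32_t single;
};

/* The pointer-width structure must fit in the padding of one cache line. */
_Static_assert(sizeof(struct demo_headtail_ptr) <= DEMO_CACHE_LINE,
	       "pointer-width head/tail must fit in one cache line");

struct demo_ring_old {		/* layout before the change */
	uint32_t size;
	struct demo_headtail prod demo_cache_aligned;
	char pad1 demo_cache_aligned;
	struct demo_headtail cons demo_cache_aligned;
	char pad2 demo_cache_aligned;
};

struct demo_ring_new {		/* layout after the change */
	uint32_t size;
	union {
		struct demo_headtail prod demo_cache_aligned;
		struct demo_headtail_ptr prod_ptr demo_cache_aligned;
	};
	char pad1 demo_cache_aligned;
	union {
		struct demo_headtail cons demo_cache_aligned;
		struct demo_headtail_ptr cons_ptr demo_cache_aligned;
	};
	char pad2 demo_cache_aligned;
};

/* Returns 1 if every member keeps its offset and the total size is equal. */
static int
demo_layout_preserved(void)
{
	return sizeof(struct demo_ring_old) == sizeof(struct demo_ring_new) &&
	       offsetof(struct demo_ring_old, prod) ==
			offsetof(struct demo_ring_new, prod_ptr) &&
	       offsetof(struct demo_ring_old, cons) ==
			offsetof(struct demo_ring_new, cons_ptr) &&
	       offsetof(struct demo_ring_old, pad2) ==
			offsetof(struct demo_ring_new, pad2);
}
```

Because each union is cache-line aligned and its larger member is smaller
than a cache line, the members that follow keep their offsets.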

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.h         |  21 +++++-
 lib/librte_ring/rte_ring_c11_mem.h | 143 +++++++++++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h | 130 +++++++++++++++++++++++++++++++++
 3 files changed, 291 insertions(+), 3 deletions(-)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index af5444a9f..c78db6916 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -70,6 +70,13 @@ struct rte_ring_headtail {
 	uint32_t single;         /**< True if single prod/cons */
 };
 
+/* Structure to hold a pair of pointer-sized head/tail values and metadata */
+struct rte_ring_headtail_ptr {
+	volatile uintptr_t head; /**< Prod/consumer head. */
+	volatile uintptr_t tail; /**< Prod/consumer tail. */
+	uint32_t single;         /**< True if single prod/cons */
+};
+
 /**
  * An RTE ring structure.
  *
@@ -97,11 +104,19 @@ struct rte_ring {
 	char pad0 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring producer status. */
-	struct rte_ring_headtail prod __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail prod __rte_cache_aligned;
+		struct rte_ring_headtail_ptr prod_ptr __rte_cache_aligned;
+	};
 	char pad1 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring consumer status. */
-	struct rte_ring_headtail cons __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail cons __rte_cache_aligned;
+		struct rte_ring_headtail_ptr cons_ptr __rte_cache_aligned;
+	};
 	char pad2 __rte_cache_aligned; /**< empty cache line */
 };
 
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a337..545caf257 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -178,4 +178,147 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	uintptr_t cons_tail;
+	unsigned int max = n;
+	int success;
+
+	*old_head = __atomic_load_n(&r->prod_ptr.head, __ATOMIC_RELAXED);
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		cons_tail = __atomic_load_n(&r->cons_ptr.tail,
+					__ATOMIC_ACQUIRE);
+
+		*free_entries = (capacity + cons_tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_ptr.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head is updated */
+			success = __atomic_compare_exchange_n(&r->prod_ptr.head,
+					old_head, *new_head,
+					0, __ATOMIC_RELAXED,
+					__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	uintptr_t prod_tail;
+	int success;
+
+	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons_ptr.head, __ATOMIC_RELAXED);
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
+					__ATOMIC_ACQUIRE);
+
+		*entries = (prod_tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_ptr.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head will be updated */
+			success = __atomic_compare_exchange_n(&r->cons_ptr.head,
+							old_head, *new_head,
+							0, __ATOMIC_RELAXED,
+							__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_C11_MEM_H_ */
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..6a0e1bbfb 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -167,4 +167,134 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		*old_head = r->prod_ptr.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		*free_entries = (capacity + r->cons_ptr.tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_ptr.head = *new_head, success = 1;
+		else
+			success = __sync_bool_compare_and_swap(
+					&r->prod_ptr.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		*old_head = r->cons_ptr.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		*entries = (r->prod_ptr.tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_ptr.head = *new_head, success = 1;
+		else
+			success = __sync_bool_compare_and_swap(
+					&r->cons_ptr.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_GENERIC_H_ */
-- 
2.13.6


* [dpdk-dev] [PATCH v6 2/6] ring: add a ring start marker
  2019-03-06 15:03         ` [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 1/6] ring: add a pointer-width headtail structure Gage Eads
@ 2019-03-06 15:03           ` Gage Eads
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 3/6] ring: add a lock-free implementation Gage Eads
                             ` (4 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-06 15:03 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

This marker allows us to replace "&r[1]" with "&r->ring" to locate the
start of the ring.
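
As a rough illustration of why the two expressions name the same address,
consider the sketch below. struct demo_ring is a simplified stand-in for
struct rte_ring, and the 64-byte alignment and the GCC/clang aligned
attribute are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64	/* assumed cache line size */
#define demo_cache_aligned __attribute__((aligned(DEMO_CACHE_LINE)))

/* Simplified stand-in for struct rte_ring with a trailing start marker. */
struct demo_ring {
	uint32_t size;
	uint32_t mask;
	char pad demo_cache_aligned;	/* empty cache line */
	void *ring[] demo_cache_aligned;/* marker: first ring entry */
};

/* "&r[1]" is sizeof(*r) bytes past r, and "&r->ring" is
 * offsetof(struct demo_ring, ring) bytes past r. The two expressions
 * locate the same byte exactly when these two values are equal.
 */
static int
demo_marker_matches_r1(void)
{
	return offsetof(struct demo_ring, ring) == sizeof(struct demo_ring);
}
```

The named marker states the intent directly and does not depend on the reader
knowing that the entries start one struct-size past the header.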

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index c78db6916..f16d77b8a 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -118,6 +118,7 @@ struct rte_ring {
 		struct rte_ring_headtail_ptr cons_ptr __rte_cache_aligned;
 	};
 	char pad2 __rte_cache_aligned; /**< empty cache line */
+	void *ring[] __rte_cache_aligned; /**< empty marker for ring start */
 };
 
 #define RING_F_SP_ENQ 0x0001 /**< The default enqueue is "single-producer". */
@@ -361,7 +362,7 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
 	if (n == 0)
 		goto end;
 
-	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
+	ENQUEUE_PTRS(r, &r->ring, prod_head, obj_table, n, void *);
 
 	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
 end:
@@ -403,7 +404,7 @@ __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
 	if (n == 0)
 		goto end;
 
-	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);
+	DEQUEUE_PTRS(r, &r->ring, cons_head, obj_table, n, void *);
 
 	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
 
-- 
2.13.6


* [dpdk-dev] [PATCH v6 3/6] ring: add a lock-free implementation
  2019-03-06 15:03         ` [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 1/6] ring: add a pointer-width headtail structure Gage Eads
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 2/6] ring: add a ring start marker Gage Eads
@ 2019-03-06 15:03           ` Gage Eads
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 4/6] test_ring: add lock-free ring autotest Gage Eads
                             ` (3 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-06 15:03 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

This commit adds support for lock-free circular ring enqueue and dequeue
functions. The ring is supported on 32- and 64-bit architectures; however,
on a 64-bit architecture it uses a 128-bit compare-and-swap instruction, and
thus 64-bit support is currently limited to x86_64.

The algorithm is based on Ola Liljedahl's lfring, modified to fit within
the rte ring API. With no contention, an enqueue of n pointers uses (1 + n)
CAS operations and a dequeue of n pointers uses 1. This algorithm has worse
average-case performance than the regular rte ring (particularly a
highly-contended ring with large bulk accesses), however:
- For applications with preemptible pthreads, the regular rte ring's
  worst-case performance (i.e. one thread being preempted in the
  update_tail() critical section) is much worse than the lock-free ring's.
- Software caching can mitigate the lower average-case performance of
  ring-based algorithms; for example, a lock-free ring based mempool (a
  likely use case for this ring) can use per-thread caching.

To avoid the ABA problem, each ring entry contains a modification counter.
On a 64-bit architecture, the chance of ABA occurring is effectively zero;
a 64-bit counter will take many years to wrap at current CPU frequencies.
On 32-bit architectures, a lock-free ring must be at least 1024 entries
deep; assuming 100 cycles per ring entry access, this ensures the ring's
modification counters take on the order of days to wrap.
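
The sketch below is a stand-alone illustration with hypothetical demo_*
helpers, not code from the patch. It shows how a slot index and its
modification counter can be derived from a ring position, following the
arithmetic in ENQUEUE_PTRS_LF, and gives a rough estimate of the counter wrap
time. The 2 GHz clock is the figure used in the code comment in rte_ring.c;
the rest is assumed for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Slot used by ring position 'pos' in a ring of 2^log2_size entries. */
static uint64_t
demo_slot(uint64_t pos, unsigned int log2_size)
{
	return pos & ((1ULL << log2_size) - 1);
}

/* Counter stored with the entry written at 'pos': (pos + size) >> log2_size.
 * Entries start with a counter of 0, the first lap writes 1, and so on.
 */
static uint64_t
demo_cnt(uint64_t pos, unsigned int log2_size)
{
	return (pos + (1ULL << log2_size)) >> log2_size;
}

/* Estimated days before an entry's counter repeats, if one entry is
 * written every 'cycles_per_access' cycles on a clock of 'hz' Hz.
 */
static double
demo_wrap_days(double cycles_per_access, uint64_t entries,
	       unsigned int counter_bits, double hz)
{
	double counter_values = 1.0;
	unsigned int i;

	for (i = 0; i < counter_bits; i++)
		counter_values *= 2.0;

	return cycles_per_access * (double)entries * counter_values /
	       hz / 86400.0;
}
```

For a 1024-entry ring with a 32-bit counter, demo_wrap_days(100, 1024, 32,
2e9) comes to roughly two and a half days; with a 64-bit counter the same
estimate is many orders of magnitude larger.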

The lock-free ring is enabled via a new flag, RING_F_LF. Because the ring's
memsize is now a function of its flags (the lock-free ring requires 128b
for each entry), this commit adds a new argument ('flags') to
rte_ring_get_memsize(). An API deprecation notice will be sent in a
separate commit.
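
A simplified sketch of the size calculation follows. demo_ring_memsize() is
a hypothetical stand-in for rte_ring_get_memsize(): it takes the header size
as a parameter, returns -1 rather than -EINVAL, and assumes a 64-byte cache
line.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_RING_F_LF 0x0008u	/* same value as RING_F_LF in this patch */
#define DEMO_CACHE_LINE 64	/* assumed cache line size */

/* Bytes needed for a ring header of 'header_sz' bytes plus 'count' entries.
 * A lock-free entry holds a pointer and a counter, so it is twice as wide.
 */
static long
demo_ring_memsize(size_t header_sz, unsigned int count, unsigned int flags)
{
	const size_t elt_sz = (flags & DEMO_RING_F_LF) ?
			2 * sizeof(void *) : sizeof(void *);
	size_t sz;

	/* count must be a non-zero power of 2 in this sketch */
	if (count == 0 || (count & (count - 1)) != 0)
		return -1;

	sz = header_sz + (size_t)count * elt_sz;

	/* round up to a whole number of cache lines */
	sz = (sz + DEMO_CACHE_LINE - 1) & ~(size_t)(DEMO_CACHE_LINE - 1);
	return (long)sz;
}
```

For the same count, the lock-free result exceeds the regular one by
count * sizeof(void *) bytes, up to cache-line rounding.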

For ease-of-use, existing ring enqueue and dequeue functions work on both
regular and lock-free rings. This introduces an additional branch in the
datapath, but this should be a highly predictable branch.
ring_perf_autotest shows a negligible performance impact; it's hard to
distinguish a real difference versus system noise.
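
The shape of that extra branch can be sketched as follows. This is not the
patch's code: the demo_* names and the call counters are made up for
illustration, and the worker functions only record which path was taken.

```c
#include <assert.h>

#define DEMO_RING_F_LF 0x0008u	/* same value as RING_F_LF in this patch */

/* Toy ring: only the fields needed to show the dispatch. */
struct demo_ring {
	unsigned int flags;
	unsigned int lf_calls;		/* times the lock-free path ran */
	unsigned int regular_calls;	/* times the regular path ran */
};

/* Placeholder for the lock-free enqueue worker. */
static unsigned int
demo_do_lf_enqueue(struct demo_ring *r, unsigned int n)
{
	r->lf_calls++;
	return n;
}

/* Placeholder for the regular enqueue worker. */
static unsigned int
demo_do_enqueue(struct demo_ring *r, unsigned int n)
{
	r->regular_calls++;
	return n;
}

/* One public entry point; the flag test is the additional branch. */
static inline unsigned int
demo_ring_enqueue_bulk(struct demo_ring *r, unsigned int n)
{
	if (r->flags & DEMO_RING_F_LF)
		return demo_do_lf_enqueue(r, n);
	return demo_do_enqueue(r, n);
}
```

A ring's flags are fixed when it is initialized, so for any given ring the
test always resolves the same way, which is why the branch is expected to be
highly predictable.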

                                  | ring_perf_autotest cycles with branch -
             Test                 |   ring_perf_autotest cycles without
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | -4.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 0.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | 0.00
SC empty dequeue                  | 1.00
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | 0.49
MP/MC bulk enq/dequeue (size 8)   | 0.08
SP/SC bulk enq/dequeue (size 32)  | 0.07
MP/MC bulk enq/dequeue (size 32)  | 0.09

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | 0.19
MP/MC bulk enq/dequeue (size 8)   | -0.37
SP/SC bulk enq/dequeue (size 32)  | 0.09
MP/MC bulk enq/dequeue (size 32)  | -0.05

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | -1.96
MP/MC bulk enq/dequeue (size 8)   | 0.88
SP/SC bulk enq/dequeue (size 32)  | 0.10
MP/MC bulk enq/dequeue (size 32)  | 0.46

Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. Each test was run three
times and the results were averaged.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.c           |  92 +++++++--
 lib/librte_ring/rte_ring.h           | 308 ++++++++++++++++++++++++++---
 lib/librte_ring/rte_ring_c11_mem.h   | 366 ++++++++++++++++++++++++++++++++++-
 lib/librte_ring/rte_ring_generic.h   | 354 +++++++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   7 +
 5 files changed, 1080 insertions(+), 47 deletions(-)

diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d215acecc..d4a176f57 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -45,9 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags)
 {
-	ssize_t sz;
+	ssize_t sz, elt_sz;
 
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
@@ -57,10 +57,23 @@ rte_ring_get_memsize(unsigned count)
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	elt_sz = (flags & RING_F_LF) ? 2 * sizeof(void *) : sizeof(void *);
+
+	sz = sizeof(struct rte_ring) + count * elt_sz;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
+BIND_DEFAULT_SYMBOL(rte_ring_get_memsize, _v1905, 19.05);
+MAP_STATIC_SYMBOL(ssize_t rte_ring_get_memsize(unsigned int count,
+					       unsigned int flags),
+		  rte_ring_get_memsize_v1905);
+
+ssize_t
+rte_ring_get_memsize_v20(unsigned int count)
+{
+	return rte_ring_get_memsize_v1905(count, 0);
+}
+VERSION_SYMBOL(rte_ring_get_memsize, _v20, 2.0);
 
 int
 rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
@@ -75,6 +88,8 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 			  RTE_CACHE_LINE_MASK) != 0);
 	RTE_BUILD_BUG_ON((offsetof(struct rte_ring, prod) &
 			  RTE_CACHE_LINE_MASK) != 0);
+	RTE_BUILD_BUG_ON(sizeof(struct rte_ring_lf_entry) !=
+			 2 * sizeof(void *));
 
 	/* init the ring structure */
 	memset(r, 0, sizeof(*r));
@@ -82,8 +97,6 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	if (ret < 0 || ret >= (int)sizeof(r->name))
 		return -ENAMETOOLONG;
 	r->flags = flags;
-	r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
-	r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
 
 	if (flags & RING_F_EXACT_SZ) {
 		r->size = rte_align32pow2(count + 1);
@@ -100,12 +113,46 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 		r->mask = count - 1;
 		r->capacity = r->mask;
 	}
-	r->prod.head = r->cons.head = 0;
-	r->prod.tail = r->cons.tail = 0;
+
+	r->log2_size = rte_log2_u64(r->size);
+
+	if (flags & RING_F_LF) {
+		uint32_t i;
+
+		r->prod_ptr.single =
+			(flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons_ptr.single =
+			(flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod_ptr.head = r->cons_ptr.head = 0;
+		r->prod_ptr.tail = r->cons_ptr.tail = 0;
+
+		for (i = 0; i < r->size; i++) {
+			struct rte_ring_lf_entry *ring_ptr, *base;
+
+			base = (struct rte_ring_lf_entry *)&r->ring;
+
+			ring_ptr = &base[i & r->mask];
+
+			ring_ptr->cnt = 0;
+		}
+	} else {
+		r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod.head = r->cons.head = 0;
+		r->prod.tail = r->cons.tail = 0;
+	}
 
 	return 0;
 }
 
+/* If a ring entry is written on average every M cycles, then a ring entry is
+ * reused every M*count cycles, and a ring entry's counter repeats every
+ * M*count*2^32 cycles. If M=100 on a 2GHz system, then a 1024-entry ring's
+ * counters would repeat every ~2.5 days. The likelihood of ABA occurring is
+ * considered sufficiently low for 1024-entry and larger rings.
+ */
+#define MIN_32_BIT_LF_RING_SIZE 1024
+
 /* create the ring */
 struct rte_ring *
 rte_ring_create(const char *name, unsigned count, int socket_id,
@@ -123,11 +170,25 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 
 	ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
 
+#if defined(RTE_ARCH_64) && !defined(RTE_ARCH_X86_64)
+	if (flags & RING_F_LF) {
+		printf("This platform does not support the atomic operation required for RING_F_LF\n");
+		rte_errno = EINVAL;
+		return NULL;
+	}
+#elif !defined(RTE_ARCH_64)
+	if ((flags & RING_F_LF) && count < MIN_32_BIT_LF_RING_SIZE) {
+		printf("RING_F_LF is only supported on 32-bit platforms for rings with at least 1024 entries.\n");
+		rte_errno = EINVAL;
+		return NULL;
+	}
+#endif
+
 	/* for an exact size ring, round up from count to a power of two */
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize(count, flags);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -227,10 +288,17 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
 	fprintf(f, "  flags=%x\n", r->flags);
 	fprintf(f, "  size=%"PRIu32"\n", r->size);
 	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
-	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
-	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
-	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
-	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	if (r->flags & RING_F_LF) {
+		fprintf(f, "  ct=%"PRIuPTR"\n", r->cons_ptr.tail);
+		fprintf(f, "  ch=%"PRIuPTR"\n", r->cons_ptr.head);
+		fprintf(f, "  pt=%"PRIuPTR"\n", r->prod_ptr.tail);
+		fprintf(f, "  ph=%"PRIuPTR"\n", r->prod_ptr.head);
+	} else {
+		fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
+		fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
+		fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
+		fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	}
 	fprintf(f, "  used=%u\n", rte_ring_count(r));
 	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
 }
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index f16d77b8a..200d7b2a0 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -20,7 +20,7 @@
  *
  * - FIFO (First In First Out)
  * - Maximum size is fixed; the pointers are stored in a table.
- * - Lockless implementation.
+ * - Lockless (and optionally, non-blocking/lock-free) implementation.
  * - Multi- or single-consumer dequeue.
  * - Multi- or single-producer enqueue.
  * - Bulk dequeue.
@@ -98,6 +98,7 @@ struct rte_ring {
 	const struct rte_memzone *memzone;
 			/**< Memzone, if any, containing the rte_ring */
 	uint32_t size;           /**< Size of ring. */
+	uint32_t log2_size;      /**< log2(size of ring) */
 	uint32_t mask;           /**< Mask (size-1) of ring. */
 	uint32_t capacity;       /**< Usable size of ring */
 
@@ -133,6 +134,18 @@ struct rte_ring {
  */
 #define RING_F_EXACT_SZ 0x0004
 #define RTE_RING_SZ_MASK  (0x7fffffffU) /**< Ring size mask */
+/**
+ * The ring uses lock-free enqueue and dequeue functions. These functions
+ * do not have the "non-preemptive" constraint of a regular rte ring, and thus
+ * are suited for applications using preemptible pthreads. However, the
+ * lock-free functions have worse average-case performance than their regular
+ * rte ring counterparts. When used as the handler for a mempool, per-thread
+ * caching can mitigate the performance difference by reducing the number (and
+ * contention) of ring accesses.
+ *
+ * This flag is only supported on 32-bit and x86_64 platforms.
+ */
+#define RING_F_LF 0x0008
 
 /* @internal defines for passing to the enqueue dequeue worker functions */
 #define __IS_SP 1
@@ -150,11 +163,15 @@ struct rte_ring {
  *
  * @param count
  *   The number of elements in the ring (must be a power of 2).
+ * @param flags
+ *   The flags the ring will be created with.
  * @return
  *   - The memory size needed for the ring on success.
  *   - -EINVAL if count is not a power of 2.
  */
-ssize_t rte_ring_get_memsize(unsigned count);
+ssize_t rte_ring_get_memsize(unsigned int count, unsigned int flags);
+ssize_t rte_ring_get_memsize_v20(unsigned int count);
+ssize_t rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags);
 
 /**
  * Initialize a ring structure.
@@ -187,6 +204,10 @@ ssize_t rte_ring_get_memsize(unsigned count);
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_LF: If this flag is set, the ring uses lock-free variants of the
+ *      dequeue and enqueue functions.
  * @return
  *   0 on success, or a negative value on error.
  */
@@ -222,12 +243,17 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_LF: If this flag is set, the ring uses lock-free variants of the
+ *      dequeue and enqueue functions.
  * @return
  *   On success, the pointer to the new allocated ring. NULL on error with
  *    rte_errno set appropriately. Possible errno values include:
  *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
  *    - E_RTE_SECONDARY - function was called from a secondary process instance
- *    - EINVAL - count provided is not a power of 2
+ *    - EINVAL - count provided is not a power of 2, or RING_F_LF is used on an
+ *      unsupported platform
  *    - ENOSPC - the maximum number of memzones has already been allocated
  *    - EEXIST - a memzone with the same name already exists
  *    - ENOMEM - no appropriate memory area found in which to create memzone
@@ -283,6 +309,50 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual enqueue of pointers on the lock-free ring, used by the
+ * single-producer lock-free enqueue function.
+ */
+#define ENQUEUE_PTRS_LF(r, base, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	size_t idx = prod_head & (r)->mask; \
+	size_t new_cnt = prod_head + size; \
+	struct rte_ring_lf_entry *ring = (struct rte_ring_lf_entry *)base; \
+	unsigned int mask = ~0x3; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & mask); i += 4, idx += 4) { \
+			ring[idx].ptr = obj_table[i]; \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx + 1].ptr = obj_table[i + 1]; \
+			ring[idx + 1].cnt = (new_cnt + i + 1) >> r->log2_size; \
+			ring[idx + 2].ptr = obj_table[i + 2]; \
+			ring[idx + 2].cnt = (new_cnt + i + 2) >> r->log2_size; \
+			ring[idx + 3].ptr = obj_table[i + 3]; \
+			ring[idx + 3].cnt = (new_cnt + i + 3) >> r->log2_size; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx++].ptr = obj_table[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) { \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+		for (idx = 0; i < n; i++, idx++) {    \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+	} \
+} while (0)
+
 /* the actual copy of pointers on the ring to obj_table.
  * Placed here since identical code needed in both
  * single and multi consumer dequeue functions */
@@ -314,6 +384,43 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual copy of pointers on the lock-free ring to obj_table. */
+#define DEQUEUE_PTRS_LF(r, base, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	size_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	struct rte_ring_lf_entry *ring = (struct rte_ring_lf_entry *)base; \
+	unsigned int mask = ~0x3; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & mask); i += 4, idx += 4) {\
+			obj_table[i] = ring[idx].ptr; \
+			obj_table[i + 1] = ring[idx + 1].ptr; \
+			obj_table[i + 2] = ring[idx + 2].ptr; \
+			obj_table[i + 3] = ring[idx + 3].ptr; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 2: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 1: \
+			obj_table[i++] = ring[idx++].ptr; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+	} \
+} while (0)
+
+
+/* @internal Lock-free ring entry (128 bits wide on 64-bit targets) */
+struct rte_ring_lf_entry {
+	void *ptr; /**< Data pointer */
+	uintptr_t cnt; /**< Modification counter */
+};
+
 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
  * There are 2 choices for the users
@@ -330,6 +437,70 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #endif
 
 /**
+ * @internal Enqueue several objects on the lock-free ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue(struct rte_ring *r, void * const *obj_table,
+			 unsigned int n, enum rte_ring_queue_behavior behavior,
+			 unsigned int is_sp, unsigned int *free_space)
+{
+	if (is_sp)
+		return __rte_ring_do_lf_enqueue_sp(r, obj_table, n,
+						   behavior, free_space);
+	else
+		return __rte_ring_do_lf_enqueue_mp(r, obj_table, n,
+						   behavior, free_space);
+}
+
+/**
+ * @internal Dequeue several objects from the lock-free ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue(struct rte_ring *r, void **obj_table,
+		 unsigned int n, enum rte_ring_queue_behavior behavior,
+		 unsigned int is_sc, unsigned int *available)
+{
+	if (is_sc)
+		return __rte_ring_do_lf_dequeue_sc(r, obj_table, n,
+						   behavior, available);
+	else
+		return __rte_ring_do_lf_dequeue_mc(r, obj_table, n,
+						   behavior, available);
+}
+
+/**
  * @internal Enqueue several objects on the ring
  *
  * @param r
@@ -436,8 +607,14 @@ static __rte_always_inline unsigned int
 rte_ring_mp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MP,
+					     free_space);
 }
 
 /**
@@ -459,8 +636,14 @@ static __rte_always_inline unsigned int
 rte_ring_sp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SP,
+					     free_space);
 }
 
 /**
@@ -486,8 +669,14 @@ static __rte_always_inline unsigned int
 rte_ring_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->prod_ptr.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -570,8 +759,14 @@ static __rte_always_inline unsigned int
 rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MC,
+					     available);
 }
 
 /**
@@ -594,8 +789,14 @@ static __rte_always_inline unsigned int
 rte_ring_sc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SC,
+					     available);
 }
 
 /**
@@ -621,8 +822,14 @@ static __rte_always_inline unsigned int
 rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
 		unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-				r->cons.single, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->cons_ptr.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->cons.single, available);
 }
 
 /**
@@ -697,9 +904,13 @@ rte_ring_dequeue(struct rte_ring *r, void **obj_p)
 static inline unsigned
 rte_ring_count(const struct rte_ring *r)
 {
-	uint32_t prod_tail = r->prod.tail;
-	uint32_t cons_tail = r->cons.tail;
-	uint32_t count = (prod_tail - cons_tail) & r->mask;
+	uint32_t count;
+
+	if (r->flags & RING_F_LF)
+		count = (r->prod_ptr.tail - r->cons_ptr.tail) & r->mask;
+	else
+		count = (r->prod.tail - r->cons.tail) & r->mask;
+
 	return (count > r->capacity) ? r->capacity : count;
 }
 
@@ -819,8 +1030,14 @@ static __rte_always_inline unsigned
 rte_ring_mp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MP, free_space);
 }
 
 /**
@@ -842,8 +1059,14 @@ static __rte_always_inline unsigned
 rte_ring_sp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SP, free_space);
 }
 
 /**
@@ -869,8 +1092,14 @@ static __rte_always_inline unsigned
 rte_ring_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_VARIABLE,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->prod_ptr.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -897,8 +1126,14 @@ static __rte_always_inline unsigned
 rte_ring_mc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MC, available);
 }
 
 /**
@@ -922,8 +1157,14 @@ static __rte_always_inline unsigned
 rte_ring_sc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SC, available);
 }
 
 /**
@@ -949,9 +1190,14 @@ static __rte_always_inline unsigned
 rte_ring_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-				RTE_RING_QUEUE_VARIABLE,
-				r->cons.single, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->cons_ptr.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->cons.single, available);
 }
 
 #ifdef __cplusplus
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 545caf257..a672d161e 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -221,8 +221,8 @@ __rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
+		/* load-acquire synchronize with store-release of tail in
+		 * __rte_ring_do_lf_dequeue_{sc, mc}.
 		 */
 		cons_tail = __atomic_load_n(&r->cons_ptr.tail,
 					__ATOMIC_ACQUIRE);
@@ -247,6 +247,7 @@ __rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
 					0, __ATOMIC_RELAXED,
 					__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+
 	return n;
 }
 
@@ -293,8 +294,8 @@ __rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
+		/* load-acquire synchronize with store-release of tail in
+		 * __rte_ring_do_lf_enqueue_{sp, mp}.
 		 */
 		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
 					__ATOMIC_ACQUIRE);
@@ -318,6 +319,363 @@ __rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
 							0, __ATOMIC_RELAXED,
 							__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	uintptr_t head, next;
+
+	n = __rte_ring_move_prod_head_ptr(r, 1, n, behavior,
+					  &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_LF(r, &r->ring, head, obj_table, n);
+
+	__atomic_store_n(&r->prod_ptr.tail,
+			 r->prod_ptr.tail + n,
+			 __ATOMIC_RELEASE);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/* This macro defines the number of times an enqueueing thread can fail to find
+ * a free ring slot before reloading its producer tail index.
+ */
+#define ENQ_RETRY_LIMIT 32
+
+/**
+ * @internal
+ *   Get the next producer tail index.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param idx
+ *   The local tail index
+ * @return
+ *   If the ring's tail is ahead of the local tail, return the shared tail.
+ *   Else, return idx + 1.
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_reload_tail(struct rte_ring *r, uintptr_t idx)
+{
+	uintptr_t fresh = __atomic_load_n(&r->prod_ptr.tail, __ATOMIC_RELAXED);
+
+	if ((intptr_t)(idx - fresh) < 0)
+		idx = fresh; /* fresh is after idx, use it instead */
+	else
+		idx++; /* Continue with next slot */
+
+	return idx;
+}
+
+/**
+ * @internal
+ *   Update the ring's producer tail index. If another thread already updated
+ *   the index beyond the caller's tail value, do nothing.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param val
+ *   The local tail index
+ * @return
+ *   If the shared tail is ahead of the local tail, return the shared tail.
+ *   Else, return val after storing it as the new shared tail.
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_lf_update_tail(struct rte_ring *r, uintptr_t val)
+{
+	volatile uintptr_t *loc = &r->prod_ptr.tail;
+	uintptr_t old = __atomic_load_n(loc, __ATOMIC_RELAXED);
+
+	do {
+		/* Check if the tail has already been updated. */
+		if ((intptr_t)(val - old) < 0)
+			return old;
+
+		/* Else val >= old, need to update *loc */
+	} while (!__atomic_compare_exchange_n(loc, &old, val,
+					      1, __ATOMIC_RELEASE,
+					      __ATOMIC_RELAXED));
+
+	return val;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+	printf("[%s()] RING_F_LF requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
+	       , __func__);
+	return 0;
+#else
+	struct rte_ring_lf_entry *base;
+	uintptr_t head, next, tail;
+	unsigned int i;
+	uint32_t avail;
+
+	/* Atomically update the prod head to reserve n slots. The prod tail
+	 * is modified at the end of the function.
+	 */
+	n = __rte_ring_move_prod_head_ptr(r, 0, n, behavior,
+					  &head, &next, &avail);
+
+	tail = __atomic_load_n(&r->prod_ptr.tail, __ATOMIC_RELAXED);
+	head = __atomic_load_n(&r->cons_ptr.tail, __ATOMIC_ACQUIRE);
+
+	if (unlikely(n == 0))
+		goto end;
+
+	base = (struct rte_ring_lf_entry *)&r->ring;
+
+	for (i = 0; i < n; i++) {
+		unsigned int retries = 0;
+		int success = 0;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		do {
+			struct rte_ring_lf_entry old_value, new_value;
+			struct rte_ring_lf_entry *ring_ptr;
+
+			ring_ptr = &base[tail & r->mask];
+
+			old_value = *ring_ptr;
+
+			if (old_value.cnt != (tail >> r->log2_size)) {
+				/* This slot has already been used. Depending
+				 * on how far behind this thread is, either go
+				 * to the next slot or reload the tail.
+				 */
+				uintptr_t prev_tail;
+
+				prev_tail = (tail + r->size) >> r->log2_size;
+
+				if (old_value.cnt != prev_tail ||
+				    ++retries == ENQ_RETRY_LIMIT) {
+					/* This thread either fell 2+ laps
+					 * behind or hit the retry limit, so
+					 * reload the tail index.
+					 */
+					tail = __rte_ring_reload_tail(r, tail);
+					retries = 0;
+				} else {
+					/* Slot already used, try the next. */
+					tail++;
+
+				}
+
+				continue;
+			}
+
+			/* Found a free slot, try to enqueue next element. */
+			new_value.ptr = obj_table[i];
+			new_value.cnt = (tail + r->size) >> r->log2_size;
+
+#ifdef RTE_ARCH_64
+			success = rte_atomic128_cmp_exchange(
+					(rte_int128_t *)ring_ptr,
+					(rte_int128_t *)&old_value,
+					(rte_int128_t *)&new_value,
+					1, __ATOMIC_RELEASE,
+					__ATOMIC_RELAXED);
+#else
+			success = __atomic_compare_exchange(
+					(uint64_t *)ring_ptr,
+					&old_value,
+					&new_value,
+					1, __ATOMIC_RELEASE,
+					__ATOMIC_RELAXED);
+#endif
+		} while (success == 0);
+
+		/* Only increment tail if the CAS succeeds, since it can
+		 * spuriously fail on some architectures.
+		 */
+		tail++;
+	}
+
+end:
+	tail = __rte_ring_lf_update_tail(r, tail);
+
+	if (free_space != NULL)
+		*free_space = avail - n;
+	return n;
+#endif
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = __atomic_load_n(&r->cons_ptr.tail, __ATOMIC_RELAXED);
+	prod_tail = __atomic_load_n(&r->prod_ptr.tail, __ATOMIC_ACQUIRE);
+
+	avail = prod_tail - cons_tail;
+
+	/* Set the actual entries for dequeue */
+	if (unlikely(avail < n))
+		n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+	if (unlikely(n == 0))
+		goto end;
+
+	DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+	/* Use a read barrier and store-relaxed so we don't unnecessarily order
+	 * writes.
+	 */
+	rte_smp_rmb();
+
+	__atomic_store_n(&r->cons_ptr.tail, cons_tail + n, __ATOMIC_RELAXED);
+end:
+	if (available != NULL)
+		*available = avail - n;
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = __atomic_load_n(&r->cons_ptr.tail, __ATOMIC_RELAXED);
+
+	do {
+		/* Load tail on every iteration to avoid spurious queue empty
+		 * situations.
+		 */
+		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
+					    __ATOMIC_ACQUIRE);
+
+		avail = prod_tail - cons_tail;
+
+		/* Set the actual entries for dequeue */
+		if (unlikely(avail < n))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+		if (unlikely(n == 0))
+			goto end;
+
+		DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+		/* Use a read barrier and store-relaxed so we don't
+		 * unnecessarily order writes.
+		 */
+		rte_smp_rmb();
+
+	} while (!__atomic_compare_exchange_n(&r->cons_ptr.tail,
+					      &cons_tail, cons_tail + n,
+					      0, __ATOMIC_RELAXED,
+					      __ATOMIC_RELAXED));
+
+end:
+	if (available != NULL)
+		*available = avail - n;
+
 	return n;
 }
 
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 6a0e1bbfb..944b353f4 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -297,4 +297,358 @@ __rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
 	return n;
 }
 
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	uintptr_t head, next;
+
+	n = __rte_ring_move_prod_head_ptr(r, 1, n, behavior,
+					  &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_LF(r, &r->ring, head, obj_table, n);
+
+	rte_smp_wmb();
+
+	r->prod_ptr.tail += n;
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/* This macro defines the number of times an enqueueing thread can fail to find
+ * a free ring slot before reloading its producer tail index.
+ */
+#define ENQ_RETRY_LIMIT 32
+
+/**
+ * @internal
+ *   Get the next producer tail index.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param idx
+ *   The local tail index
+ * @return
+ *   If the ring's tail is ahead of the local tail, return the shared tail.
+ *   Else, return idx + 1.
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_reload_tail(struct rte_ring *r, uintptr_t idx)
+{
+	uintptr_t fresh = r->prod_ptr.tail;
+
+	if ((intptr_t)(idx - fresh) < 0)
+		/* fresh is after idx, use it instead */
+		idx = fresh;
+	else
+		/* Continue with next slot */
+		idx++;
+
+	return idx;
+}
+
+/**
+ * @internal
+ *   Update the ring's producer tail index. If another thread already updated
+ *   the index beyond the caller's tail value, do nothing.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param val
+ *   The local tail index
+ * @return
+ *   If the shared tail is ahead of the local tail, return the shared tail.
+ *   Else, return val after storing it as the new shared tail.
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_lf_update_tail(struct rte_ring *r, uintptr_t val)
+{
+	volatile uintptr_t *loc = &r->prod_ptr.tail;
+	uintptr_t old = *loc;
+
+	do {
+		/* Check if the tail has already been updated. */
+		if ((intptr_t)(val - old) < 0)
+			return old;
+
+		/* Else val >= old, need to update *loc */
+	} while (!__sync_bool_compare_and_swap(loc, old, val));
+
+	return val;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+	printf("[%s()] RING_F_LF requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
+	       , __func__);
+	return 0;
+#else
+	struct rte_ring_lf_entry *base;
+	uintptr_t head, next, tail;
+	unsigned int i;
+	uint32_t avail;
+
+	/* Atomically update the prod head to reserve n slots. The prod tail
+	 * is modified at the end of the function.
+	 */
+	n = __rte_ring_move_prod_head_ptr(r, 0, n, behavior,
+					  &head, &next, &avail);
+
+	tail = r->prod_ptr.tail;
+
+	rte_smp_rmb();
+
+	head = r->cons_ptr.tail;
+
+	if (unlikely(n == 0))
+		goto end;
+
+	base = (struct rte_ring_lf_entry *)&r->ring;
+
+	for (i = 0; i < n; i++) {
+		unsigned int retries = 0;
+		int success = 0;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		do {
+			struct rte_ring_lf_entry old_value, new_value;
+			struct rte_ring_lf_entry *ring_ptr;
+
+			ring_ptr = &base[tail & r->mask];
+
+			old_value = *ring_ptr;
+
+			if (old_value.cnt != (tail >> r->log2_size)) {
+				/* This slot has already been used. Depending
+				 * on how far behind this thread is, either go
+				 * to the next slot or reload the tail.
+				 */
+				uintptr_t prev_tail;
+
+				prev_tail = (tail + r->size) >> r->log2_size;
+
+				if (old_value.cnt != prev_tail ||
+				    ++retries == ENQ_RETRY_LIMIT) {
+					/* This thread either fell 2+ laps
+					 * behind or hit the retry limit, so
+					 * reload the tail index.
+					 */
+					tail = __rte_ring_reload_tail(r, tail);
+					retries = 0;
+				} else {
+					/* Slot already used, try the next. */
+					tail++;
+
+				}
+
+				continue;
+			}
+
+			/* Found a free slot, try to enqueue next element. */
+			new_value.ptr = obj_table[i];
+			new_value.cnt = (tail + r->size) >> r->log2_size;
+
+#ifdef RTE_ARCH_64
+			success = rte_atomic128_cmp_exchange(
+					(rte_int128_t *)ring_ptr,
+					(rte_int128_t *)&old_value,
+					(rte_int128_t *)&new_value,
+					1, __ATOMIC_RELEASE,
+					__ATOMIC_RELAXED);
+#else
+			uint64_t *old_ptr = (uint64_t *)&old_value;
+			uint64_t *new_ptr = (uint64_t *)&new_value;
+
+			success = rte_atomic64_cmpset(
+					(volatile uint64_t *)ring_ptr,
+					*old_ptr, *new_ptr);
+#endif
+		} while (success == 0);
+
+		/* Only increment tail if the CAS succeeds, since it can
+		 * spuriously fail on some architectures.
+		 */
+		tail++;
+	}
+
+end:
+
+	tail = __rte_ring_lf_update_tail(r, tail);
+
+	if (free_space != NULL)
+		*free_space = avail - n;
+	return n;
+#endif
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = r->cons_ptr.tail;
+
+	rte_smp_rmb();
+
+	prod_tail = r->prod_ptr.tail;
+
+	avail = prod_tail - cons_tail;
+
+	/* Set the actual entries for dequeue */
+	if (unlikely(avail < n))
+		n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+	if (unlikely(n == 0))
+		goto end;
+
+	DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+	rte_smp_rmb();
+
+	r->cons_ptr.tail += n;
+end:
+	if (available != NULL)
+		*available = avail - n;
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = r->cons_ptr.tail;
+
+	do {
+		rte_smp_rmb();
+
+		/* Load tail on every iteration to avoid spurious queue empty
+		 * situations.
+		 */
+		prod_tail = r->prod_ptr.tail;
+
+		avail = prod_tail - cons_tail;
+
+		/* Set the actual entries for dequeue */
+		if (unlikely(avail < n))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+		if (unlikely(n == 0))
+			goto end;
+
+		DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+	} while (!__sync_bool_compare_and_swap(&r->cons_ptr.tail,
+					       cons_tail, cons_tail + n));
+
+end:
+	if (available != NULL)
+		*available = avail - n;
+
+	return n;
+}
+
 #endif /* _RTE_RING_GENERIC_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index d935efd0d..8969467af 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -17,3 +17,10 @@ DPDK_2.2 {
 	rte_ring_free;
 
 } DPDK_2.0;
+
+DPDK_19.05 {
+	global:
+
+	rte_ring_get_memsize;
+
+} DPDK_2.2;
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v6 4/6] test_ring: add lock-free ring autotest
  2019-03-06 15:03         ` [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
                             ` (2 preceding siblings ...)
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 3/6] ring: add a lock-free implementation Gage Eads
@ 2019-03-06 15:03           ` Gage Eads
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 5/6] test_ring_perf: add lock-free ring perf test Gage Eads
                             ` (2 subsequent siblings)
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-06 15:03 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

ring_lf_autotest re-uses the ring_autotest code by wrapping its top-level
function with one that takes a 'flags' argument.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 app/test/test_ring.c | 61 ++++++++++++++++++++++++++++++++--------------------
 1 file changed, 38 insertions(+), 23 deletions(-)

diff --git a/app/test/test_ring.c b/app/test/test_ring.c
index aaf1e70ad..400b1bffd 100644
--- a/app/test/test_ring.c
+++ b/app/test/test_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <string.h>
@@ -601,18 +601,20 @@ test_ring_burst_basic(struct rte_ring *r)
  * it will always fail to create ring with a wrong ring size number in this function
  */
 static int
-test_ring_creation_with_wrong_size(void)
+test_ring_creation_with_wrong_size(unsigned int flags)
 {
 	struct rte_ring * rp = NULL;
 
 	/* Test if ring size is not power of 2 */
-	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", RING_SIZE + 1,
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
 
 	/* Test if ring size is exceeding the limit */
-	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1), SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test_bad_ring_size", (RTE_RING_SZ_MASK + 1),
+			     SOCKET_ID_ANY, flags);
 	if (NULL != rp) {
 		return -1;
 	}
@@ -623,11 +625,11 @@ test_ring_creation_with_wrong_size(void)
  * it tests if it would always fail to create ring with an used ring name
  */
 static int
-test_ring_creation_with_an_used_name(void)
+test_ring_creation_with_an_used_name(unsigned int flags)
 {
 	struct rte_ring * rp;
 
-	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	rp = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (NULL != rp)
 		return -1;
 
@@ -639,10 +641,10 @@ test_ring_creation_with_an_used_name(void)
  * function to fail correctly
  */
 static int
-test_create_count_odd(void)
+test_create_count_odd(unsigned int flags)
 {
 	struct rte_ring *r = rte_ring_create("test_ring_count",
-			4097, SOCKET_ID_ANY, 0 );
+			4097, SOCKET_ID_ANY, flags);
 	if(r != NULL){
 		return -1;
 	}
@@ -665,7 +667,7 @@ test_lookup_null(void)
  * it tests some more basic ring operations
  */
 static int
-test_ring_basic_ex(void)
+test_ring_basic_ex(unsigned int flags)
 {
 	int ret = -1;
 	unsigned i;
@@ -679,7 +681,7 @@ test_ring_basic_ex(void)
 	}
 
 	rp = rte_ring_create("test_ring_basic_ex", RING_SIZE, SOCKET_ID_ANY,
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (rp == NULL) {
 		printf("test_ring_basic_ex fail to create ring\n");
 		goto fail_test;
@@ -737,22 +739,22 @@ test_ring_basic_ex(void)
 }
 
 static int
-test_ring_with_exact_size(void)
+test_ring_with_exact_size(unsigned int flags)
 {
 	struct rte_ring *std_ring = NULL, *exact_sz_ring = NULL;
-	void *ptr_array[16];
+	void *ptr_array[1024];
 	static const unsigned int ring_sz = RTE_DIM(ptr_array);
 	unsigned int i;
 	int ret = -1;
 
 	std_ring = rte_ring_create("std", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ);
+			RING_F_SP_ENQ | RING_F_SC_DEQ | flags);
 	if (std_ring == NULL) {
 		printf("%s: error, can't create std ring\n", __func__);
 		goto end;
 	}
 	exact_sz_ring = rte_ring_create("exact sz", ring_sz, rte_socket_id(),
-			RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ);
+		RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ | flags);
 	if (exact_sz_ring == NULL) {
 		printf("%s: error, can't create exact size ring\n", __func__);
 		goto end;
@@ -770,7 +772,7 @@ test_ring_with_exact_size(void)
 	}
 	/*
 	 * check that the exact_sz_ring can hold one more element than the
-	 * standard ring. (16 vs 15 elements)
+	 * standard ring. (1024 vs 1023 elements)
 	 */
 	for (i = 0; i < ring_sz - 1; i++) {
 		rte_ring_enqueue(std_ring, NULL);
@@ -808,17 +810,17 @@ test_ring_with_exact_size(void)
 }
 
 static int
-test_ring(void)
+__test_ring(unsigned int flags)
 {
 	struct rte_ring *r = NULL;
 
 	/* some more basic operations */
-	if (test_ring_basic_ex() < 0)
+	if (test_ring_basic_ex(flags) < 0)
 		goto test_fail;
 
 	rte_atomic32_init(&synchro);
 
-	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, 0);
+	r = rte_ring_create("test", RING_SIZE, SOCKET_ID_ANY, flags);
 	if (r == NULL)
 		goto test_fail;
 
@@ -837,27 +839,27 @@ test_ring(void)
 		goto test_fail;
 
 	/* basic operations */
-	if ( test_create_count_odd() < 0){
+	if (test_create_count_odd(flags) < 0) {
 		printf("Test failed to detect odd count\n");
 		goto test_fail;
 	} else
 		printf("Test detected odd count\n");
 
-	if ( test_lookup_null() < 0){
+	if (test_lookup_null() < 0) {
 		printf("Test failed to detect NULL ring lookup\n");
 		goto test_fail;
 	} else
 		printf("Test detected NULL ring lookup\n");
 
 	/* test of creating ring with wrong size */
-	if (test_ring_creation_with_wrong_size() < 0)
+	if (test_ring_creation_with_wrong_size(flags) < 0)
 		goto test_fail;
 
 	/* test of creation ring with an used name */
-	if (test_ring_creation_with_an_used_name() < 0)
+	if (test_ring_creation_with_an_used_name(flags) < 0)
 		goto test_fail;
 
-	if (test_ring_with_exact_size() < 0)
+	if (test_ring_with_exact_size(flags) < 0)
 		goto test_fail;
 
 	/* dump the ring status */
@@ -873,4 +875,17 @@ test_ring(void)
 	return -1;
 }
 
+static int
+test_ring(void)
+{
+	return __test_ring(0);
+}
+
+static int
+test_lf_ring(void)
+{
+	return __test_ring(RING_F_LF);
+}
+
 REGISTER_TEST_COMMAND(ring_autotest, test_ring);
+REGISTER_TEST_COMMAND(ring_lf_autotest, test_lf_ring);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v6 5/6] test_ring_perf: add lock-free ring perf test
  2019-03-06 15:03         ` [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
                             ` (3 preceding siblings ...)
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 4/6] test_ring: add lock-free ring autotest Gage Eads
@ 2019-03-06 15:03           ` Gage Eads
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 6/6] mempool/ring: add lock-free ring handlers Gage Eads
  2019-03-18 21:35           ` [dpdk-dev] [PATCH v7 0/6] Add lock-free ring and mempool handler Gage Eads
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-06 15:03 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

ring_lf_perf_autotest re-uses the ring_perf_autotest code by wrapping its
top-level function with one that takes a 'flags' argument.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 app/test/test_ring_perf.c | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c
index ebb3939f5..be465c758 100644
--- a/app/test/test_ring_perf.c
+++ b/app/test/test_ring_perf.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2014 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 
@@ -363,12 +363,12 @@ test_bulk_enqueue_dequeue(struct rte_ring *r)
 }
 
 static int
-test_ring_perf(void)
+__test_ring_perf(unsigned int flags)
 {
 	struct lcore_pair cores;
 	struct rte_ring *r = NULL;
 
-	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), 0);
+	r = rte_ring_create(RING_NAME, RING_SIZE, rte_socket_id(), flags);
 	if (r == NULL)
 		return -1;
 
@@ -398,4 +398,17 @@ test_ring_perf(void)
 	return 0;
 }
 
+static int
+test_ring_perf(void)
+{
+	return __test_ring_perf(0);
+}
+
+static int
+test_lf_ring_perf(void)
+{
+	return __test_ring_perf(RING_F_LF);
+}
+
 REGISTER_TEST_COMMAND(ring_perf_autotest, test_ring_perf);
+REGISTER_TEST_COMMAND(ring_lf_perf_autotest, test_lf_ring_perf);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v6 6/6] mempool/ring: add lock-free ring handlers
  2019-03-06 15:03         ` [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
                             ` (4 preceding siblings ...)
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 5/6] test_ring_perf: add lock-free ring perf test Gage Eads
@ 2019-03-06 15:03           ` Gage Eads
  2019-03-18 21:35           ` [dpdk-dev] [PATCH v7 0/6] Add lock-free ring and mempool handler Gage Eads
  6 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-06 15:03 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl

These handlers allow an application to create a mempool based on the
lock-free ring, with any combination of single/multi producer/consumer.

Also, add a note to the programmer's guide's "known issues" section.

Signed-off-by: Gage Eads <gage.eads@intel.com>
Acked-by: Andrew Rybchenko <arybchenko@solarflare.com>
---
 doc/guides/prog_guide/env_abstraction_layer.rst | 10 +++++
 drivers/mempool/ring/Makefile                   |  1 +
 drivers/mempool/ring/meson.build                |  2 +
 drivers/mempool/ring/rte_mempool_ring.c         | 58 +++++++++++++++++++++++--
 4 files changed, 68 insertions(+), 3 deletions(-)

diff --git a/doc/guides/prog_guide/env_abstraction_layer.rst b/doc/guides/prog_guide/env_abstraction_layer.rst
index 929d76dba..2e2516465 100644
--- a/doc/guides/prog_guide/env_abstraction_layer.rst
+++ b/doc/guides/prog_guide/env_abstraction_layer.rst
@@ -541,6 +541,16 @@ Known Issues
 
   5. It MUST not be used by multi-producer/consumer pthreads, whose scheduling policies are SCHED_FIFO or SCHED_RR.
 
+  Alternatively, 32-bit and x86_64 applications can use the lock-free ring
+  mempool handler. When considering it, note that:
+
+  - Among 64-bit architectures it is currently limited to the x86_64 platform,
+    because it uses a function (16-byte compare-and-swap) that is not yet
+    available on other platforms.
+  - It has worse average-case performance than the non-preemptive rte_ring, but
+    software caching (e.g. the mempool cache) can mitigate this by reducing the
+    number of handler operations.
+
 + rte_timer
 
   Running  ``rte_timer_manage()`` on a non-EAL pthread is not allowed. However, resetting/stopping the timer from a non-EAL pthread is allowed.
diff --git a/drivers/mempool/ring/Makefile b/drivers/mempool/ring/Makefile
index ddab522fe..012ba6966 100644
--- a/drivers/mempool/ring/Makefile
+++ b/drivers/mempool/ring/Makefile
@@ -10,6 +10,7 @@ LIB = librte_mempool_ring.a
 
 CFLAGS += -O3
 CFLAGS += $(WERROR_FLAGS)
+CFLAGS += -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal -lrte_mempool -lrte_ring
 
 EXPORT_MAP := rte_mempool_ring_version.map
diff --git a/drivers/mempool/ring/meson.build b/drivers/mempool/ring/meson.build
index a021e908c..b1cb673cc 100644
--- a/drivers/mempool/ring/meson.build
+++ b/drivers/mempool/ring/meson.build
@@ -1,4 +1,6 @@
 # SPDX-License-Identifier: BSD-3-Clause
 # Copyright(c) 2017 Intel Corporation
 
+allow_experimental_apis = true
+
 sources = files('rte_mempool_ring.c')
diff --git a/drivers/mempool/ring/rte_mempool_ring.c b/drivers/mempool/ring/rte_mempool_ring.c
index bc123fc52..48041ae69 100644
--- a/drivers/mempool/ring/rte_mempool_ring.c
+++ b/drivers/mempool/ring/rte_mempool_ring.c
@@ -1,5 +1,5 @@
 /* SPDX-License-Identifier: BSD-3-Clause
- * Copyright(c) 2010-2016 Intel Corporation
+ * Copyright(c) 2010-2019 Intel Corporation
  */
 
 #include <stdio.h>
@@ -47,11 +47,11 @@ common_ring_get_count(const struct rte_mempool *mp)
 
 
 static int
-common_ring_alloc(struct rte_mempool *mp)
+__common_ring_alloc(struct rte_mempool *mp, int rg_flags)
 {
-	int rg_flags = 0, ret;
 	char rg_name[RTE_RING_NAMESIZE];
 	struct rte_ring *r;
+	int ret;
 
 	ret = snprintf(rg_name, sizeof(rg_name),
 		RTE_MEMPOOL_MZ_FORMAT, mp->name);
@@ -82,6 +82,18 @@ common_ring_alloc(struct rte_mempool *mp)
 	return 0;
 }
 
+static int
+common_ring_alloc(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, 0);
+}
+
+static int
+common_ring_alloc_lf(struct rte_mempool *mp)
+{
+	return __common_ring_alloc(mp, RING_F_LF);
+}
+
 static void
 common_ring_free(struct rte_mempool *mp)
 {
@@ -130,7 +142,47 @@ static const struct rte_mempool_ops ops_sp_mc = {
 	.get_count = common_ring_get_count,
 };
 
+static const struct rte_mempool_ops ops_mp_mc_lf = {
+	.name = "ring_mp_mc_lf",
+	.alloc = common_ring_alloc_lf,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_sc_lf = {
+	.name = "ring_sp_sc_lf",
+	.alloc = common_ring_alloc_lf,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_mp_sc_lf = {
+	.name = "ring_mp_sc_lf",
+	.alloc = common_ring_alloc_lf,
+	.free = common_ring_free,
+	.enqueue = common_ring_mp_enqueue,
+	.dequeue = common_ring_sc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
+static const struct rte_mempool_ops ops_sp_mc_lf = {
+	.name = "ring_sp_mc_lf",
+	.alloc = common_ring_alloc_lf,
+	.free = common_ring_free,
+	.enqueue = common_ring_sp_enqueue,
+	.dequeue = common_ring_mc_dequeue,
+	.get_count = common_ring_get_count,
+};
+
 MEMPOOL_REGISTER_OPS(ops_mp_mc);
 MEMPOOL_REGISTER_OPS(ops_sp_sc);
 MEMPOOL_REGISTER_OPS(ops_mp_sc);
 MEMPOOL_REGISTER_OPS(ops_sp_mc);
+MEMPOOL_REGISTER_OPS(ops_mp_mc_lf);
+MEMPOOL_REGISTER_OPS(ops_sp_sc_lf);
+MEMPOOL_REGISTER_OPS(ops_mp_sc_lf);
+MEMPOOL_REGISTER_OPS(ops_sp_mc_lf);
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v7 0/6] Add lock-free ring and mempool handler
  2019-03-06 15:03         ` [dpdk-dev] [PATCH v6 0/6] Add lock-free ring and mempool handler Gage Eads
                             ` (5 preceding siblings ...)
  2019-03-06 15:03           ` [dpdk-dev] [PATCH v6 6/6] mempool/ring: add lock-free ring handlers Gage Eads
@ 2019-03-18 21:35           ` Gage Eads
  2019-03-18 21:35             ` Gage Eads
                               ` (8 more replies)
  6 siblings, 9 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-18 21:35 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl, gage.eads

For some users, the rte ring's "non-preemptive" constraint is not acceptable;
for example, if the application uses a mixture of pinned high-priority threads
and multiplexed low-priority threads that share a mempool.

This patchset introduces a lock-free ring and a mempool based on it. The
lock-free algorithm relies on a double-pointer compare-and-swap, so for 64-bit
architectures it is currently limited to x86_64.

The ring uses more compare-and-swap atomic operations than the regular rte ring:
With no contention, an enqueue of n pointers uses (1 + n) CAS operations and a
dequeue of n pointers uses 1. This algorithm has worse average-case performance
than the regular rte ring (particularly a highly-contended ring with large bulk
accesses), however:
- For applications with preemptible pthreads, the regular rte ring's worst-case
  performance (i.e. one thread being preempted in the update_tail() critical
  section) is much worse than the lock-free ring's.
- Software caching can mitigate the average case performance for ring-based
  algorithms. For example, a lock-free ring based mempool (a likely use case
  for this ring) with per-thread caching.
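
To illustrate the "dequeue uses 1 CAS" claim above, here is a minimal sketch of
an optimistic multi-consumer dequeue written with C11 atomics. It is not the
code from this patchset: the names (sketch_ring, sketch_enqueue_sp,
sketch_dequeue_mc) and the fixed 8-entry size are hypothetical, and it omits
the per-slot modification counters and the double-pointer CAS that the real
multi-producer enqueue relies on. The consumer copies entries speculatively
and only claims them with a single compare-and-swap on the consumer tail.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RING_SIZE 8u /* hypothetical size; must be a power of two */
#define SKETCH_RING_MASK (SKETCH_RING_SIZE - 1)

struct sketch_ring {
	_Atomic uintptr_t prod_tail; /* next index the producer will write */
	_Atomic uintptr_t cons_tail; /* next index a consumer will read */
	_Atomic(void *) slots[SKETCH_RING_SIZE];
};

/* Single-producer enqueue, present only so the sketch can fill the ring. */
static unsigned int
sketch_enqueue_sp(struct sketch_ring *r, void *const *objs, unsigned int n)
{
	uintptr_t tail = atomic_load_explicit(&r->prod_tail,
					      memory_order_relaxed);
	uintptr_t cons = atomic_load_explicit(&r->cons_tail,
					      memory_order_acquire);
	uintptr_t free_entries = SKETCH_RING_SIZE - (tail - cons);
	unsigned int i;

	if (n > free_entries)
		n = (unsigned int)free_entries;

	for (i = 0; i < n; i++)
		atomic_store_explicit(&r->slots[(tail + i) & SKETCH_RING_MASK],
				      objs[i], memory_order_relaxed);

	/* Publish the new entries to consumers. */
	atomic_store_explicit(&r->prod_tail, tail + n, memory_order_release);
	return n;
}

/* Multi-consumer dequeue: copy first, then claim with one CAS. */
static unsigned int
sketch_dequeue_mc(struct sketch_ring *r, void **objs, unsigned int n)
{
	const unsigned int max = n;
	uintptr_t cons = atomic_load_explicit(&r->cons_tail,
					      memory_order_relaxed);
	unsigned int i;

	do {
		/* Reload the producer tail on every attempt. */
		uintptr_t prod = atomic_load_explicit(&r->prod_tail,
						      memory_order_acquire);
		uintptr_t avail = prod - cons;

		n = max;
		if (avail < n)
			n = (unsigned int)avail;
		if (n == 0)
			return 0;

		/* Speculative copy; discarded if the CAS below fails. */
		for (i = 0; i < n; i++)
			objs[i] = atomic_load_explicit(
				&r->slots[(cons + i) & SKETCH_RING_MASK],
				memory_order_relaxed);

		/* On failure, 'cons' is refreshed with the current tail. */
	} while (!atomic_compare_exchange_weak_explicit(&r->cons_tail, &cons,
							cons + n,
							memory_order_acq_rel,
							memory_order_relaxed));
	return n;
}
```

A consumer that is preempted between the copy and the CAS does not block other
consumers: they advance cons_tail, and the preempted thread's CAS fails and
retries when it resumes.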

The lock-free ring is enabled via a new flag, RING_F_LF. For ease-of-use,
existing ring enqueue/dequeue functions work with both standard and lock-free
rings. This is also an experimental API, so RING_F_LF users must build with the
ALLOW_EXPERIMENTAL_API flag.

This patchset also adds lock-free versions of ring_autotest and
ring_perf_autotest, and a lock-free ring based mempool.

This patchset makes one API change; a deprecation notice was posted in a
separate commit[1].

This patchset depends on the 128-bit compare-and-set patch[2].

[1] http://mails.dpdk.org/archives/dev/2019-February/124321.html
[2] http://mails.dpdk.org/archives/dev/2019-March/125751.html

v7:
- Added ARM copyright to rte_ring_generic.h and rte_ring_c11_mem.h, since the
  lock-free algorithm is based on ARM's lfring (see v5 notes)
- Rename __rte_ring_reload_tail() -> __rte_ring_lf_load_tail()
- Remove the unused return value from __rte_ring_lf_load_tail()
- Rename 'prev_tail' to 'next_tail' in the multi-producer lock-free enqueue

v6:
- Rebase patchset onto master (test/test/ -> app/test/)

v5:
 - Incorporated lfring's enqueue and dequeue logic from
   http://mails.dpdk.org/archives/dev/2019-January/124242.html
 - Renamed non-blocking -> lock-free and NB -> LF to align with a similar
   change in the lock-free stack patchset:
   http://mails.dpdk.org/archives/dev/2019-March/125797.html
 - Added support for 32-bit architectures by using the full 32b of the
   modification counter and requiring LF rings on these architectures to be at
   least 1024 entries.
 - Updated to the latest rte_atomic128_cmp_exchange() interface.
 - Added ring start marker to struct rte_ring

v4:
 - Split out nb_enqueue and nb_dequeue functions in generic and C11 versions,
   with the necessary memory ordering behavior for weakly consistent machines.
 - Convert size_t variables (from v2) to uint64_t and no-longer-applicable
   comment about variably-sized ring indexes.
 - Fix bug in nb_enqueue_mp that breaks the non-blocking guarantee.
 - Split the ring_ptr cast into two lines.
 - Change the dependent patchset from the non-blocking stack patch series
   to one only containing the 128b CAS commit

v3:
 - Avoid the ABI break by putting 64-bit head and tail values in the same
   cacheline as struct rte_ring's prod and cons members.
 - Don't attempt to compile rte_atomic128_cmpset without
   ALLOW_EXPERIMENTAL_API, as this would break a large number of libraries.
 - Add a helpful warning to __rte_ring_do_nb_enqueue_mp() in case someone tries
   to use RING_F_NB without the ALLOW_EXPERIMENTAL_API flag.
 - Update the ring mempool to use experimental APIs
 - Clarify that RING_F_NB is only limited to x86_64 currently; e.g. ARMv8 has the
   ISA support for 128-bit CAS to eventually support it.

v2:
 - Merge separate docs commit into patch #5
 - Convert uintptr_t to size_t
 - Add a compile-time check for the size of size_t
 - Fix a space-after-typecast issue
 - Fix an unnecessary-parentheses checkpatch warning
 - Bump librte_ring's library version

Gage Eads (6):
  ring: add a pointer-width headtail structure
  ring: add a ring start marker
  ring: add a lock-free implementation
  test_ring: add lock-free ring autotest
  test_ring_perf: add lock-free ring perf test
  mempool/ring: add lock-free ring handlers

 app/test/test_ring.c                            |  61 +--
 app/test/test_ring_perf.c                       |  19 +-
 doc/guides/prog_guide/env_abstraction_layer.rst |  10 +
 drivers/mempool/ring/Makefile                   |   1 +
 drivers/mempool/ring/meson.build                |   2 +
 drivers/mempool/ring/rte_mempool_ring.c         |  58 ++-
 lib/librte_ring/rte_ring.c                      |  92 ++++-
 lib/librte_ring/rte_ring.h                      | 334 ++++++++++++++--
 lib/librte_ring/rte_ring_c11_mem.h              | 501 ++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h              | 485 ++++++++++++++++++++++-
 lib/librte_ring/rte_ring_version.map            |   7 +
 11 files changed, 1492 insertions(+), 78 deletions(-)

-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v7 1/6] ring: add a pointer-width headtail structure
  2019-03-18 21:35           ` [dpdk-dev] [PATCH v7 0/6] Add lock-free ring and mempool handler Gage Eads
  2019-03-18 21:35             ` Gage Eads
@ 2019-03-18 21:35             ` Gage Eads
  2019-03-18 21:35               ` Gage Eads
  2019-03-18 21:35             ` [dpdk-dev] [PATCH v7 2/6] ring: add a ring start marker Gage Eads
                               ` (6 subsequent siblings)
  8 siblings, 1 reply; 123+ messages in thread
From: Gage Eads @ 2019-03-18 21:35 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl, gage.eads

For 64-bit systems, at current CPU speeds, 64-bit head and tail indexes
will not wrap around within the author's lifetime. This is important for
avoiding the ABA problem -- in which a thread mistakes reading the same
tail index in two accesses to mean that the ring was not modified in the
intervening time -- in the upcoming lock-free ring implementation. Using a
64-bit index makes the possibility of this occurring effectively zero. This
commit uses pointer-width indexes so the lock-free ring can support 32-bit
systems as well.
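
As a rough back-of-the-envelope check of the wrap-around claim, the sketch
below computes how long an index of a given width takes to wrap. The helper
names and the rate of one billion index increments per second are assumptions
made for illustration, not measurements from this patchset.

```c
#define SECONDS_PER_YEAR (365.25 * 24.0 * 3600.0)

/* Seconds until an index with 'bits' bits wraps, assuming it advances
 * 'ops_per_sec' times per second (a hypothetical, constant rate).
 */
static double
index_wrap_seconds(unsigned int bits, double ops_per_sec)
{
	double span = 1.0;
	unsigned int i;

	/* span = 2^bits, computed in floating point to avoid overflow */
	for (i = 0; i < bits; i++)
		span *= 2.0;

	return span / ops_per_sec;
}

/* Same quantity expressed in years. */
static double
index_wrap_years(unsigned int bits, double ops_per_sec)
{
	return index_wrap_seconds(bits, ops_per_sec) / SECONDS_PER_YEAR;
}
```

At the assumed rate, index_wrap_seconds(32, 1e9) is about 4.3 seconds, while
index_wrap_years(64, 1e9) is well over 500 years.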

This commit places the new producer and consumer structures in the same
location in struct rte_ring as their 32-bit counterparts. Since the 32-bit
versions are padded out to a cache line, there is space for the new
structure without affecting the layout of struct rte_ring. Thus, the ABI is
preserved.
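
The layout argument can be checked at compile time. The sketch below uses
simplified stand-ins for the real structures (ring_old, ring_new, ht_slot and
the 64-byte cache line are assumptions for illustration; the real struct
rte_ring has more members and uses anonymous unions) and asserts that
overlaying the pointer-width structure on the 32-bit one moves nothing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_CACHE_LINE 64 /* assumed cache line size */

struct headtail {
	volatile uint32_t head;
	volatile uint32_t tail;
	uint32_t single;
};

struct headtail_ptr {
	volatile uintptr_t head;
	volatile uintptr_t tail;
	uint32_t single;
};

/* Both variants share one cache-line-aligned slot. */
union ht_slot {
	struct headtail h32;
	struct headtail_ptr hptr;
};

/* Simplified "before" layout: 32-bit head/tail only. */
struct ring_old {
	_Alignas(SKETCH_CACHE_LINE) struct headtail prod;
	_Alignas(SKETCH_CACHE_LINE) char pad1;
	_Alignas(SKETCH_CACHE_LINE) struct headtail cons;
	_Alignas(SKETCH_CACHE_LINE) char pad2;
};

/* Simplified "after" layout: union of 32-bit and pointer-width variants. */
struct ring_new {
	_Alignas(SKETCH_CACHE_LINE) union ht_slot prod;
	_Alignas(SKETCH_CACHE_LINE) char pad1;
	_Alignas(SKETCH_CACHE_LINE) union ht_slot cons;
	_Alignas(SKETCH_CACHE_LINE) char pad2;
};

/* The wider structure must still fit in the padded cache line. */
_Static_assert(sizeof(union ht_slot) <= SKETCH_CACHE_LINE,
	       "headtail union exceeds one cache line");
_Static_assert(offsetof(struct ring_new, pad1) ==
	       offsetof(struct ring_old, pad1), "pad1 moved");
_Static_assert(offsetof(struct ring_new, pad2) ==
	       offsetof(struct ring_old, pad2), "pad2 moved");

/* Runtime form of the same check, for use in a unit test. */
static int
layout_is_preserved(void)
{
	return sizeof(struct ring_new) == sizeof(struct ring_old) &&
	       offsetof(struct ring_new, cons) ==
	       offsetof(struct ring_old, cons);
}
```

If a future change grew the pointer-width structure past one cache line, the
first static assertion would fail at build time rather than silently shifting
the members that follow.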

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.h         |  21 +++++-
 lib/librte_ring/rte_ring_c11_mem.h | 143 +++++++++++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h | 130 +++++++++++++++++++++++++++++++++
 3 files changed, 291 insertions(+), 3 deletions(-)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index af5444a9f..c78db6916 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -70,6 +70,13 @@ struct rte_ring_headtail {
 	uint32_t single;         /**< True if single prod/cons */
 };
 
+/* Structure to hold a pair of pointer-sized head/tail values and metadata */
+struct rte_ring_headtail_ptr {
+	volatile uintptr_t head; /**< Prod/consumer head. */
+	volatile uintptr_t tail; /**< Prod/consumer tail. */
+	uint32_t single;         /**< True if single prod/cons */
+};
+
 /**
  * An RTE ring structure.
  *
@@ -97,11 +104,19 @@ struct rte_ring {
 	char pad0 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring producer status. */
-	struct rte_ring_headtail prod __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail prod __rte_cache_aligned;
+		struct rte_ring_headtail_ptr prod_ptr __rte_cache_aligned;
+	};
 	char pad1 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring consumer status. */
-	struct rte_ring_headtail cons __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail cons __rte_cache_aligned;
+		struct rte_ring_headtail_ptr cons_ptr __rte_cache_aligned;
+	};
 	char pad2 __rte_cache_aligned; /**< empty cache line */
 };
 
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a337..545caf257 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -178,4 +178,147 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	uintptr_t cons_tail;
+	unsigned int max = n;
+	int success;
+
+	*old_head = __atomic_load_n(&r->prod_ptr.head, __ATOMIC_RELAXED);
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		cons_tail = __atomic_load_n(&r->cons_ptr.tail,
+					__ATOMIC_ACQUIRE);
+
+		*free_entries = (capacity + cons_tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_ptr.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head is updated */
+			success = __atomic_compare_exchange_n(&r->prod_ptr.head,
+					old_head, *new_head,
+					0, __ATOMIC_RELAXED,
+					__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	uintptr_t prod_tail;
+	int success;
+
+	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons_ptr.head, __ATOMIC_RELAXED);
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
+					__ATOMIC_ACQUIRE);
+
+		*entries = (prod_tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_ptr.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head will be updated */
+			success = __atomic_compare_exchange_n(&r->cons_ptr.head,
+							old_head, *new_head,
+							0, __ATOMIC_RELAXED,
+							__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_C11_MEM_H_ */
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..6a0e1bbfb 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -167,4 +167,134 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		*old_head = r->prod_ptr.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		*free_entries = (capacity + r->cons_ptr.tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_ptr.head = *new_head, success = 1;
+		else
+			success = __sync_bool_compare_and_swap(
+					&r->prod_ptr.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		*old_head = r->cons_ptr.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		*entries = (r->prod_ptr.tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_ptr.head = *new_head, success = 1;
+		else
+			success = __sync_bool_compare_and_swap(
+					&r->cons_ptr.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_GENERIC_H_ */
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v7 1/6] ring: add a pointer-width headtail structure
  2019-03-18 21:35             ` [dpdk-dev] [PATCH v7 1/6] ring: add a pointer-width headtail structure Gage Eads
@ 2019-03-18 21:35               ` Gage Eads
  0 siblings, 0 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-18 21:35 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl, gage.eads

On 64-bit systems, at current CPU speeds, 64-bit head and tail indexes
will not wrap around within the author's lifetime. This is important for
avoiding the ABA problem in the upcoming lock-free ring implementation; in
that problem, a thread mistakes reading the same tail index in two accesses
to mean that the ring was not modified in the intervening time. Using a
64-bit index makes the possibility of this occurring effectively zero. This
commit uses pointer-width indexes so the lock-free ring can support 32-bit
systems as well.
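
As a rough illustration of the wrap-around argument, the sketch below
estimates how long an index of a given width takes to wrap. The function
names and the rate used in the note that follows (one index increment per
cycle at 3 GHz, a deliberately pessimistic figure) are assumptions made for
this example and are not taken from the patch.

```c
#include <assert.h>

/* Seconds until an index that is `bits` wide wraps around, assuming it is
 * incremented `incs_per_sec` times per second (hypothetical rate).
 */
double
index_wrap_seconds(unsigned int bits, double incs_per_sec)
{
	double states = 1.0;
	unsigned int i;

	assert(incs_per_sec > 0.0);

	/* 2^bits distinct index values; doubling is exact in a double */
	for (i = 0; i < bits; i++)
		states *= 2.0;

	return states / incs_per_sec;
}

/* Convert seconds to years of 365.25 days */
double
seconds_to_years(double seconds)
{
	return seconds / (365.25 * 24.0 * 3600.0);
}
```

With the assumed 3e9 increments per second, a 32-bit index wraps in roughly
1.4 seconds, while a 64-bit index takes on the order of two centuries.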

This commit places the new producer and consumer structures in the same
location in struct rte_ring as their 32-bit counterparts. Since the 32-bit
versions are padded out to a cache line, there is space for the new
structure without affecting the layout of struct rte_ring. Thus, the ABI is
preserved.
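
The layout argument can be checked at compile time. The following is a
minimal sketch using stand-in types: the struct names, CACHE_LINE (assumed
to be 64 bytes) and the GCC/clang aligned attribute are illustrative
substitutes for the real rte_ring definitions and __rte_cache_aligned.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE 64 /* assumed cache line size for this sketch */

struct headtail {		/* stand-in for struct rte_ring_headtail */
	volatile uint32_t head;
	volatile uint32_t tail;
	uint32_t single;
};

struct headtail_ptr {		/* stand-in for struct rte_ring_headtail_ptr */
	volatile uintptr_t head;
	volatile uintptr_t tail;
	uint32_t single;
};

/* Layout before the change: one cache-aligned 32-bit structure */
struct old_slot {
	struct headtail ht __attribute__((aligned(CACHE_LINE)));
};

/* Layout after the change: both structures share the same storage */
struct new_slot {
	union {
		struct headtail ht __attribute__((aligned(CACHE_LINE)));
		struct headtail_ptr ht_ptr __attribute__((aligned(CACHE_LINE)));
	};
};

/* The wider structure must still fit in the padded cache line... */
static_assert(sizeof(struct headtail_ptr) <= CACHE_LINE,
	      "pointer-width headtail exceeds a cache line");
/* ...so wrapping it in a union does not change the slot size. */
static_assert(sizeof(struct old_slot) == sizeof(struct new_slot),
	      "union changed the slot size");

/* Run-time view of the same check, returns 1 when the sizes match */
int
slots_have_same_size(void)
{
	return sizeof(struct old_slot) == sizeof(struct new_slot);
}
```

If either assertion failed, later members of the enclosing structure would
move and the layout would no longer be preserved.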

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.h         |  21 +++++-
 lib/librte_ring/rte_ring_c11_mem.h | 143 +++++++++++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_generic.h | 130 +++++++++++++++++++++++++++++++++
 3 files changed, 291 insertions(+), 3 deletions(-)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index af5444a9f..c78db6916 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -1,6 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -70,6 +70,13 @@ struct rte_ring_headtail {
 	uint32_t single;         /**< True if single prod/cons */
 };
 
+/* Structure to hold a pair of pointer-sized head/tail values and metadata */
+struct rte_ring_headtail_ptr {
+	volatile uintptr_t head; /**< Prod/consumer head. */
+	volatile uintptr_t tail; /**< Prod/consumer tail. */
+	uint32_t single;         /**< True if single prod/cons */
+};
+
 /**
  * An RTE ring structure.
  *
@@ -97,11 +104,19 @@ struct rte_ring {
 	char pad0 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring producer status. */
-	struct rte_ring_headtail prod __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail prod __rte_cache_aligned;
+		struct rte_ring_headtail_ptr prod_ptr __rte_cache_aligned;
+	};
 	char pad1 __rte_cache_aligned; /**< empty cache line */
 
 	/** Ring consumer status. */
-	struct rte_ring_headtail cons __rte_cache_aligned;
+	RTE_STD_C11
+	union {
+		struct rte_ring_headtail cons __rte_cache_aligned;
+		struct rte_ring_headtail_ptr cons_ptr __rte_cache_aligned;
+	};
 	char pad2 __rte_cache_aligned; /**< empty cache line */
 };
 
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 0fb73a337..545caf257 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -178,4 +178,147 @@ __rte_ring_move_cons_head(struct rte_ring *r, int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	uintptr_t cons_tail;
+	unsigned int max = n;
+	int success;
+
+	*old_head = __atomic_load_n(&r->prod_ptr.head, __ATOMIC_RELAXED);
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		cons_tail = __atomic_load_n(&r->cons_ptr.tail,
+					__ATOMIC_ACQUIRE);
+
+		*free_entries = (capacity + cons_tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_ptr.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head is updated */
+			success = __atomic_compare_exchange_n(&r->prod_ptr.head,
+					old_head, *new_head,
+					0, __ATOMIC_RELAXED,
+					__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	uintptr_t prod_tail;
+	int success;
+
+	/* move cons.head atomically */
+	*old_head = __atomic_load_n(&r->cons_ptr.head, __ATOMIC_RELAXED);
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		/* Ensure the head is read before tail */
+		__atomic_thread_fence(__ATOMIC_ACQUIRE);
+
+		/* this load-acquire synchronize with store-release of ht->tail
+		 * in update_tail.
+		 */
+		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
+					__ATOMIC_ACQUIRE);
+
+		*entries = (prod_tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_ptr.head = *new_head, success = 1;
+		else
+			/* on failure, *old_head will be updated */
+			success = __atomic_compare_exchange_n(&r->cons_ptr.head,
+							old_head, *new_head,
+							0, __ATOMIC_RELAXED,
+							__ATOMIC_RELAXED);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_C11_MEM_H_ */
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index ea7dbe5b9..6a0e1bbfb 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -167,4 +167,134 @@ __rte_ring_move_cons_head(struct rte_ring *r, unsigned int is_sc,
 	return n;
 }
 
+/**
+ * @internal This function updates the producer head for enqueue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sp
+ *   Indicates whether multi-producer path is needed or not
+ * @param n
+ *   The number of elements we will want to enqueue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where enqueue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where enqueue finishes
+ * @param free_entries
+ *   Returns the amount of free space in the ring BEFORE head was moved
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *free_entries)
+{
+	const uint32_t capacity = r->capacity;
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Reset n to the initial burst count */
+		n = max;
+
+		*old_head = r->prod_ptr.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		*free_entries = (capacity + r->cons_ptr.tail - *old_head);
+
+		/* check that we have enough room in ring */
+		if (unlikely(n > *free_entries))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ?
+					0 : *free_entries;
+
+		if (n == 0)
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sp)
+			r->prod_ptr.head = *new_head, success = 1;
+		else
+			success = __sync_bool_compare_and_swap(
+					&r->prod_ptr.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
+/**
+ * @internal This function updates the consumer head for dequeue using
+ *	     pointer-sized head/tail values.
+ *
+ * @param r
+ *   A pointer to the ring structure
+ * @param is_sc
+ *   Indicates whether multi-consumer path is needed or not
+ * @param n
+ *   The number of elements we will want to dequeue, i.e. how far should the
+ *   head be moved
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param old_head
+ *   Returns head value as it was before the move, i.e. where dequeue starts
+ * @param new_head
+ *   Returns the current/new head value i.e. where dequeue finishes
+ * @param entries
+ *   Returns the number of entries in the ring BEFORE head was moved
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
+		unsigned int n, enum rte_ring_queue_behavior behavior,
+		uintptr_t *old_head, uintptr_t *new_head,
+		uint32_t *entries)
+{
+	unsigned int max = n;
+	int success;
+
+	do {
+		/* Restore n as it may change every loop */
+		n = max;
+
+		*old_head = r->cons_ptr.head;
+
+		/* add rmb barrier to avoid load/load reorder in weak
+		 * memory model. It is noop on x86
+		 */
+		rte_smp_rmb();
+
+		*entries = (r->prod_ptr.tail - *old_head);
+
+		/* Set the actual entries for dequeue */
+		if (n > *entries)
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : *entries;
+
+		if (unlikely(n == 0))
+			return 0;
+
+		*new_head = *old_head + n;
+		if (is_sc)
+			r->cons_ptr.head = *new_head, success = 1;
+		else
+			success = __sync_bool_compare_and_swap(
+					&r->cons_ptr.head,
+					*old_head, *new_head);
+	} while (unlikely(success == 0));
+	return n;
+}
+
 #endif /* _RTE_RING_GENERIC_H_ */
-- 
2.13.6


^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v7 2/6] ring: add a ring start marker
  2019-03-18 21:35           ` [dpdk-dev] [PATCH v7 0/6] Add lock-free ring and mempool handler Gage Eads
  2019-03-18 21:35             ` Gage Eads
  2019-03-18 21:35             ` [dpdk-dev] [PATCH v7 1/6] ring: add a pointer-width headtail structure Gage Eads
@ 2019-03-18 21:35             ` Gage Eads
  2019-03-18 21:35               ` Gage Eads
  2019-03-18 21:35             ` [dpdk-dev] [PATCH v7 3/6] ring: add a lock-free implementation Gage Eads
                               ` (5 subsequent siblings)
  8 siblings, 1 reply; 123+ messages in thread
From: Gage Eads @ 2019-03-18 21:35 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl, gage.eads

This marker allows us to replace "&r[1]" with "&r->ring" to locate the
start of the ring.
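
A small sketch of why the two expressions name the same address, using a
hypothetical toy_ring type rather than the real struct rte_ring. CACHE_LINE
(assumed 64) and the GCC/clang aligned attribute stand in for
__rte_cache_aligned. The equivalence relies on the marker being the last
member and on the header size being a multiple of the marker's alignment.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE 64 /* assumed cache line size for this sketch */

struct toy_ring {
	unsigned int size;
	char pad __attribute__((aligned(CACHE_LINE)));
	/* zero-length marker: the slots start where the header ends */
	void *ring[] __attribute__((aligned(CACHE_LINE)));
};

/* Returns 1 when the marker and the "one past the header" idiom agree.
 * Only addresses are compared; nothing past the header is dereferenced.
 */
int
marker_matches_header_end(struct toy_ring *r)
{
	return (void *)r->ring == (void *)&r[1];
}
```

The named member states the intent directly and does not depend on the
reader knowing that &r[1] means "the first byte after the header".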

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.h | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index c78db6916..f16d77b8a 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -118,6 +118,7 @@ struct rte_ring {
 		struct rte_ring_headtail_ptr cons_ptr __rte_cache_aligned;
 	};
 	char pad2 __rte_cache_aligned; /**< empty cache line */
+	void *ring[] __rte_cache_aligned; /**< empty marker for ring start */
 };
 
 #define RING_F_SP_ENQ 0x0001 /**< The default enqueue is "single-producer". */
@@ -361,7 +362,7 @@ __rte_ring_do_enqueue(struct rte_ring *r, void * const *obj_table,
 	if (n == 0)
 		goto end;
 
-	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n, void *);
+	ENQUEUE_PTRS(r, &r->ring, prod_head, obj_table, n, void *);
 
 	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
 end:
@@ -403,7 +404,7 @@ __rte_ring_do_dequeue(struct rte_ring *r, void **obj_table,
 	if (n == 0)
 		goto end;
 
-	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n, void *);
+	DEQUEUE_PTRS(r, &r->ring, cons_head, obj_table, n, void *);
 
 	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
 
-- 
2.13.6

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v7 3/6] ring: add a lock-free implementation
  2019-03-18 21:35           ` [dpdk-dev] [PATCH v7 0/6] Add lock-free ring and mempool handler Gage Eads
                               ` (2 preceding siblings ...)
  2019-03-18 21:35             ` [dpdk-dev] [PATCH v7 2/6] ring: add a ring start marker Gage Eads
@ 2019-03-18 21:35             ` Gage Eads
  2019-03-18 21:35               ` Gage Eads
  2019-03-19 15:50               ` Stephen Hemminger
  2019-03-18 21:35             ` [dpdk-dev] [PATCH v7 4/6] test_ring: add lock-free ring autotest Gage Eads
                               ` (4 subsequent siblings)
  8 siblings, 2 replies; 123+ messages in thread
From: Gage Eads @ 2019-03-18 21:35 UTC (permalink / raw)
  To: dev
  Cc: olivier.matz, arybchenko, bruce.richardson, konstantin.ananyev,
	stephen, jerinj, mczekaj, nd, Ola.Liljedahl, gage.eads

This commit adds support for lock-free circular ring enqueue and dequeue
functions. The ring is supported on 32- and 64-bit architectures; however,
it uses a 128-bit compare-and-swap instruction when run on a 64-bit
architecture, so among 64-bit targets it is currently limited to x86_64.

The algorithm is based on Ola Liljedahl's lfring, modified to fit within
the rte ring API. With no contention, an enqueue of n pointers uses (1 + n)
CAS operations and a dequeue of n pointers uses 1. This algorithm has worse
average-case performance than the regular rte ring (particularly a
highly-contended ring with large bulk accesses), however:
- For applications with preemptible pthreads, the regular rte ring's
  worst-case performance (i.e. one thread being preempted in the
  update_tail() critical section) is much worse than the lock-free ring's.
- Software caching can mitigate the average case performance for ring-based
  algorithms. For example, a lock-free ring based mempool (a likely use
  case for this ring) with per-thread caching.

To avoid the ABA problem, each ring entry contains a modification counter.
On a 64-bit architecture, the chance of ABA occurring is effectively zero;
a 64-bit counter will take many years to wrap at current CPU frequencies.
On 32-bit architectures, a lock-free ring must be at least 1024 entries
deep; assuming 100 cycles per ring entry access, this guarantees that the
ring's modification counters take on the order of days to wrap.
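
The 32-bit estimate can be reproduced with a few lines of arithmetic. This
is a sketch only: the function name is hypothetical, and the 2 GHz clock
used in the note that follows is the figure from the code comment that
accompanies MIN_32_BIT_LF_RING_SIZE later in this patch.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds before a 32-bit per-entry counter repeats, if some entry is
 * written every `cycles_per_write` cycles on average. Each entry is then
 * reused every cycles_per_write * count cycles, and its counter repeats
 * after 2^32 reuses.
 */
double
lf_counter_repeat_seconds(double cycles_per_write, uint32_t count, double hz)
{
	const double counter_values = 4294967296.0; /* 2^32 */

	assert(hz > 0.0);

	return cycles_per_write * (double)count * counter_values / hz;
}
```

For 100 cycles per write, 1024 entries and a 2 GHz clock this gives about
220,000 seconds, i.e. roughly two and a half days; smaller rings shrink the
interval proportionally.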

The lock-free ring is enabled via a new flag, RING_F_LF. Because the ring's
memsize is now a function of its flags (each lock-free ring entry takes two
pointer widths, i.e. 128 bits on a 64-bit architecture), this commit adds a
new argument ('flags') to rte_ring_get_memsize(). An API deprecation notice
will be sent in a separate commit.

For ease-of-use, existing ring enqueue and dequeue functions work on both
regular and lock-free rings. This introduces an additional branch in the
datapath, but this should be a highly predictable branch.
ring_perf_autotest shows a negligible performance impact; it's hard to
distinguish a real difference versus system noise.

                                  | ring_perf_autotest cycles with branch -
             Test                 |   ring_perf_autotest cycles without
------------------------------------------------------------------
SP/SC single enq/dequeue          | 0.33
MP/MC single enq/dequeue          | -4.00
SP/SC burst enq/dequeue (size 8)  | 0.00
MP/MC burst enq/dequeue (size 8)  | 0.00
SP/SC burst enq/dequeue (size 32) | 0.00
MP/MC burst enq/dequeue (size 32) | 0.00
SC empty dequeue                  | 1.00
MC empty dequeue                  | 0.00

Single lcore:
SP/SC bulk enq/dequeue (size 8)   | 0.49
MP/MC bulk enq/dequeue (size 8)   | 0.08
SP/SC bulk enq/dequeue (size 32)  | 0.07
MP/MC bulk enq/dequeue (size 32)  | 0.09

Two physical cores:
SP/SC bulk enq/dequeue (size 8)   | 0.19
MP/MC bulk enq/dequeue (size 8)   | -0.37
SP/SC bulk enq/dequeue (size 32)  | 0.09
MP/MC bulk enq/dequeue (size 32)  | -0.05

Two NUMA nodes:
SP/SC bulk enq/dequeue (size 8)   | -1.96
MP/MC bulk enq/dequeue (size 8)   | 0.88
SP/SC bulk enq/dequeue (size 32)  | 0.10
MP/MC bulk enq/dequeue (size 32)  | 0.46

Test setup: x86_64 build with default config, dual-socket Xeon E5-2699 v4,
running on isolcpus cores with a tickless scheduler. Each test was run three
times and the results averaged.

Signed-off-by: Gage Eads <gage.eads@intel.com>
---
 lib/librte_ring/rte_ring.c           |  92 +++++++--
 lib/librte_ring/rte_ring.h           | 308 ++++++++++++++++++++++++++---
 lib/librte_ring/rte_ring_c11_mem.h   | 366 ++++++++++++++++++++++++++++++++++-
 lib/librte_ring/rte_ring_generic.h   | 355 ++++++++++++++++++++++++++++++++-
 lib/librte_ring/rte_ring_version.map |   7 +
 5 files changed, 1080 insertions(+), 48 deletions(-)

diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d215acecc..d4a176f57 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -45,9 +45,9 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags)
 {
-	ssize_t sz;
+	ssize_t sz, elt_sz;
 
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
@@ -57,10 +57,23 @@ rte_ring_get_memsize(unsigned count)
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	elt_sz = (flags & RING_F_LF) ? 2 * sizeof(void *) : sizeof(void *);
+
+	sz = sizeof(struct rte_ring) + count * elt_sz;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
+BIND_DEFAULT_SYMBOL(rte_ring_get_memsize, _v1905, 19.05);
+MAP_STATIC_SYMBOL(ssize_t rte_ring_get_memsize(unsigned int count,
+					       unsigned int flags),
+		  rte_ring_get_memsize_v1905);
+
+ssize_t
+rte_ring_get_memsize_v20(unsigned int count)
+{
+	return rte_ring_get_memsize_v1905(count, 0);
+}
+VERSION_SYMBOL(rte_ring_get_memsize, _v20, 2.0);
 
 int
 rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
@@ -75,6 +88,8 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 			  RTE_CACHE_LINE_MASK) != 0);
 	RTE_BUILD_BUG_ON((offsetof(struct rte_ring, prod) &
 			  RTE_CACHE_LINE_MASK) != 0);
+	RTE_BUILD_BUG_ON(sizeof(struct rte_ring_lf_entry) !=
+			 2 * sizeof(void *));
 
 	/* init the ring structure */
 	memset(r, 0, sizeof(*r));
@@ -82,8 +97,6 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	if (ret < 0 || ret >= (int)sizeof(r->name))
 		return -ENAMETOOLONG;
 	r->flags = flags;
-	r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
-	r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
 
 	if (flags & RING_F_EXACT_SZ) {
 		r->size = rte_align32pow2(count + 1);
@@ -100,12 +113,46 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 		r->mask = count - 1;
 		r->capacity = r->mask;
 	}
-	r->prod.head = r->cons.head = 0;
-	r->prod.tail = r->cons.tail = 0;
+
+	r->log2_size = rte_log2_u64(r->size);
+
+	if (flags & RING_F_LF) {
+		uint32_t i;
+
+		r->prod_ptr.single =
+			(flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons_ptr.single =
+			(flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod_ptr.head = r->cons_ptr.head = 0;
+		r->prod_ptr.tail = r->cons_ptr.tail = 0;
+
+		for (i = 0; i < r->size; i++) {
+			struct rte_ring_lf_entry *ring_ptr, *base;
+
+			base = (struct rte_ring_lf_entry *)&r->ring;
+
+			ring_ptr = &base[i & r->mask];
+
+			ring_ptr->cnt = 0;
+		}
+	} else {
+		r->prod.single = (flags & RING_F_SP_ENQ) ? __IS_SP : __IS_MP;
+		r->cons.single = (flags & RING_F_SC_DEQ) ? __IS_SC : __IS_MC;
+		r->prod.head = r->cons.head = 0;
+		r->prod.tail = r->cons.tail = 0;
+	}
 
 	return 0;
 }
 
+/* If a ring entry is written on average every M cycles, then a ring entry is
+ * reused every M*count cycles, and a ring entry's counter repeats every
+ * M*count*2^32 cycles. If M=100 on a 2GHz system, then a 1024-entry ring's
+ * counters would repeat every 2.55 days. The likelihood of ABA occurring is
+ * considered sufficiently low for 1024-entry and larger rings.
+ */
+#define MIN_32_BIT_LF_RING_SIZE 1024
+
 /* create the ring */
 struct rte_ring *
 rte_ring_create(const char *name, unsigned count, int socket_id,
@@ -123,11 +170,25 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 
 	ring_list = RTE_TAILQ_CAST(rte_ring_tailq.head, rte_ring_list);
 
+#if defined(RTE_ARCH_64) && !defined(RTE_ARCH_X86_64)
+	if (flags & RING_F_LF) {
+		printf("This platform does not support the atomic operation required for RING_F_LF\n");
+		rte_errno = EINVAL;
+		return NULL;
+	}
+#elif !defined(RTE_ARCH_64)
+	if ((flags & RING_F_LF) && count < MIN_32_BIT_LF_RING_SIZE) {
+		printf("RING_F_LF is only supported on 32-bit platforms for rings with at least 1024 entries.\n");
+		rte_errno = EINVAL;
+		return NULL;
+	}
+#endif
+
 	/* for an exact size ring, round up from count to a power of two */
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize(count, flags);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -227,10 +288,17 @@ rte_ring_dump(FILE *f, const struct rte_ring *r)
 	fprintf(f, "  flags=%x\n", r->flags);
 	fprintf(f, "  size=%"PRIu32"\n", r->size);
 	fprintf(f, "  capacity=%"PRIu32"\n", r->capacity);
-	fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
-	fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
-	fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
-	fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	if (r->flags & RING_F_LF) {
+		fprintf(f, "  ct=%"PRIuPTR"\n", r->cons_ptr.tail);
+		fprintf(f, "  ch=%"PRIuPTR"\n", r->cons_ptr.head);
+		fprintf(f, "  pt=%"PRIuPTR"\n", r->prod_ptr.tail);
+		fprintf(f, "  ph=%"PRIuPTR"\n", r->prod_ptr.head);
+	} else {
+		fprintf(f, "  ct=%"PRIu32"\n", r->cons.tail);
+		fprintf(f, "  ch=%"PRIu32"\n", r->cons.head);
+		fprintf(f, "  pt=%"PRIu32"\n", r->prod.tail);
+		fprintf(f, "  ph=%"PRIu32"\n", r->prod.head);
+	}
 	fprintf(f, "  used=%u\n", rte_ring_count(r));
 	fprintf(f, "  avail=%u\n", rte_ring_free_count(r));
 }
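Note on the MIN_32_BIT_LF_RING_SIZE comment in rte_ring.c: the counter-wrap
estimate can be checked with a few lines of arithmetic. The sketch below is
not part of the patch; the function name and its parameters are hypothetical
and only restate the M*count*2^32 formula from the comment. For M=100,
count=1024 and a 2GHz clock it gives roughly 219,902 seconds, or about 2.5
days.

```c
#include <assert.h>
#include <stdint.h>

/* Seconds until one slot's 32-bit modification counter repeats, assuming
 * (hypothetically) that the ring sees one write every cycles_per_write
 * cycles on average. A slot is reused once per lap, i.e. once every
 * ring_entries writes, and its counter repeats after 2^32 laps.
 */
static double
lf_counter_wrap_seconds(double cycles_per_write, uint32_t ring_entries,
			double cpu_hz)
{
	const double laps_to_wrap = 4294967296.0; /* 2^32 */

	assert(cycles_per_write > 0.0 && ring_entries > 0 && cpu_hz > 0.0);

	return cycles_per_write * ring_entries * laps_to_wrap / cpu_hz;
}
```

Dividing the result by 86400 converts it to days. Halving the ring size
halves the wrap period, which is why smaller rings are rejected on 32-bit
platforms.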
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index f16d77b8a..200d7b2a0 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -20,7 +20,7 @@
  *
  * - FIFO (First In First Out)
  * - Maximum size is fixed; the pointers are stored in a table.
- * - Lockless implementation.
+ * - Lockless (and optionally, non-blocking/lock-free) implementation.
  * - Multi- or single-consumer dequeue.
  * - Multi- or single-producer enqueue.
  * - Bulk dequeue.
@@ -98,6 +98,7 @@ struct rte_ring {
 	const struct rte_memzone *memzone;
 			/**< Memzone, if any, containing the rte_ring */
 	uint32_t size;           /**< Size of ring. */
+	uint32_t log2_size;      /**< log2(size of ring) */
 	uint32_t mask;           /**< Mask (size-1) of ring. */
 	uint32_t capacity;       /**< Usable size of ring */
 
@@ -133,6 +134,18 @@ struct rte_ring {
  */
 #define RING_F_EXACT_SZ 0x0004
 #define RTE_RING_SZ_MASK  (0x7fffffffU) /**< Ring size mask */
+/**
+ * The ring uses lock-free enqueue and dequeue functions. These functions
+ * do not have the "non-preemptive" constraint of a regular rte ring, and thus
+ * are suited for applications using preemptible pthreads. However, the
+ * lock-free functions have worse average-case performance than their regular
+ * rte ring counterparts. When used as the handler for a mempool, per-thread
+ * caching can mitigate the performance difference by reducing the number (and
+ * contention) of ring accesses.
+ *
+ * This flag is only supported on 32-bit and x86_64 platforms.
+ */
+#define RING_F_LF 0x0008
 
 /* @internal defines for passing to the enqueue dequeue worker functions */
 #define __IS_SP 1
@@ -150,11 +163,15 @@ struct rte_ring {
  *
  * @param count
  *   The number of elements in the ring (must be a power of 2).
+ * @param flags
+ *   The flags the ring will be created with.
  * @return
  *   - The memory size needed for the ring on success.
  *   - -EINVAL if count is not a power of 2.
  */
-ssize_t rte_ring_get_memsize(unsigned count);
+ssize_t rte_ring_get_memsize(unsigned int count, unsigned int flags);
+ssize_t rte_ring_get_memsize_v20(unsigned int count);
+ssize_t rte_ring_get_memsize_v1905(unsigned int count, unsigned int flags);
 
 /**
  * Initialize a ring structure.
@@ -187,6 +204,10 @@ ssize_t rte_ring_get_memsize(unsigned count);
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_LF: If this flag is set, the ring uses lock-free variants of the
+ *      dequeue and enqueue functions.
  * @return
  *   0 on success, or a negative value on error.
  */
@@ -222,12 +243,17 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
  *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
  *      is "single-consumer". Otherwise, it is "multi-consumers".
+ *    - RING_F_EXACT_SZ: If this flag is set, count can be a non-power-of-2
+ *      number, but up to half the ring space may be wasted.
+ *    - RING_F_LF: If this flag is set, the ring uses lock-free variants of the
+ *      dequeue and enqueue functions.
  * @return
  *   On success, the pointer to the new allocated ring. NULL on error with
  *    rte_errno set appropriately. Possible errno values include:
  *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
  *    - E_RTE_SECONDARY - function was called from a secondary process instance
- *    - EINVAL - count provided is not a power of 2
+ *    - EINVAL - count provided is not a power of 2, or RING_F_LF is used on an
+ *      unsupported platform
  *    - ENOSPC - the maximum number of memzones has already been allocated
  *    - EEXIST - a memzone with the same name already exists
  *    - ENOMEM - no appropriate memory area found in which to create memzone
@@ -283,6 +309,50 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual enqueue of pointers on the lock-free ring, used by the
+ * single-producer lock-free enqueue function.
+ */
+#define ENQUEUE_PTRS_LF(r, base, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	size_t idx = prod_head & (r)->mask; \
+	size_t new_cnt = prod_head + size; \
+	struct rte_ring_lf_entry *ring = (struct rte_ring_lf_entry *)base; \
+	unsigned int mask = ~0x3; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & mask); i += 4, idx += 4) { \
+			ring[idx].ptr = obj_table[i]; \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx + 1].ptr = obj_table[i + 1]; \
+			ring[idx + 1].cnt = (new_cnt + i + 1) >> r->log2_size; \
+			ring[idx + 2].ptr = obj_table[i + 2]; \
+			ring[idx + 2].cnt = (new_cnt + i + 2) >> r->log2_size; \
+			ring[idx + 3].ptr = obj_table[i + 3]; \
+			ring[idx + 3].cnt = (new_cnt + i + 3) >> r->log2_size; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx++].ptr = obj_table[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx++].ptr = obj_table[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) { \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+		for (idx = 0; i < n; i++, idx++) {    \
+			ring[idx].cnt = (new_cnt + i) >> r->log2_size; \
+			ring[idx].ptr = obj_table[i]; \
+		} \
+	} \
+} while (0)
+
 /* the actual copy of pointers on the ring to obj_table.
  * Placed here since identical code needed in both
  * single and multi consumer dequeue functions */
@@ -314,6 +384,43 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 	} \
 } while (0)
 
+/* The actual copy of pointers on the lock-free ring to obj_table. */
+#define DEQUEUE_PTRS_LF(r, base, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	size_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	struct rte_ring_lf_entry *ring = (struct rte_ring_lf_entry *)base; \
+	unsigned int mask = ~0x3; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & mask); i += 4, idx += 4) {\
+			obj_table[i] = ring[idx].ptr; \
+			obj_table[i + 1] = ring[idx + 1].ptr; \
+			obj_table[i + 2] = ring[idx + 2].ptr; \
+			obj_table[i + 3] = ring[idx + 3].ptr; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 2: \
+			obj_table[i++] = ring[idx++].ptr; /* fallthrough */ \
+		case 1: \
+			obj_table[i++] = ring[idx++].ptr; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj_table[i] = ring[idx].ptr; \
+	} \
+} while (0)
+
+
+/* @internal Ring entry: 128 bits on 64-bit platforms, 64 bits on 32-bit */
+struct rte_ring_lf_entry {
+	void *ptr; /**< Data pointer */
+	uintptr_t cnt; /**< Modification counter */
+};
+
 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
  * There are 2 choices for the users
@@ -330,6 +437,70 @@ void rte_ring_dump(FILE *f, const struct rte_ring *r);
 #endif
 
 /**
+ * @internal Enqueue several objects on the lock-free ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue(struct rte_ring *r, void * const *obj_table,
+			 unsigned int n, enum rte_ring_queue_behavior behavior,
+			 unsigned int is_sp, unsigned int *free_space)
+{
+	if (is_sp)
+		return __rte_ring_do_lf_enqueue_sp(r, obj_table, n,
+						   behavior, free_space);
+	else
+		return __rte_ring_do_lf_enqueue_mp(r, obj_table, n,
+						   behavior, free_space);
+}
+
+/**
+ * @internal Dequeue several objects from the lock-free ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue(struct rte_ring *r, void **obj_table,
+		 unsigned int n, enum rte_ring_queue_behavior behavior,
+		 unsigned int is_sc, unsigned int *available)
+{
+	if (is_sc)
+		return __rte_ring_do_lf_dequeue_sc(r, obj_table, n,
+						   behavior, available);
+	else
+		return __rte_ring_do_lf_dequeue_mc(r, obj_table, n,
+						   behavior, available);
+}
+
+/**
  * @internal Enqueue several objects on the ring
  *
   * @param r
@@ -436,8 +607,14 @@ static __rte_always_inline unsigned int
 rte_ring_mp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MP,
+					     free_space);
 }
 
 /**
@@ -459,8 +636,14 @@ static __rte_always_inline unsigned int
 rte_ring_sp_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SP,
+						free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SP,
+					     free_space);
 }
 
 /**
@@ -486,8 +669,14 @@ static __rte_always_inline unsigned int
 rte_ring_enqueue_bulk(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->prod_ptr.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -570,8 +759,14 @@ static __rte_always_inline unsigned int
 rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_MC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_MC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_MC,
+					     available);
 }
 
 /**
@@ -594,8 +789,14 @@ static __rte_always_inline unsigned int
 rte_ring_sc_dequeue_bulk(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-			__IS_SC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED, __IS_SC,
+						available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED, __IS_SC,
+					     available);
 }
 
 /**
@@ -621,8 +822,14 @@ static __rte_always_inline unsigned int
 rte_ring_dequeue_bulk(struct rte_ring *r, void **obj_table, unsigned int n,
 		unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
-				r->cons.single, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_FIXED,
+						r->cons_ptr.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_FIXED,
+					     r->cons.single, available);
 }
 
 /**
@@ -697,9 +904,13 @@ rte_ring_dequeue(struct rte_ring *r, void **obj_p)
 static inline unsigned
 rte_ring_count(const struct rte_ring *r)
 {
-	uint32_t prod_tail = r->prod.tail;
-	uint32_t cons_tail = r->cons.tail;
-	uint32_t count = (prod_tail - cons_tail) & r->mask;
+	uint32_t count;
+
+	if (r->flags & RING_F_LF)
+		count = (r->prod_ptr.tail - r->cons_ptr.tail) & r->mask;
+	else
+		count = (r->prod.tail - r->cons.tail) & r->mask;
+
 	return (count > r->capacity) ? r->capacity : count;
 }
 
@@ -819,8 +1030,14 @@ static __rte_always_inline unsigned
 rte_ring_mp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MP, free_space);
 }
 
 /**
@@ -842,8 +1059,14 @@ static __rte_always_inline unsigned
 rte_ring_sp_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 			 unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SP, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SP, free_space);
 }
 
 /**
@@ -869,8 +1092,14 @@ static __rte_always_inline unsigned
 rte_ring_enqueue_burst(struct rte_ring *r, void * const *obj_table,
 		      unsigned int n, unsigned int *free_space)
 {
-	return __rte_ring_do_enqueue(r, obj_table, n, RTE_RING_QUEUE_VARIABLE,
-			r->prod.single, free_space);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_enqueue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->prod_ptr.single, free_space);
+	else
+		return __rte_ring_do_enqueue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->prod.single, free_space);
 }
 
 /**
@@ -897,8 +1126,14 @@ static __rte_always_inline unsigned
 rte_ring_mc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_MC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_MC, available);
 }
 
 /**
@@ -922,8 +1157,14 @@ static __rte_always_inline unsigned
 rte_ring_sc_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						__IS_SC, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     __IS_SC, available);
 }
 
 /**
@@ -949,9 +1190,14 @@ static __rte_always_inline unsigned
 rte_ring_dequeue_burst(struct rte_ring *r, void **obj_table,
 		unsigned int n, unsigned int *available)
 {
-	return __rte_ring_do_dequeue(r, obj_table, n,
-				RTE_RING_QUEUE_VARIABLE,
-				r->cons.single, available);
+	if (r->flags & RING_F_LF)
+		return __rte_ring_do_lf_dequeue(r, obj_table, n,
+						RTE_RING_QUEUE_VARIABLE,
+						r->cons_ptr.single, available);
+	else
+		return __rte_ring_do_dequeue(r, obj_table, n,
+					     RTE_RING_QUEUE_VARIABLE,
+					     r->cons.single, available);
 }
 
 #ifdef __cplusplus
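Note on the per-slot modification counter used by the lock-free entries: a
slot at index (tail & mask) is considered free for a given tail when its cnt
equals the lap number (tail >> log2_size), and a successful enqueue moves cnt
to the next lap. The following single-threaded model is a sketch of that rule
only. The struct and function names are hypothetical stand-ins, and the real
multi-producer path performs the fill step with a double-width
compare-and-swap rather than plain stores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for struct rte_ring_lf_entry. */
struct lf_entry {
	void *ptr;	/* data pointer */
	uintptr_t cnt;	/* modification (lap) counter */
};

/* A slot is free for the enqueue at index tail when its counter equals
 * the lap number of tail.
 */
static bool
lf_slot_is_free(const struct lf_entry *ring, uintptr_t tail,
		uint32_t mask, uint32_t log2_size)
{
	return ring[tail & mask].cnt == (tail >> log2_size);
}

/* Single-threaded model of one enqueue: store the pointer and advance the
 * slot's counter to the next lap, so the same tail value no longer matches.
 */
static void
lf_slot_fill(struct lf_entry *ring, uintptr_t tail, void *obj,
	     uint32_t mask, uint32_t log2_size)
{
	const uintptr_t size = (uintptr_t)mask + 1;

	assert(lf_slot_is_free(ring, tail, mask, log2_size));
	ring[tail & mask].ptr = obj;
	ring[tail & mask].cnt = (tail + size) >> log2_size;
}
```

With a 4-entry ring (mask 3, log2_size 2), filling the slot for tail 0 sets
its counter to 1, so a producer still holding tail 0 sees the slot as used,
while the producer that later reaches tail 4 sees it as free again.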
diff --git a/lib/librte_ring/rte_ring_c11_mem.h b/lib/librte_ring/rte_ring_c11_mem.h
index 545caf257..55fc3ed29 100644
--- a/lib/librte_ring/rte_ring_c11_mem.h
+++ b/lib/librte_ring/rte_ring_c11_mem.h
@@ -1,5 +1,7 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
+ * Copyright (c) 2010-2019 Intel Corporation
+ * Copyright (c) 2018-2019 Arm Limited
  * Copyright (c) 2017,2018 HXT-semitech Corporation.
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
@@ -221,8 +223,8 @@ __rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
+		/* load-acquire synchronizes with the cons tail update in
+		 * __rte_ring_do_lf_dequeue_{sc, mc}.
 		 */
 		cons_tail = __atomic_load_n(&r->cons_ptr.tail,
 					__ATOMIC_ACQUIRE);
@@ -247,6 +249,7 @@ __rte_ring_move_prod_head_ptr(struct rte_ring *r, unsigned int is_sp,
 					0, __ATOMIC_RELAXED,
 					__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+
 	return n;
 }
 
@@ -293,8 +296,8 @@ __rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
 		/* Ensure the head is read before tail */
 		__atomic_thread_fence(__ATOMIC_ACQUIRE);
 
-		/* this load-acquire synchronize with store-release of ht->tail
-		 * in update_tail.
+		/* load-acquire synchronizes with store-release of tail in
+		 * __rte_ring_do_lf_enqueue_{sp, mp}.
 		 */
 		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
 					__ATOMIC_ACQUIRE);
@@ -318,6 +321,361 @@ __rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
 							0, __ATOMIC_RELAXED,
 							__ATOMIC_RELAXED);
 	} while (unlikely(success == 0));
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	uintptr_t head, next;
+
+	n = __rte_ring_move_prod_head_ptr(r, 1, n, behavior,
+					  &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_LF(r, &r->ring, head, obj_table, n);
+
+	__atomic_store_n(&r->prod_ptr.tail,
+			 r->prod_ptr.tail + n,
+			 __ATOMIC_RELEASE);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/* This macro defines the number of times an enqueueing thread can fail to find
+ * a free ring slot before reloading its producer tail index.
+ */
+#define ENQ_RETRY_LIMIT 32
+
+/**
+ * @internal
+ *   Get the next producer tail index.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param idx
+ *   The local tail index
+ * @return
+ *   If the ring's shared tail is ahead of idx, return the shared tail.
+ *   Else, return idx + 1.
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_lf_load_tail(struct rte_ring *r, uintptr_t idx)
+{
+	uintptr_t fresh = __atomic_load_n(&r->prod_ptr.tail, __ATOMIC_RELAXED);
+
+	if ((intptr_t)(idx - fresh) < 0)
+		idx = fresh; /* fresh is after idx, use it instead */
+	else
+		idx++; /* Continue with next slot */
+
+	return idx;
+}
+
+/**
+ * @internal
+ *   Update the ring's producer tail index. If another thread already updated
+ *   the index beyond the caller's tail value, do nothing.
+ *
+ * @param r
+ *   A pointer to the ring structure.
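+ * @param val
+ *   The local tail index to publish. The shared tail only moves forward:
+ *   if it is already ahead of val, it is left unchanged. Otherwise it is
+ *   set to val with a compare-and-swap (release ordering on success),
+ *   retried if another thread changes the tail first.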
+ */
+static __rte_always_inline void
+__rte_ring_lf_update_tail(struct rte_ring *r, uintptr_t val)
+{
+	volatile uintptr_t *loc = &r->prod_ptr.tail;
+	uintptr_t old = __atomic_load_n(loc, __ATOMIC_RELAXED);
+
+	do {
+		/* Check if the tail has already been updated. */
+		if ((intptr_t)(val - old) < 0)
+			return;
+
+		/* Else val >= old, need to update *loc */
+	} while (!__atomic_compare_exchange_n(loc, &old, val,
+					      1, __ATOMIC_RELEASE,
+					      __ATOMIC_RELAXED));
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+	printf("[%s()] RING_F_LF requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n"
+	       , __func__);
+	return 0;
+#else
+	struct rte_ring_lf_entry *base;
+	uintptr_t head, next, tail;
+	unsigned int i;
+	uint32_t avail;
+
+	/* Atomically update the prod head to reserve n slots. The prod tail
+	 * is modified at the end of the function.
+	 */
+	n = __rte_ring_move_prod_head_ptr(r, 0, n, behavior,
+					  &head, &next, &avail);
+
+	tail = __atomic_load_n(&r->prod_ptr.tail, __ATOMIC_RELAXED);
+	head = __atomic_load_n(&r->cons_ptr.tail, __ATOMIC_ACQUIRE);
+
+	if (unlikely(n == 0))
+		goto end;
+
+	base = (struct rte_ring_lf_entry *)&r->ring;
+
+	for (i = 0; i < n; i++) {
+		unsigned int retries = 0;
+		int success = 0;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		do {
+			struct rte_ring_lf_entry old_value, new_value;
+			struct rte_ring_lf_entry *ring_ptr;
+
+			ring_ptr = &base[tail & r->mask];
+
+			old_value = *ring_ptr;
+
+			if (old_value.cnt != (tail >> r->log2_size)) {
+				/* This slot has already been used. Depending
+				 * on how far behind this thread is, either go
+				 * to the next slot or reload the tail.
+				 */
+				uintptr_t next_tail;
+
+				next_tail = (tail + r->size) >> r->log2_size;
+
+				if (old_value.cnt != next_tail ||
+				    ++retries == ENQ_RETRY_LIMIT) {
+					/* This thread either fell 2+ laps
+					 * behind or hit the retry limit, so
+					 * reload the tail index.
+					 */
+					tail = __rte_ring_lf_load_tail(r, tail);
+					retries = 0;
+				} else {
+					/* Slot already used, try the next. */
+					tail++;
+
+				}
+
+				continue;
+			}
+
+			/* Found a free slot, try to enqueue next element. */
+			new_value.ptr = obj_table[i];
+			new_value.cnt = (tail + r->size) >> r->log2_size;
+
+#ifdef RTE_ARCH_64
+			success = rte_atomic128_cmp_exchange(
+					(rte_int128_t *)ring_ptr,
+					(rte_int128_t *)&old_value,
+					(rte_int128_t *)&new_value,
+					1, __ATOMIC_RELEASE,
+					__ATOMIC_RELAXED);
+#else
+			success = __atomic_compare_exchange(
+					(uint64_t *)ring_ptr,
+					(uint64_t *)&old_value,
+					(uint64_t *)&new_value,
+					1, __ATOMIC_RELEASE,
+					__ATOMIC_RELAXED);
+#endif
+		} while (success == 0);
+
+		/* Only increment tail if the CAS succeeds, since it can
+		 * spuriously fail on some architectures.
+		 */
+		tail++;
+	}
+
+end:
+	__rte_ring_lf_update_tail(r, tail);
+
+	if (free_space != NULL)
+		*free_space = avail - n;
+	return n;
+#endif
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = __atomic_load_n(&r->cons_ptr.tail, __ATOMIC_RELAXED);
+	prod_tail = __atomic_load_n(&r->prod_ptr.tail, __ATOMIC_ACQUIRE);
+
+	avail = prod_tail - cons_tail;
+
+	/* Set the actual entries for dequeue */
+	if (unlikely(avail < n))
+		n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+	if (unlikely(n == 0))
+		goto end;
+
+	DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+	/* Use a read barrier and store-relaxed so we don't unnecessarily order
+	 * writes.
+	 */
+	rte_smp_rmb();
+
+	__atomic_store_n(&r->cons_ptr.tail, cons_tail + n, __ATOMIC_RELAXED);
+end:
+	if (available != NULL)
+		*available = avail - n;
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = __atomic_load_n(&r->cons_ptr.tail, __ATOMIC_RELAXED);
+
+	do {
+		/* Load tail on every iteration to avoid spurious queue empty
+		 * situations.
+		 */
+		prod_tail = __atomic_load_n(&r->prod_ptr.tail,
+					    __ATOMIC_ACQUIRE);
+
+		avail = prod_tail - cons_tail;
+
+		/* Set the actual entries for dequeue */
+		if (unlikely(avail < n))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+		if (unlikely(n == 0))
+			goto end;
+
+		DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+		/* Use a read barrier and store-relaxed so we don't
+		 * unnecessarily order writes.
+		 */
+		rte_smp_rmb();
+
+	} while (!__atomic_compare_exchange_n(&r->cons_ptr.tail,
+					      &cons_tail, cons_tail + n,
+					      0, __ATOMIC_RELAXED,
+					      __ATOMIC_RELAXED));
+
+end:
+	if (available != NULL)
+		*available = avail - n;
+
 	return n;
 }
 
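Note on the producer tail update in the C11 variant: the tail is a
free-running index, so "already updated" is tested with a signed difference,
which keeps working across wraparound, and the tail is only ever moved
forward. The sketch below restates that idea with <stdatomic.h> instead of the
GCC __atomic builtins used in the patch. The function names are hypothetical,
and the cast to intptr_t relies on the usual two's-complement conversion
provided by GCC and Clang.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Wraparound-safe test: is index a behind index b? Valid while the two
 * indexes are less than half the index range apart.
 */
static int
idx_is_behind(uintptr_t a, uintptr_t b)
{
	return (intptr_t)(a - b) < 0;
}

/* Advance *tail to val unless another thread already moved it past val. */
static void
tail_advance(_Atomic uintptr_t *tail, uintptr_t val)
{
	uintptr_t old;

	assert(tail != NULL);
	old = atomic_load_explicit(tail, memory_order_relaxed);

	do {
		if (idx_is_behind(val, old))
			return;
		/* On failure the CAS writes the current tail into old, so
		 * the check above is repeated against a fresh value.
		 */
	} while (!atomic_compare_exchange_weak_explicit(tail, &old, val,
			memory_order_release, memory_order_relaxed));
}
```

The compare-and-swap refreshes old on failure, which is what lets the loop
terminate when another producer wins the race. A variant built on a CAS that
does not write back the observed value has to reload the tail explicitly on
each iteration.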
diff --git a/lib/librte_ring/rte_ring_generic.h b/lib/librte_ring/rte_ring_generic.h
index 6a0e1bbfb..d0173d6d5 100644
--- a/lib/librte_ring/rte_ring_generic.h
+++ b/lib/librte_ring/rte_ring_generic.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  *
- * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2010-2019 Intel Corporation
+ * Copyright (c) 2018-2019 Arm Limited
  * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
  * All rights reserved.
  * Derived from FreeBSD's bufring.h
@@ -297,4 +298,356 @@ __rte_ring_move_cons_head_ptr(struct rte_ring *r, unsigned int is_sc,
 	return n;
 }
 
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (single-producer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_sp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+	uint32_t free_entries;
+	uintptr_t head, next;
+
+	n = __rte_ring_move_prod_head_ptr(r, 1, n, behavior,
+					  &head, &next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_LF(r, &r->ring, head, obj_table, n);
+
+	rte_smp_wmb();
+
+	r->prod_ptr.tail += n;
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/* This macro defines the number of times an enqueueing thread can fail to find
+ * a free ring slot before reloading its producer tail index.
+ */
+#define ENQ_RETRY_LIMIT 32
+
+/**
+ * @internal
+ *   Get the next producer tail index.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param idx
+ *   The local tail index
+ * @return
+ *   If the ring's tail is ahead of the local tail, return the shared tail.
+ *   Else, return idx + 1.
+ */
+static __rte_always_inline uintptr_t
+__rte_ring_lf_load_tail(struct rte_ring *r, uintptr_t idx)
+{
+	uintptr_t fresh = r->prod_ptr.tail;
+
+	if ((intptr_t)(idx - fresh) < 0)
+		/* fresh is after idx, use it instead */
+		idx = fresh;
+	else
+		/* Continue with next slot */
+		idx++;
+
+	return idx;
+}
+
+/**
+ * @internal
+ *   Update the ring's producer tail index. If another thread already updated
+ *   the index beyond the caller's tail value, do nothing.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param val
+ *   The caller's local tail index. It is published as the shared producer
+ *   tail only if the shared tail has not already advanced beyond it.
+ * @note
+ *   This function returns no value.
+ */
+static __rte_always_inline void
+__rte_ring_lf_update_tail(struct rte_ring *r, uintptr_t val)
+{
+	volatile uintptr_t *loc = &r->prod_ptr.tail;
+	uintptr_t old;
+
+	do {
+		/* Reload: a failed CAS below does not update old. */
+		old = *loc;
+		/* Check if the tail has already been updated. */
+		if ((intptr_t)(val - old) < 0)
+			return;
+	} while (!__sync_bool_compare_and_swap(loc, old, val));
+}
+
+/**
+ * @internal
+ *   Enqueue several objects on the lock-free ring (multi-producer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_enqueue_mp(struct rte_ring *r, void * const *obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *free_space)
+{
+#if !defined(ALLOW_EXPERIMENTAL_API)
+	RTE_SET_USED(r);
+	RTE_SET_USED(obj_table);
+	RTE_SET_USED(n);
+	RTE_SET_USED(behavior);
+	RTE_SET_USED(free_space);
+	printf("[%s()] RING_F_LF requires an experimental API."
+	       " Recompile with ALLOW_EXPERIMENTAL_API to use it.\n",
+	       __func__);
+	return 0;
+#else
+	struct rte_ring_lf_entry *base;
+	uintptr_t head, next, tail;
+	unsigned int i;
+	uint32_t avail;
+
+	/* Atomically update the prod head to reserve n slots. The prod tail
+	 * is modified at the end of the function.
+	 */
+	n = __rte_ring_move_prod_head_ptr(r, 0, n, behavior,
+					  &head, &next, &avail);
+
+	tail = r->prod_ptr.tail;
+
+	rte_smp_rmb();
+
+	head = r->cons_ptr.tail;
+
+	if (unlikely(n == 0))
+		goto end;
+
+	base = (struct rte_ring_lf_entry *)&r->ring;
+
+	for (i = 0; i < n; i++) {
+		unsigned int retries = 0;
+		int success = 0;
+
+		/* Enqueue to the tail entry. If another thread wins the race,
+		 * retry with the new tail.
+		 */
+		do {
+			struct rte_ring_lf_entry old_value, new_value;
+			struct rte_ring_lf_entry *ring_ptr;
+
+			ring_ptr = &base[tail & r->mask];
+
+			old_value = *ring_ptr;
+
+			if (old_value.cnt != (tail >> r->log2_size)) {
+				/* This slot has already been used. Depending
+				 * on how far behind this thread is, either go
+				 * to the next slot or reload the tail.
+				 */
+				uintptr_t next_tail;
+
+				next_tail = (tail + r->size) >> r->log2_size;
+
+				if (old_value.cnt != next_tail ||
+				    ++retries == ENQ_RETRY_LIMIT) {
+					/* This thread either fell 2+ laps
+					 * behind or hit the retry limit, so
+					 * reload the tail index.
+					 */
+					tail = __rte_ring_lf_load_tail(r, tail);
+					retries = 0;
+				} else {
+					/* Slot already used, try the next. */
+					tail++;
+
+				}
+
+				continue;
+			}
+
+			/* Found a free slot, try to enqueue next element. */
+			new_value.ptr = obj_table[i];
+			new_value.cnt = (tail + r->size) >> r->log2_size;
+
+#ifdef RTE_ARCH_64
+			success = rte_atomic128_cmp_exchange(
+					(rte_int128_t *)ring_ptr,
+					(rte_int128_t *)&old_value,
+					(rte_int128_t *)&new_value,
+					1, __ATOMIC_RELEASE,
+					__ATOMIC_RELAXED);
+#else
+			uint64_t *old_ptr = (uint64_t *)&old_value;
+			uint64_t *new_ptr = (uint64_t *)&new_value;
+
+			success = rte_atomic64_cmpset(
+					(volatile uint64_t *)ring_ptr,
+					*old_ptr, *new_ptr);
+#endif
+		} while (success == 0);
+
+		/* Only increment tail if the CAS succeeds, since it can
+		 * spuriously fail on some architectures.
+		 */
+		tail++;
+	}
+
+end:
+
+	__rte_ring_lf_update_tail(r, tail);
+
+	if (free_space != NULL)
+		*free_space = avail - n;
+	return n;
+#endif
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (single-consumer only)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_sc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	cons_tail = r->cons_ptr.tail;
+
+	rte_smp_rmb();
+
+	prod_tail = r->prod_ptr.tail;
+
+	avail = prod_tail - cons_tail;
+
+	/* Set the actual entries for dequeue */
+	if (unlikely(avail < n))
+		n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+	if (unlikely(n == 0))
+		goto end;
+
+	DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+	rte_smp_rmb();
+
+	r->cons_ptr.tail += n;
+end:
+	if (available != NULL)
+		*available = avail - n;
+
+	return n;
+}
+
+/**
+ * @internal
+ *   Dequeue several objects from the lock-free ring (multi-consumer safe)
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of void * pointers (objects).
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from the ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_lf_dequeue_mc(struct rte_ring *r, void **obj_table,
+			    unsigned int n,
+			    enum rte_ring_queue_behavior behavior,
+			    unsigned int *available)
+{
+	uintptr_t cons_tail, prod_tail, avail;
+
+	do {
+		/* Reload on every retry; a failed CAS does not update it. */
+		cons_tail = r->cons_ptr.tail;
+		rte_smp_rmb();
+
+		/* Load tail on every iteration to avoid spurious queue empty
+		 * situations.
+		 */
+		prod_tail = r->prod_ptr.tail;
+
+		avail = prod_tail - cons_tail;
+
+		/* Set the actual entries for dequeue */
+		if (unlikely(avail < n))
+			n = (behavior == RTE_RING_QUEUE_FIXED) ? 0 : avail;
+
+		if (unlikely(n == 0))
+			goto end;
+
+		DEQUEUE_PTRS_LF(r, &r->ring, cons_tail, obj_table, n);
+
+	} while (!__sync_bool_compare_and_swap(&r->cons_ptr.tail,
+					       cons_tail, cons_tail + n));
+
+end:
+	if (available != NULL)
+		*available = avail - n;
+
+	return n;
+}
+
 #endif /* _RTE_RING_GENERIC_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index d935efd0d..8969467af 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -17,3 +17,10 @@ DPDK_2.2 {
 	rte_ring_free;
 
 } DPDK_2.0;
+
+DPDK_19.05 {
+	global:
+
+	rte_ring_get_memsize;
+
+} DPDK_2.2;
-- 
2.13.6
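
For readers following the index arithmetic above, here is a minimal
standalone sketch. It is not part of the patch. It mirrors two expressions
used in the lock-free enqueue path: the wraparound-safe comparison
"(intptr_t)(idx - fresh) < 0" and the per-slot lap count
"tail >> r->log2_size". The helper names (idx_is_behind, idx_lap,
idx_next_lap, idx_selfcheck) are hypothetical, and the sketch assumes the
ring size is exactly 1 << log2_size.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if free-running index a is strictly behind b.
 * The unsigned difference is reinterpreted as signed, so the result stays
 * correct across counter wraparound as long as the two indexes are less than
 * half the index range apart. (The unsigned-to-signed conversion is
 * implementation-defined in C; GCC and Clang wrap modulo 2^N.)
 */
static inline int
idx_is_behind(uintptr_t a, uintptr_t b)
{
	return (intptr_t)(a - b) < 0;
}

/* Hypothetical helper: lap count a slot must hold to be free for index idx. */
static inline uintptr_t
idx_lap(uintptr_t idx, unsigned int log2_size)
{
	return idx >> log2_size;
}

/* Hypothetical helper: lap count stored in the slot by an enqueue at idx,
 * assuming the ring size is 1 << log2_size.
 */
static inline uintptr_t
idx_next_lap(uintptr_t idx, unsigned int log2_size)
{
	return (idx + ((uintptr_t)1 << log2_size)) >> log2_size;
}

/* Usage example for an 8-entry ring (log2_size == 3). */
static inline void
idx_selfcheck(void)
{
	assert(idx_is_behind(5, 7));
	assert(!idx_is_behind(7, 7));
	/* UINTPTR_MAX is two steps behind 1 once the counter wraps. */
	assert(idx_is_behind(UINTPTR_MAX, 1));
	/* Index 10 is on lap 1; enqueueing there marks the slot for lap 2. */
	assert(idx_lap(10, 3) == 1);
	assert(idx_next_lap(10, 3) == 2);
}
```

In the enqueue loop, a slot whose count equals idx_lap(tail) is free, and a
slot whose count equals idx_next_lap(tail) was already filled on the current
lap, so the thread moves to the next slot. Any other value leads to a reload
of the shared tail.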

^ permalink raw reply	[flat|nested] 123+ messages in thread

* [dpdk-dev] [PATCH v7 3/6] ring: add a lock-free implementation
  2019-03-18 21:3