DPDK patches and discussions
* [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size
@ 2019-08-28 14:46 Honnappa Nagarahalli
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 1/5] lib/ring: apis to support configurable " Honnappa Nagarahalli
                   ` (7 more replies)
  0 siblings, 8 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-08-28 14:46 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, gavin.hu, ruifeng.wang, nd

The current rte_ring hard-codes the type of the ring element to 'void *',
hence the size of the element is hard-coded to 32b/64b. Since the ring
element type is not an input to rte_ring APIs, it results in a couple
of issues:

1) If an application needs to store an element which is not 64b, it
   needs to write its own ring APIs similar to rte_event_ring APIs. This
   creates additional burden on the programmers, who simply end up making
   work-arounds and often waste memory.
2) If there are multiple libraries that store elements of the same
   type, currently they would have to write their own rte_ring APIs. This
   results in code duplication.

This patch consists of 4 parts:
1) New APIs to support configurable ring element size
   These will help reduce code duplication in the templates. I think these
   can be made internal (do not expose to DPDK applications, but expose to
   DPDK libraries), feedback needed.

2) rte_ring templates
   The templates provide an easy way to add new APIs for different ring
   element types/sizes which can be used by multiple libraries. These
   also allow for creating APIs to store elements of custom types
   (for ex: a structure)

   The template needs 4 parameters:
   a) RTE_RING_TMPLT_API_SUFFIX - This is used as a suffix to the
      rte_ring APIs.
      For ex: if RTE_RING_TMPLT_API_SUFFIX is '32b', the API name will be
      rte_ring_create_32b
   b) RTE_RING_TMPLT_ELEM_SIZE - Size of the ring element in bytes.
      For ex: sizeof(uint32_t)
   c) RTE_RING_TMPLT_ELEM_TYPE - Type of the ring element.
      For ex: uint32_t. If a common ring library does not use a standard
      data type, it should create its own type by defining a structure
      with standard data type. For ex: for an element size of 96b, one
      could define a structure

      struct s_96b {
          uint32_t a[3];
      };
      The common library can use this structure to define
      RTE_RING_TMPLT_ELEM_TYPE.

      The application using this common ring library should define its
      element type as a union with the above structure.

      union app_element_type {
          struct s_96b v;
          struct app_element {
              uint16_t a;
              uint16_t b;
              uint32_t c;
              uint32_t d;
          } e;
      };
   d) RTE_RING_TMPLT_EXPERIMENTAL - Indicates if the new APIs being defined
      are experimental. Should be set to empty to remove the experimental
      tag.

   The ring library consists of some APIs that are defined as inline
   functions and some APIs that are non-inline functions. The non-inline
   functions are in rte_ring_template.c. However, this file needs to be
   included in other .c files. Any feedback on how to handle this is
   appreciated.

   Note that the templates help create the APIs that are dependent on the
   element size (for ex: rte_ring_create, enqueue/dequeue etc). Other APIs
   that do NOT depend on the element size do not need to be part of the
   template (for ex: rte_ring_dump, rte_ring_count, rte_ring_free_count
   etc).

3) APIs for 32b ring element size
   This uses the templates to create APIs to enqueue/dequeue elements of
   size 32b.

4) rte_hash library is changed to use 32b ring APIs
   The 32b APIs are used in rte_hash library to store the free slot index
   and free bucket index.
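
The union arrangement described under RTE_RING_TMPLT_ELEM_TYPE can be
sketched as follows. This is a minimal illustration and not code from the
patch: the member name 'e' and the helpers app_pack/app_unpack are
hypothetical, and it assumes the compiler adds no padding to either
structure (checked at compile time below).

```c
#include <assert.h>
#include <stdint.h>

/* Element type the common ring library would use for
 * RTE_RING_TMPLT_ELEM_TYPE: 3 x 32 bits = 96 bits. */
struct s_96b {
	uint32_t a[3];
};

/* Application view of the same 96 bits. The member name 'e' is an
 * assumption made for this illustration. */
union app_element_type {
	struct s_96b v;
	struct app_element {
		uint16_t a;
		uint16_t b;
		uint32_t c;
		uint32_t d;
	} e;
};

/* Both views must have the same size for the overlay to make sense. */
static_assert(sizeof(struct app_element) == sizeof(struct s_96b),
	"application element must match the 96-bit ring element");

/* Fill the application view and hand back the ring view. */
static struct s_96b
app_pack(uint16_t a, uint16_t b, uint32_t c, uint32_t d)
{
	union app_element_type u;

	u.e.a = a;
	u.e.b = b;
	u.e.c = c;
	u.e.d = d;
	return u.v;
}

/* Recover the application view from a ring element. */
static struct app_element
app_unpack(struct s_96b v)
{
	union app_element_type u;

	u.v = v;
	return u.e;
}
```

The application would pass the 'v' member to the enqueue/dequeue APIs
generated for struct s_96b and read or write its own fields through the
other member.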

This patch results in the following checkpatch issue:
WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'

The patch is following the rules in the existing code. Please let me know
if this needs to be fixed.

Honnappa Nagarahalli (5):
  lib/ring: apis to support configurable element size
  lib/ring: add template to support different element sizes
  tools/checkpatch: relax constraints on __rte_experimental
  lib/ring: add ring APIs to support 32b ring elements
  lib/hash: use ring with 32b element size to save memory

 devtools/checkpatches.sh             |  11 +-
 lib/librte_hash/rte_cuckoo_hash.c    |  55 ++---
 lib/librte_hash/rte_cuckoo_hash.h    |   2 +-
 lib/librte_ring/Makefile             |   9 +-
 lib/librte_ring/meson.build          |  11 +-
 lib/librte_ring/rte_ring.c           |  34 ++-
 lib/librte_ring/rte_ring.h           |  72 ++++++
 lib/librte_ring/rte_ring_32.c        |  19 ++
 lib/librte_ring/rte_ring_32.h        |  36 +++
 lib/librte_ring/rte_ring_template.c  |  46 ++++
 lib/librte_ring/rte_ring_template.h  | 330 +++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   4 +
 12 files changed, 582 insertions(+), 47 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_32.c
 create mode 100644 lib/librte_ring/rte_ring_32.h
 create mode 100644 lib/librte_ring/rte_ring_template.c
 create mode 100644 lib/librte_ring/rte_ring_template.h

-- 
2.17.1



* [dpdk-dev] [PATCH 1/5] lib/ring: apis to support configurable element size
  2019-08-28 14:46 [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Honnappa Nagarahalli
@ 2019-08-28 14:46 ` Honnappa Nagarahalli
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes Honnappa Nagarahalli
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-08-28 14:46 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, gavin.hu, ruifeng.wang, nd

Current APIs assume ring elements to be pointers. However, in many
use cases, the element size can be different. The new APIs
rte_ring_get_memsize_elem and rte_ring_create_elem help reduce code
duplication while creating rte_ring templates.
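
As a rough illustration of what the element size changes, the size
computation (header plus count * esize, rounded up to a cache line) can be
sketched outside DPDK. The 64-byte cache line, the 384-byte header and all
sketch_* / SKETCH_* names are assumptions for this example only; the real
values come from RTE_CACHE_LINE_SIZE and sizeof(struct rte_ring).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values, for illustration only. */
#define SKETCH_CACHE_LINE 64u
#define SKETCH_HDR_SIZE 384u

/* Round v up to the next multiple of align (align is a power of 2). */
static size_t
sketch_align(size_t v, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (v + align - 1) & ~(align - 1);
}

/* Same arithmetic as rte_ring_get_memsize_elem() in this patch:
 * header + count * esize, rounded up to a cache line. */
static size_t
sketch_memsize(unsigned int count, size_t esize)
{
	size_t sz = SKETCH_HDR_SIZE + (size_t)count * esize;

	return sketch_align(sz, SKETCH_CACHE_LINE);
}
```

Under these assumptions, a ring of 1024 entries needs 8192 bytes of element
storage with 8-byte elements and 4096 bytes with 4-byte elements.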

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_ring/Makefile             |  2 +-
 lib/librte_ring/meson.build          |  3 ++
 lib/librte_ring/rte_ring.c           | 34 +++++++++----
 lib/librte_ring/rte_ring.h           | 72 ++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |  2 +
 5 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 21a36770d..4c8410229 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_ring.a
 
-CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal
 
 EXPORT_MAP := rte_ring_version.map
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index ab8b0b469..74219840a 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -6,3 +6,6 @@ sources = files('rte_ring.c')
 headers = files('rte_ring.h',
 		'rte_ring_c11_mem.h',
 		'rte_ring_generic.h')
+
+# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
+allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d9b308036..879feb9f6 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -46,23 +46,32 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_elem(unsigned count, size_t esize)
 {
 	ssize_t sz;
 
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
 		RTE_LOG(ERR, RING,
-			"Requested size is invalid, must be power of 2, and "
-			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
+			"Requested number of elements is invalid, must be "
+			"power of 2, and do not exceed the limit %u\n",
+			RTE_RING_SZ_MASK);
+
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	sz = sizeof(struct rte_ring) + count * esize;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
 
+/* return the size of memory occupied by a ring */
+ssize_t
+rte_ring_get_memsize(unsigned count)
+{
+	return rte_ring_get_memsize_elem(count, sizeof(void *));
+}
+
 void
 rte_ring_reset(struct rte_ring *r)
 {
@@ -114,10 +123,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	return 0;
 }
 
-/* create the ring */
+/* create the ring for a given element size */
 struct rte_ring *
-rte_ring_create(const char *name, unsigned count, int socket_id,
-		unsigned flags)
+rte_ring_create_elem(const char *name, unsigned count, size_t esize,
+		int socket_id, unsigned flags)
 {
 	char mz_name[RTE_MEMZONE_NAMESIZE];
 	struct rte_ring *r;
@@ -135,7 +144,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize_elem(count, esize);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -182,6 +191,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	return r;
 }
 
+/* create the ring */
+struct rte_ring *
+rte_ring_create(const char *name, unsigned count, int socket_id,
+		unsigned flags)
+{
+	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
+		flags);
+}
+
 /* free the ring */
 void
 rte_ring_free(struct rte_ring *r)
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 2a9f768a1..bbc1202d3 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -122,6 +122,29 @@ struct rte_ring {
 #define __IS_SC 1
 #define __IS_MC 0
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Calculate the memory size needed for a ring with given element size
+ *
+ * This function returns the number of bytes needed for a ring, given
+ * the number of elements in it and the size of the element. This value
+ * is the sum of the size of the structure rte_ring and the size of the
+ * memory needed for storing the elements. The value is aligned to a cache
+ * line size.
+ *
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of elements in the ring (recommended to be a power of 2).
+ * @return
+ *   - The memory size needed for the ring on success.
+ *   - -EINVAL if count is not a power of 2.
+ */
+__rte_experimental
+ssize_t rte_ring_get_memsize_elem(unsigned count, size_t esize);
+
 /**
  * Calculate the memory size needed for a ring
  *
@@ -175,6 +198,54 @@ ssize_t rte_ring_get_memsize(unsigned count);
 int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	unsigned flags);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Create a new ring named *name* that stores elements with given size.
+ *
+ * This function uses ``memzone_reserve()`` to allocate memory. Then it
+ * calls rte_ring_init() to initialize an empty ring.
+ *
+ * The new ring size is set to *count*, which must be a power of
+ * two. Water marking is disabled by default. The real usable ring size
+ * is *count-1* instead of *count* to differentiate a full ring from an
+ * empty ring.
+ *
+ * The ring is added in RTE_TAILQ_RING list.
+ *
+ * @param name
+ *   The name of the ring.
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of elements in the ring (recommended to be a power of 2).
+ * @param socket_id
+ *   The *socket_id* argument is the socket identifier in case of
+ *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
+ *   constraint for the reserved zone.
+ * @param flags
+ *   An OR of the following:
+ *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
+ *      is "single-producer". Otherwise, it is "multi-producers".
+ *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
+ *      is "single-consumer". Otherwise, it is "multi-consumers".
+ * @return
+ *   On success, the pointer to the new allocated ring. NULL on error with
+ *    rte_errno set appropriately. Possible errno values include:
+ *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
+ *    - E_RTE_SECONDARY - function was called from a secondary process instance
+ *    - EINVAL - count provided is not a power of 2
+ *    - ENOSPC - the maximum number of memzones has already been allocated
+ *    - EEXIST - a memzone with the same name already exists
+ *    - ENOMEM - no appropriate memory area found in which to create memzone
+ */
+__rte_experimental
+struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
+				size_t esize, int socket_id, unsigned flags);
+
 /**
  * Create a new ring named *name* in memory.
  *
@@ -216,6 +287,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  */
 struct rte_ring *rte_ring_create(const char *name, unsigned count,
 				 int socket_id, unsigned flags);
+
 /**
  * De-allocate all memory used by the ring.
  *
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index 510c1386e..e410a7503 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -21,6 +21,8 @@ DPDK_2.2 {
 EXPERIMENTAL {
 	global:
 
+	rte_ring_create_elem;
+	rte_ring_get_memsize_elem;
 	rte_ring_reset;
 
 };
-- 
2.17.1



* [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes
  2019-08-28 14:46 [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Honnappa Nagarahalli
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 1/5] lib/ring: apis to support configurable " Honnappa Nagarahalli
@ 2019-08-28 14:46 ` Honnappa Nagarahalli
  2019-10-01 11:47   ` Ananyev, Konstantin
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 3/5] tools/checkpatch: relax constraints on __rte_experimental Honnappa Nagarahalli
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-08-28 14:46 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, gavin.hu, ruifeng.wang, nd

Add templates to support creating ring APIs with different
ring element sizes.
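
The API names are generated by token pasting through two macro levels, so
that the suffix macro is expanded before it is pasted. A self-contained
sketch of that mechanism is below; the SKETCH_* macros and sketch_*
functions are hypothetical stand-ins for RTE_RING_TMPLT_API_SUFFIX,
_rte_fuse, __rte_fuse and __RTE_RING_CONCAT, not part of the patch.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for RTE_RING_TMPLT_API_SUFFIX. */
#define SKETCH_SUFFIX 32

/* Two-level paste: the outer macro expands its arguments first. */
#define SKETCH_FUSE_(a, b) a##_##b
#define SKETCH_FUSE(a, b) SKETCH_FUSE_(a, b)
#define SKETCH_CONCAT(a) SKETCH_FUSE(a, SKETCH_SUFFIX)

/* Pasting directly does not expand the suffix macro. */
#define SKETCH_CONCAT_DIRECT(a) a##_##SKETCH_SUFFIX

/* Stringify after full expansion, to inspect the generated name. */
#define SKETCH_STR_(x) #x
#define SKETCH_STR(x) SKETCH_STR_(x)

/* Compile-time check on the length of the generated name. */
static_assert(sizeof(SKETCH_STR(SKETCH_CONCAT(rte_ring_create))) ==
	sizeof("rte_ring_create_32"), "unexpected generated name length");

/* Name produced through the two-level macro. */
static const char *
sketch_generated_name(void)
{
	return SKETCH_STR(SKETCH_CONCAT(rte_ring_create));
}

/* Name produced by direct pasting, with the suffix left unexpanded. */
static const char *
sketch_direct_name(void)
{
	return SKETCH_STR(SKETCH_CONCAT_DIRECT(rte_ring_create));
}

/* Returns nonzero when the two approaches yield different names. */
static int
sketch_names_differ(void)
{
	return strcmp(sketch_generated_name(), sketch_direct_name()) != 0;
}
```

With the suffix set to 32, the two-level form yields rte_ring_create_32,
which matches the naming described in the cover letter.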

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_ring/Makefile            |   4 +-
 lib/librte_ring/meson.build         |   4 +-
 lib/librte_ring/rte_ring_template.c |  46 ++++
 lib/librte_ring/rte_ring_template.h | 330 ++++++++++++++++++++++++++++
 4 files changed, 382 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_template.c
 create mode 100644 lib/librte_ring/rte_ring_template.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 4c8410229..818898110 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -19,6 +19,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
 					rte_ring_generic.h \
-					rte_ring_c11_mem.h
+					rte_ring_c11_mem.h \
+					rte_ring_template.h \
+					rte_ring_template.c
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index 74219840a..e4e208a7c 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -5,7 +5,9 @@ version = 2
 sources = files('rte_ring.c')
 headers = files('rte_ring.h',
 		'rte_ring_c11_mem.h',
-		'rte_ring_generic.h')
+		'rte_ring_generic.h',
+		'rte_ring_template.h',
+		'rte_ring_template.c')
 
 # rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
 allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring_template.c b/lib/librte_ring/rte_ring_template.c
new file mode 100644
index 000000000..1ca593f95
--- /dev/null
+++ b/lib/librte_ring/rte_ring_template.c
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#include <stdio.h>
+#include <stdarg.h>
+#include <string.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_malloc.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_eal_memconfig.h>
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_branch_prediction.h>
+#include <rte_errno.h>
+#include <rte_string_fns.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "rte_ring.h"
+
+/* return the size of memory occupied by a ring */
+ssize_t
+__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count)
+{
+	return rte_ring_get_memsize_elem(count, RTE_RING_TMPLT_ELEM_SIZE);
+}
+
+/* create the ring */
+struct rte_ring *
+__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
+		int socket_id, unsigned flags)
+{
+	return rte_ring_create_elem(name, count, RTE_RING_TMPLT_ELEM_SIZE,
+		socket_id, flags);
+}
diff --git a/lib/librte_ring/rte_ring_template.h b/lib/librte_ring/rte_ring_template.h
new file mode 100644
index 000000000..b9b14dfbb
--- /dev/null
+++ b/lib/librte_ring/rte_ring_template.h
@@ -0,0 +1,330 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#ifndef _RTE_RING_TEMPLATE_H_
+#define _RTE_RING_TEMPLATE_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+#include <rte_ring.h>
+
+/* Ring API suffix name - used to append to API names */
+#ifndef RTE_RING_TMPLT_API_SUFFIX
+#error RTE_RING_TMPLT_API_SUFFIX not defined
+#endif
+
+/* Ring's element size in bytes, should be a power of 2 */
+#ifndef RTE_RING_TMPLT_ELEM_SIZE
+#error RTE_RING_TMPLT_ELEM_SIZE not defined
+#endif
+
+/* Type of ring elements */
+#ifndef RTE_RING_TMPLT_ELEM_TYPE
+#error RTE_RING_TMPLT_ELEM_TYPE not defined
+#endif
+
+#define _rte_fuse(a, b) a##_##b
+#define __rte_fuse(a, b) _rte_fuse(a, b)
+#define __RTE_RING_CONCAT(a) __rte_fuse(a, RTE_RING_TMPLT_API_SUFFIX)
+
+/* Calculate the memory size needed for a ring */
+RTE_RING_TMPLT_EXPERIMENTAL
+ssize_t __RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count);
+
+/* Create a new ring named *name* in memory. */
+RTE_RING_TMPLT_EXPERIMENTAL
+struct rte_ring *
+__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
+					int socket_id, unsigned flags);
+
+/**
+ * @internal Enqueue several objects on the ring
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(__rte_ring_do_enqueue)(struct rte_ring *r,
+		RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
+		unsigned int *free_space)
+{
+	uint32_t prod_head, prod_next;
+	uint32_t free_entries;
+
+	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
+			&prod_head, &prod_next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n,
+		RTE_RING_TMPLT_ELEM_TYPE);
+
+	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal Dequeue several objects from the ring
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(__rte_ring_do_dequeue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	enum rte_ring_queue_behavior behavior, unsigned int is_sc,
+	unsigned int *available)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
+			&cons_head, &cons_next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n,
+		RTE_RING_TMPLT_ELEM_TYPE);
+
+	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
+	unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
+	unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_enqueue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
+	unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
+}
+
+/**
+ * Enqueue one object on a ring (multi-producers safe).
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_mp_enqueue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE obj)
+{
+	return __RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(r, &obj, 1, NULL) ?
+			0 : -ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring (NOT multi-producers safe).
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_sp_enqueue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE obj)
+{
+	return __RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(r, &obj, 1, NULL) ?
+			0 : -ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring.
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_enqueue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj)
+{
+	return __RTE_RING_CONCAT(rte_ring_enqueue_bulk)(r, obj, 1, NULL) ?
+			0 : -ENOBUFS;
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe).
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe).
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, __IS_SC, available);
+}
+
+/**
+ * Dequeue several objects from a ring.
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_dequeue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, r->cons.single, available);
+}
+
+/**
+ * Dequeue one object from a ring (multi-consumers safe).
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_mc_dequeue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
+{
+	return __RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(r, obj_p, 1, NULL) ?
+			0 : -ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring (NOT multi-consumers safe).
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_sc_dequeue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
+{
+	return __RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(r, obj_p, 1, NULL) ?
+			0 : -ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring.
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_dequeue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
+{
+	return __RTE_RING_CONCAT(rte_ring_dequeue_bulk)(r, obj_p, 1, NULL) ?
+			0 : -ENOENT;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_mp_enqueue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
+			 unsigned int n, unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_sp_enqueue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
+			 unsigned int n, unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_enqueue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe). When the request
+ * objects are more than the available objects, only dequeue the actual number
+ * of objects
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_mc_dequeue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe). When the
+ * request objects are more than the available objects, only dequeue the
+ * actual number of objects
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_sc_dequeue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+}
+
+/**
+ * Dequeue multiple objects from a ring up to a maximum number.
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_dequeue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+				RTE_RING_QUEUE_VARIABLE,
+				r->cons.single, available);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_RING_TEMPLATE_H_ */
-- 
2.17.1



* [dpdk-dev] [PATCH 3/5] tools/checkpatch: relax constraints on __rte_experimental
  2019-08-28 14:46 [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Honnappa Nagarahalli
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 1/5] lib/ring: apis to support configurable " Honnappa Nagarahalli
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes Honnappa Nagarahalli
@ 2019-08-28 14:46 ` Honnappa Nagarahalli
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 4/5] lib/ring: add ring APIs to support 32b ring elements Honnappa Nagarahalli
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-08-28 14:46 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, gavin.hu, ruifeng.wang, nd

Relax the constraints on __rte_experimental usage to allow it to appear
in macro definitions (for ex: '#define XYZ __rte_experimental').

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 devtools/checkpatches.sh | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/devtools/checkpatches.sh b/devtools/checkpatches.sh
index 560e6ce93..090c9b08a 100755
--- a/devtools/checkpatches.sh
+++ b/devtools/checkpatches.sh
@@ -99,9 +99,14 @@ check_experimental_tags() { # <patch>
 			ret = 1;
 		}
 		if ($1 != "+__rte_experimental" || $2 != "") {
-			print "__rte_experimental must appear alone on the line" \
-				" immediately preceding the return type of a function."
-			ret = 1;
+			# code such as "#define XYZ __rte_experimental" is
+			# allowed
+			if ($1 != "+#define") {
+				print "__rte_experimental must appear alone " \
+				      "on the line immediately preceding the " \
+				      "return type of a function."
+				ret = 1;
+			}
 		}
 	}
 	END {
-- 
2.17.1



* [dpdk-dev] [PATCH 4/5] lib/ring: add ring APIs to support 32b ring elements
  2019-08-28 14:46 [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Honnappa Nagarahalli
                   ` (2 preceding siblings ...)
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 3/5] tools/checkpatch: relax constraints on __rte_experimental Honnappa Nagarahalli
@ 2019-08-28 14:46 ` Honnappa Nagarahalli
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 5/5] lib/hash: use ring with 32b element size to save memory Honnappa Nagarahalli
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-08-28 14:46 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, gavin.hu, ruifeng.wang, nd

Add ring APIs to support 32b ring elements using templates.
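
To show the behaviour a caller of the 32b APIs can expect (power-of-2
count, usable size of count-1, -ENOBUFS on a full ring, -ENOENT on an empty
one), here is a simplified single-threaded stand-in. It is not the DPDK
implementation: there are no atomics and no multi-producer/consumer
support, and every sketch_* / SKETCH_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_COUNT 8u /* must be a power of 2 */

static_assert((SKETCH_COUNT & (SKETCH_COUNT - 1)) == 0,
	"ring count must be a power of 2");

/* Single-threaded ring of 32-bit elements; indexes are free-running
 * and masked on access. */
struct sketch_ring32 {
	uint32_t prod;                /* next slot to write */
	uint32_t cons;                /* next slot to read */
	uint32_t slots[SKETCH_COUNT]; /* 4 bytes per element */
};

/* Enqueue one element; at most SKETCH_COUNT - 1 can be stored. */
static int
sketch_enqueue32(struct sketch_ring32 *r, uint32_t obj)
{
	uint32_t used = r->prod - r->cons;

	if (used >= SKETCH_COUNT - 1)
		return -ENOBUFS;
	r->slots[r->prod & (SKETCH_COUNT - 1)] = obj;
	r->prod++;
	return 0;
}

/* Dequeue one element in FIFO order. */
static int
sketch_dequeue32(struct sketch_ring32 *r, uint32_t *obj)
{
	if (r->prod == r->cons)
		return -ENOENT;
	*obj = r->slots[r->cons & (SKETCH_COUNT - 1)];
	r->cons++;
	return 0;
}

/* Fill an empty ring, then drain it. Returns the number of elements
 * accepted, or -1 if they did not come back in FIFO order. */
static int
sketch_fill_and_drain(void)
{
	struct sketch_ring32 r = { 0 };
	uint32_t v;
	int n = 0, i;

	while (sketch_enqueue32(&r, 100u + (uint32_t)n) == 0)
		n++;
	for (i = 0; i < n; i++)
		if (sketch_dequeue32(&r, &v) != 0 || v != 100u + (uint32_t)i)
			return -1;
	/* The ring is empty again, so another dequeue must fail. */
	return sketch_dequeue32(&r, &v) == -ENOENT ? n : -1;
}
```

With a count of 8 the sketch accepts 7 elements before reporting -ENOBUFS.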

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_ring/Makefile             |  3 ++-
 lib/librte_ring/meson.build          |  4 +++-
 lib/librte_ring/rte_ring_32.c        | 19 +++++++++++++++
 lib/librte_ring/rte_ring_32.h        | 36 ++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |  2 ++
 5 files changed, 62 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_32.c
 create mode 100644 lib/librte_ring/rte_ring_32.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 818898110..3102bb64d 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -14,10 +14,11 @@ EXPORT_MAP := rte_ring_version.map
 LIBABIVER := 2
 
 # all source are stored in SRCS-y
-SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
+SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c rte_ring_32.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+					rte_ring_32.h \
 					rte_ring_generic.h \
 					rte_ring_c11_mem.h \
 					rte_ring_template.h \
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index e4e208a7c..81ea53ed7 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -2,8 +2,10 @@
 # Copyright(c) 2017 Intel Corporation
 
 version = 2
-sources = files('rte_ring.c')
+sources = files('rte_ring.c',
+		'rte_ring_32.c')
 headers = files('rte_ring.h',
+		'rte_ring_32.h',
 		'rte_ring_c11_mem.h',
 		'rte_ring_generic.h',
 		'rte_ring_template.h',
diff --git a/lib/librte_ring/rte_ring_32.c b/lib/librte_ring/rte_ring_32.c
new file mode 100644
index 000000000..09e90cec1
--- /dev/null
+++ b/lib/librte_ring/rte_ring_32.c
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#include <rte_ring_32.h>
+#include <rte_ring_template.c>
diff --git a/lib/librte_ring/rte_ring_32.h b/lib/librte_ring/rte_ring_32.h
new file mode 100644
index 000000000..5270a9bc7
--- /dev/null
+++ b/lib/librte_ring/rte_ring_32.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#ifndef _RTE_RING_32_H_
+#define _RTE_RING_32_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#define RTE_RING_TMPLT_API_SUFFIX 32
+#define RTE_RING_TMPLT_ELEM_SIZE sizeof(uint32_t)
+#define RTE_RING_TMPLT_ELEM_TYPE uint32_t
+#define RTE_RING_TMPLT_EXPERIMENTAL __rte_experimental
+
+#include <rte_ring_template.h>
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_RING_32_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index e410a7503..9efba91bb 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -21,7 +21,9 @@ DPDK_2.2 {
 EXPERIMENTAL {
 	global:
 
+	rte_ring_create_32;
 	rte_ring_create_elem;
+	rte_ring_get_memsize_32;
 	rte_ring_get_memsize_elem;
 	rte_ring_reset;
 
-- 
2.17.1



* [dpdk-dev] [PATCH 5/5] lib/hash: use ring with 32b element size to save memory
  2019-08-28 14:46 [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Honnappa Nagarahalli
                   ` (3 preceding siblings ...)
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 4/5] lib/ring: add ring APIs to support 32b ring elements Honnappa Nagarahalli
@ 2019-08-28 14:46 ` Honnappa Nagarahalli
  2019-08-28 15:12 ` [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Jerin Jacob Kollanukkaran
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-08-28 14:46 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, gavin.hu, ruifeng.wang, nd

The freelist and external bucket indices are 32b. Storing them in rings
with 32b elements instead of pointer-sized elements saves memory.
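
As a rough illustration of the saving, the sketch below compares the slot
storage of a ring holding 64-bit entries with one holding uint32_t entries.
It assumes a 64-bit target where 'void *' is 8 bytes and it ignores the
ring header, which is the same size in both cases. The helper names are
hypothetical and are not part of rte_ring or rte_hash.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes used by the slot array alone, for 'count' slots of 'esize' bytes. */
static inline size_t
slot_array_bytes(uint32_t count, size_t esize)
{
	return (size_t)count * esize;
}

/* Bytes saved by storing uint32_t indices instead of 64-bit pointers. */
static inline size_t
bytes_saved_with_32b(uint32_t count)
{
	return slot_array_bytes(count, sizeof(uint64_t)) -
	       slot_array_bytes(count, sizeof(uint32_t));
}
```

Under these assumptions, a free-slot ring with 1M entries needs 4 MiB of
slot storage instead of 8 MiB. The per-lcore cache array shrinks by the
same factor once its entries become uint32_t.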

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_hash/rte_cuckoo_hash.c | 55 ++++++++++++++-----------------
 lib/librte_hash/rte_cuckoo_hash.h |  2 +-
 2 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 87a4c01f2..a0cd3360a 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -24,7 +24,7 @@
 #include <rte_cpuflags.h>
 #include <rte_rwlock.h>
 #include <rte_spinlock.h>
-#include <rte_ring.h>
+#include <rte_ring_32.h>
 #include <rte_compat.h>
 #include <rte_vect.h>
 #include <rte_tailq.h>
@@ -213,7 +213,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 
 	snprintf(ring_name, sizeof(ring_name), "HT_%s", params->name);
 	/* Create ring (Dummy slot index is not enqueued) */
-	r = rte_ring_create(ring_name, rte_align32pow2(num_key_slots),
+	r = rte_ring_create_32(ring_name, rte_align32pow2(num_key_slots),
 			params->socket_id, 0);
 	if (r == NULL) {
 		RTE_LOG(ERR, HASH, "memory allocation failed\n");
@@ -227,7 +227,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	if (ext_table_support) {
 		snprintf(ext_ring_name, sizeof(ext_ring_name), "HT_EXT_%s",
 								params->name);
-		r_ext = rte_ring_create(ext_ring_name,
+		r_ext = rte_ring_create_32(ext_ring_name,
 				rte_align32pow2(num_buckets + 1),
 				params->socket_id, 0);
 
@@ -295,7 +295,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 		 * for next bucket
 		 */
 		for (i = 1; i <= num_buckets; i++)
-			rte_ring_sp_enqueue(r_ext, (void *)((uintptr_t) i));
+			rte_ring_sp_enqueue_32(r_ext, i);
 
 		if (readwrite_concur_lf_support) {
 			ext_bkt_to_free = rte_zmalloc(NULL, sizeof(uint32_t) *
@@ -434,7 +434,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 
 	/* Populate free slots ring. Entry zero is reserved for key misses. */
 	for (i = 1; i < num_key_slots; i++)
-		rte_ring_sp_enqueue(r, (void *)((uintptr_t) i));
+		rte_ring_sp_enqueue_32(r, i);
 
 	te->data = (void *) h;
 	TAILQ_INSERT_TAIL(hash_list, te, next);
@@ -598,13 +598,12 @@ rte_hash_reset(struct rte_hash *h)
 		tot_ring_cnt = h->entries;
 
 	for (i = 1; i < tot_ring_cnt + 1; i++)
-		rte_ring_sp_enqueue(h->free_slots, (void *)((uintptr_t) i));
+		rte_ring_sp_enqueue_32(h->free_slots, i);
 
 	/* Repopulate the free ext bkt ring. */
 	if (h->ext_table_support) {
 		for (i = 1; i <= h->num_buckets; i++)
-			rte_ring_sp_enqueue(h->free_ext_bkts,
-						(void *)((uintptr_t) i));
+			rte_ring_sp_enqueue_32(h->free_ext_bkts, i);
 	}
 
 	if (h->use_local_cache) {
@@ -623,13 +622,13 @@ rte_hash_reset(struct rte_hash *h)
 static inline void
 enqueue_slot_back(const struct rte_hash *h,
 		struct lcore_cache *cached_free_slots,
-		void *slot_id)
+		uint32_t slot_id)
 {
 	if (h->use_local_cache) {
 		cached_free_slots->objs[cached_free_slots->len] = slot_id;
 		cached_free_slots->len++;
 	} else
-		rte_ring_sp_enqueue(h->free_slots, slot_id);
+		rte_ring_sp_enqueue_32(h->free_slots, slot_id);
 }
 
 /* Search a key from bucket and update its data.
@@ -923,8 +922,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 	uint32_t prim_bucket_idx, sec_bucket_idx;
 	struct rte_hash_bucket *prim_bkt, *sec_bkt, *cur_bkt;
 	struct rte_hash_key *new_k, *keys = h->key_store;
-	void *slot_id = NULL;
-	void *ext_bkt_id = NULL;
+	uint32_t slot_id = 0;
+	uint32_t ext_bkt_id = 0;
 	uint32_t new_idx, bkt_id;
 	int ret;
 	unsigned n_slots;
@@ -968,7 +967,7 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		/* Try to get a free slot from the local cache */
 		if (cached_free_slots->len == 0) {
 			/* Need to get another burst of free slots from global ring */
-			n_slots = rte_ring_mc_dequeue_burst(h->free_slots,
+			n_slots = rte_ring_mc_dequeue_burst_32(h->free_slots,
 					cached_free_slots->objs,
 					LCORE_CACHE_SIZE, NULL);
 			if (n_slots == 0) {
@@ -982,13 +981,12 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		cached_free_slots->len--;
 		slot_id = cached_free_slots->objs[cached_free_slots->len];
 	} else {
-		if (rte_ring_sc_dequeue(h->free_slots, &slot_id) != 0) {
+		if (rte_ring_sc_dequeue_32(h->free_slots, &slot_id) != 0)
 			return -ENOSPC;
-		}
 	}
 
-	new_k = RTE_PTR_ADD(keys, (uintptr_t)slot_id * h->key_entry_size);
-	new_idx = (uint32_t)((uintptr_t) slot_id);
+	new_k = RTE_PTR_ADD(keys, slot_id * h->key_entry_size);
+	new_idx = slot_id;
 	/* The store to application data (by the application) at *data should
 	 * not leak after the store of pdata in the key store. i.e. pdata is
 	 * the guard variable. Release the application data to the readers.
@@ -1078,12 +1076,12 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 	/* Failed to get an empty entry from extendable buckets. Link a new
 	 * extendable bucket. We first get a free bucket from ring.
 	 */
-	if (rte_ring_sc_dequeue(h->free_ext_bkts, &ext_bkt_id) != 0) {
+	if (rte_ring_sc_dequeue_32(h->free_ext_bkts, &ext_bkt_id) != 0) {
 		ret = -ENOSPC;
 		goto failure;
 	}
 
-	bkt_id = (uint32_t)((uintptr_t)ext_bkt_id) - 1;
+	bkt_id = ext_bkt_id - 1;
 	/* Use the first location of the new bucket */
 	(h->buckets_ext[bkt_id]).sig_current[0] = short_sig;
 	/* Store to signature and key should not leak after
@@ -1373,7 +1371,7 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 		/* Cache full, need to free it. */
 		if (cached_free_slots->len == LCORE_CACHE_SIZE) {
 			/* Need to enqueue the free slots in global ring. */
-			n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
+			n_slots = rte_ring_mp_enqueue_burst_32(h->free_slots,
 						cached_free_slots->objs,
 						LCORE_CACHE_SIZE, NULL);
 			ERR_IF_TRUE((n_slots == 0),
@@ -1383,11 +1381,10 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 		}
 		/* Put index of new free slot in cache. */
 		cached_free_slots->objs[cached_free_slots->len] =
-				(void *)((uintptr_t)bkt->key_idx[i]);
+				bkt->key_idx[i];
 		cached_free_slots->len++;
 	} else {
-		rte_ring_sp_enqueue(h->free_slots,
-				(void *)((uintptr_t)bkt->key_idx[i]));
+		rte_ring_sp_enqueue_32(h->free_slots, bkt->key_idx[i]);
 	}
 }
 
@@ -1551,7 +1548,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 			 */
 			h->ext_bkt_to_free[ret] = index;
 		else
-			rte_ring_sp_enqueue(h->free_ext_bkts, (void *)(uintptr_t)index);
+			rte_ring_sp_enqueue_32(h->free_ext_bkts, index);
 	}
 	__hash_rw_writer_unlock(h);
 	return ret;
@@ -1614,7 +1611,7 @@ rte_hash_free_key_with_position(const struct rte_hash *h,
 		uint32_t index = h->ext_bkt_to_free[position];
 		if (index) {
 			/* Recycle empty ext bkt to free list. */
-			rte_ring_sp_enqueue(h->free_ext_bkts, (void *)(uintptr_t)index);
+			rte_ring_sp_enqueue_32(h->free_ext_bkts, index);
 			h->ext_bkt_to_free[position] = 0;
 		}
 	}
@@ -1625,19 +1622,17 @@ rte_hash_free_key_with_position(const struct rte_hash *h,
 		/* Cache full, need to free it. */
 		if (cached_free_slots->len == LCORE_CACHE_SIZE) {
 			/* Need to enqueue the free slots in global ring. */
-			n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
+			n_slots = rte_ring_mp_enqueue_burst_32(h->free_slots,
 						cached_free_slots->objs,
 						LCORE_CACHE_SIZE, NULL);
 			RETURN_IF_TRUE((n_slots == 0), -EFAULT);
 			cached_free_slots->len -= n_slots;
 		}
 		/* Put index of new free slot in cache. */
-		cached_free_slots->objs[cached_free_slots->len] =
-					(void *)((uintptr_t)key_idx);
+		cached_free_slots->objs[cached_free_slots->len] = key_idx;
 		cached_free_slots->len++;
 	} else {
-		rte_ring_sp_enqueue(h->free_slots,
-				(void *)((uintptr_t)key_idx));
+		rte_ring_sp_enqueue_32(h->free_slots, key_idx);
 	}
 
 	return 0;
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index fb19bb27d..345de6bf9 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -124,7 +124,7 @@ const rte_hash_cmp_eq_t cmp_jump_table[NUM_KEY_CMP_CASES] = {
 
 struct lcore_cache {
 	unsigned len; /**< Cache len */
-	void *objs[LCORE_CACHE_SIZE]; /**< Cache objects */
+	uint32_t objs[LCORE_CACHE_SIZE]; /**< Cache objects */
 } __rte_cache_aligned;
 
 /* Structure that stores key-value pair */
-- 
2.17.1



* Re: [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size
  2019-08-28 14:46 [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Honnappa Nagarahalli
                   ` (4 preceding siblings ...)
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 5/5] lib/hash: use ring with 32b element size to save memory Honnappa Nagarahalli
@ 2019-08-28 15:12 ` Jerin Jacob Kollanukkaran
  2019-08-28 15:16 ` Pavan Nikhilesh Bhagavatula
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
  7 siblings, 0 replies; 173+ messages in thread
From: Jerin Jacob Kollanukkaran @ 2019-08-28 15:12 UTC (permalink / raw)
  To: Honnappa Nagarahalli, olivier.matz, yipeng1.wang, sameh.gobriel,
	bruce.richardson, pablo.de.lara.guarch
  Cc: dev, dharmik.thakkar, gavin.hu, ruifeng.wang, nd

> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Honnappa Nagarahalli
> Sent: Wednesday, August 28, 2019 8:16 PM
> To: olivier.matz@6wind.com; yipeng1.wang@intel.com;
> sameh.gobriel@intel.com; bruce.richardson@intel.com;
> pablo.de.lara.guarch@intel.com; honnappa.nagarahalli@arm.com
> Cc: dev@dpdk.org; dharmik.thakkar@arm.com; gavin.hu@arm.com;
> ruifeng.wang@arm.com; nd@arm.com
> Subject: [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element
> size
> 
> The current rte_ring hard-codes the type of the ring element to 'void *', hence
> the size of the element is hard-coded to 32b/64b. Since the ring element type is
> not an input to rte_ring APIs, it results in a couple of issues:
> 
> 1) If an application requires to store an element which is not 64b, it
>    needs to write its own ring APIs similar to rte_event_ring APIs. This
>    creates additional burden on the programmers, who simply end up making
>    work-arounds and often waste memory.

If we are taking this path, could you change the rte_event_ring implementation
to be based on the new framework?

> 2) If there are multiple libraries that store elements of the same
>    type, currently they would have to write their own rte_ring APIs. This
>    results in code duplication.
> 
> This patch consists of 4 parts:
> 1) New APIs to support configurable ring element size
>    These will help reduce code duplication in the templates. I think these
>    can be made internal (do not expose to DPDK applications, but expose to
>    DPDK libraries), feedback needed.
> 
> 2) rte_ring templates
>    The templates provide an easy way to add new APIs for different ring
>    element types/sizes which can be used by multiple libraries. These
>    also allow for creating APIs to store elements of custom types
>    (for ex: a structure)
> 
>    The template needs 4 parameters:
>    a) RTE_RING_TMPLT_API_SUFFIX - This is used as a suffix to the
>       rte_ring APIs.
>       For ex: if RTE_RING_TMPLT_API_SUFFIX is '32b', the API name will be
>       rte_ring_create_32b
>    b) RTE_RING_TMPLT_ELEM_SIZE - Size of the ring element in bytes.
>       For ex: sizeof(uint32_t)
>    c) RTE_RING_TMPLT_ELEM_TYPE - Type of the ring element.
>       For ex: uint32_t. If a common ring library does not use a standard
>       data type, it should create its own type by defining a structure
>       with standard data type. For ex: for an element size of 96b, one
>       could define a structure
> 
>       struct s_96b {
>           uint32_t a[3];
>       };
>       The common library can use this structure to define
>       RTE_RING_TMPLT_ELEM_TYPE.
> 
>       The application using this common ring library should define its
>       element type as a union with the above structure.
> 
>       union app_element_type {
>           struct s_96b v;
>           struct app_element {
>               uint16_t a;
>               uint16_t b;
>               uint32_t c;
>               uint32_t d;
>           } e;
>       };
>    d) RTE_RING_TMPLT_EXPERIMENTAL - Indicates if the new APIs being defined
>       are experimental. Should be set to empty to remove the experimental
>       tag.
> 
>    The ring library consists of some APIs that are defined as inline
>    functions and some APIs that are non-inline functions. The non-inline
>    functions are in rte_ring_template.c. However, this file needs to be
>    included in other .c files. Any feedback on how to handle this is
>    appreciated.
> 
>    Note that the templates help create the APIs that are dependent on the
>    element size (for ex: rte_ring_create, enqueue/dequeue etc). Other APIs
>    that do NOT depend on the element size do not need to be part of the
>    template (for ex: rte_ring_dump, rte_ring_count, rte_ring_free_count
>    etc).
> 
> 3) APIs for 32b ring element size
>    This uses the templates to create APIs to enqueue/dequeue elements of
>    size 32b.
> 
> 4) rte_hash library is changed to use 32b ring APIs
>    The 32b APIs are used in rte_hash library to store the free slot index
>    and free bucket index.
> 
> This patch results in following checkpatch issue:
> WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'
> 
> The patch is following the rules in the existing code. Please let me know if this
> needs to be fixed.
> 
> Honnappa Nagarahalli (5):
>   lib/ring: apis to support configurable element size
>   lib/ring: add template to support different element sizes
>   tools/checkpatch: relax constraints on __rte_experimental
>   lib/ring: add ring APIs to support 32b ring elements
>   lib/hash: use ring with 32b element size to save memory
> 
>  devtools/checkpatches.sh             |  11 +-
>  lib/librte_hash/rte_cuckoo_hash.c    |  55 ++---
>  lib/librte_hash/rte_cuckoo_hash.h    |   2 +-
>  lib/librte_ring/Makefile             |   9 +-
>  lib/librte_ring/meson.build          |  11 +-
>  lib/librte_ring/rte_ring.c           |  34 ++-
>  lib/librte_ring/rte_ring.h           |  72 ++++++
>  lib/librte_ring/rte_ring_32.c        |  19 ++
>  lib/librte_ring/rte_ring_32.h        |  36 +++
>  lib/librte_ring/rte_ring_template.c  |  46 ++++
>  lib/librte_ring/rte_ring_template.h  | 330 +++++++++++++++++++++++++++
>  lib/librte_ring/rte_ring_version.map |   4 +
>  12 files changed, 582 insertions(+), 47 deletions(-)
>  create mode 100644 lib/librte_ring/rte_ring_32.c
>  create mode 100644 lib/librte_ring/rte_ring_32.h
>  create mode 100644 lib/librte_ring/rte_ring_template.c
>  create mode 100644 lib/librte_ring/rte_ring_template.h
> 
> --
> 2.17.1



* Re: [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size
  2019-08-28 14:46 [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Honnappa Nagarahalli
                   ` (5 preceding siblings ...)
  2019-08-28 15:12 ` [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Jerin Jacob Kollanukkaran
@ 2019-08-28 15:16 ` Pavan Nikhilesh Bhagavatula
  2019-08-28 22:59   ` Honnappa Nagarahalli
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
  7 siblings, 1 reply; 173+ messages in thread
From: Pavan Nikhilesh Bhagavatula @ 2019-08-28 15:16 UTC (permalink / raw)
  To: Honnappa Nagarahalli, olivier.matz, yipeng1.wang, sameh.gobriel,
	bruce.richardson, pablo.de.lara.guarch
  Cc: dev, dharmik.thakkar, gavin.hu, ruifeng.wang, nd,
	Jerin Jacob Kollanukkaran

Hi Honnappa, 

Great idea. I think we can replace the duplicated implementation in
lib/librte_eventdev/rte_event_ring.h, which uses an element size of 16B.
There are already a couple of SW event device drivers using event_ring.

Pavan.

>-----Original Message-----
>From: dev <dev-bounces@dpdk.org> On Behalf Of Honnappa
>Nagarahalli
>Sent: Wednesday, August 28, 2019 8:16 PM
>To: olivier.matz@6wind.com; yipeng1.wang@intel.com;
>sameh.gobriel@intel.com; bruce.richardson@intel.com;
>pablo.de.lara.guarch@intel.com; honnappa.nagarahalli@arm.com
>Cc: dev@dpdk.org; dharmik.thakkar@arm.com; gavin.hu@arm.com;
>ruifeng.wang@arm.com; nd@arm.com
>Subject: [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom
>element size
>



* Re: [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size
  2019-08-28 15:16 ` Pavan Nikhilesh Bhagavatula
@ 2019-08-28 22:59   ` Honnappa Nagarahalli
  0 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-08-28 22:59 UTC (permalink / raw)
  To: Pavan Nikhilesh Bhagavatula, olivier.matz, yipeng1.wang,
	sameh.gobriel, bruce.richardson, pablo.de.lara.guarch
  Cc: dev, Dharmik Thakkar, Gavin Hu (Arm Technology China),
	Ruifeng Wang (Arm Technology China),
	nd, jerinj, Honnappa Nagarahalli, nd

<snip>

> Subject: RE: [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom
> element size
> 
> Hi Honnappa,
> 
> Great idea. I think we can replace the duplicated implementation in
> lib/librte_eventdev/rte_event_ring.h, which uses an element size of 16B.
> There are already a couple of SW event device drivers using event_ring.
Thank you Pavan. I will take a look and get back.

> 
> Pavan.
> 
> >-----Original Message-----
> >From: dev <dev-bounces@dpdk.org> On Behalf Of Honnappa Nagarahalli
> >Sent: Wednesday, August 28, 2019 8:16 PM
> >To: olivier.matz@6wind.com; yipeng1.wang@intel.com;
> >sameh.gobriel@intel.com; bruce.richardson@intel.com;
> >pablo.de.lara.guarch@intel.com; honnappa.nagarahalli@arm.com
> >Cc: dev@dpdk.org; dharmik.thakkar@arm.com; gavin.hu@arm.com;
> >ruifeng.wang@arm.com; nd@arm.com
> >Subject: [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom
> >element size
> >



* [dpdk-dev] [PATCH v2 0/6] lib/ring: templates to support custom element size
  2019-08-28 14:46 [dpdk-dev] [PATCH 0/5] lib/ring: templates to support custom element size Honnappa Nagarahalli
                   ` (6 preceding siblings ...)
  2019-08-28 15:16 ` Pavan Nikhilesh Bhagavatula
@ 2019-09-06 19:05 ` Honnappa Nagarahalli
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 1/6] lib/ring: apis to support configurable " Honnappa Nagarahalli
                     ` (15 more replies)
  7 siblings, 16 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-09-06 19:05 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch
  Cc: dev, pbhagavatula, jerinj, Honnappa Nagarahalli

The current rte_ring hard-codes the type of the ring element to 'void *',
hence the size of the element is hard-coded to 32b/64b. Since the ring
element type is not an input to the rte_ring APIs, this results in a couple
of issues:

1) If an application needs to store an element that is not 64b, it
   has to write its own ring APIs, similar to the rte_event_ring APIs.
   This places an additional burden on programmers, who end up making
   work-arounds and often waste memory.
2) If multiple libraries store elements of the same type, each of them
   currently has to write its own rte_ring APIs. This results in code
   duplication.

This patch set consists of several parts:
1) New APIs to support configurable ring element size
   These will help reduce code duplication in the templates. I think these
   can be made internal (do not expose to DPDK applications, but expose to
   DPDK libraries), feedback needed.

2) rte_ring templates
   The templates provide an easy way to add new APIs for different ring
   element types/sizes which can be used by multiple libraries. These
   also allow for creating APIs to store elements of custom types
   (for ex: a structure)

   The template needs 4 parameters:
   a) RTE_RING_TMPLT_API_SUFFIX - This is used as a suffix to the
      rte_ring APIs.
      For ex: if RTE_RING_TMPLT_API_SUFFIX is '32b', the API name will be
      rte_ring_create_32b
   b) RTE_RING_TMPLT_ELEM_SIZE - Size of the ring element in bytes.
      For ex: sizeof(uint32_t)
   c) RTE_RING_TMPLT_ELEM_TYPE - Type of the ring element.
      For ex: uint32_t. If a common ring library does not use a standard
      data type, it should create its own type by defining a structure
      with standard data type. For ex: for an element size of 96b, one
      could define a structure

      struct s_96b {
          uint32_t a[3];
      };
      The common library can use this structure to define
      RTE_RING_TMPLT_ELEM_TYPE.

      The application using this common ring library should define its
      element type as a union with the above structure.

      union app_element_type {
          struct s_96b v;
          struct app_element {
              uint16_t a;
              uint16_t b;
              uint32_t c;
              uint32_t d;
          } e;
      };
   d) RTE_RING_TMPLT_EXPERIMENTAL - Indicates if the new APIs being defined
      are experimental. Should be set to empty to remove the experimental
      tag.

   The ring library consists of some APIs that are defined as inline
   functions and some APIs that are non-inline functions. The non-inline
   functions are in rte_ring_template.c. However, this file needs to be
   included in other .c files. Any feedback on how to handle this is
   appreciated.

   Note that the templates help create the APIs that are dependent on the
   element size (for ex: rte_ring_create, enqueue/dequeue etc). Other APIs
   that do NOT depend on the element size do not need to be part of the
   template (for ex: rte_ring_dump, rte_ring_count, rte_ring_free_count
   etc).

3) APIs for 32b ring element size
   This uses the templates to create APIs to enqueue/dequeue elements of
   size 32b.

4) rte_hash library is changed to use 32b ring APIs
   The 32b APIs are used in rte_hash library to store the free slot index
   and free bucket index.

5) Event Dev changed to use ring templates
   Event Dev defines its own 128b ring APIs using the templates. This
   keeps 'struct rte_event' as is. If Event Dev had to use generic 128b
   ring APIs, 'struct rte_event' would have to change to 'union rte_event'
   to include a generic data type such as '__int128_t'. That would break
   API compatibility and result in a large number of changes.
   With this change, the event rings are stored on rte_ring's tailq.
   An Event Dev specific ring list is NOT available. IMO, this does not
   have any impact on the user.
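
To make the 96b example from 2c) above concrete, here is a small
standalone sketch of the union overlay. The one-slot toy_ring_96b and its
enqueue/dequeue helpers are hypothetical stand-ins for whatever a common
96b ring library would generate from the templates, and the member name
'e' is chosen purely for illustration. The point is only that the
application fills in its own view and hands the generic view to the ring.

```c
#include <stdint.h>

/* Generic 96-bit element a (hypothetical) common ring library would use. */
struct s_96b {
	uint32_t a[3];
};

/* Application view of the same 96 bits. */
struct app_element {
	uint16_t a;
	uint16_t b;
	uint32_t c;
	uint32_t d;
};

union app_element_type {
	struct s_96b v;       /* view passed to the ring APIs */
	struct app_element e; /* view the application reads and writes */
};

/* Both views must cover exactly the same bytes (C11 static assertion). */
_Static_assert(sizeof(struct app_element) == sizeof(struct s_96b),
	       "application element must be exactly 96 bits");

/* One-slot stand-in for a ring of struct s_96b elements. */
struct toy_ring_96b {
	struct s_96b slot;
};

static inline void
toy_enqueue_96b(struct toy_ring_96b *r, const struct s_96b *v)
{
	r->slot = *v; /* copy the element by value into the ring */
}

static inline void
toy_dequeue_96b(const struct toy_ring_96b *r, struct s_96b *v)
{
	*v = r->slot; /* copy the element by value out of the ring */
}
```

An application would write its fields through the 'e' member, pass the
address of the 'v' member to the enqueue call, and read 'e' back after
dequeueing into 'v'.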

This patch set results in the following checkpatch issue:
WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'

However, the patches follow the conventions of the existing code. Please
let me know if this needs to be fixed.

v2
 - Change Event Ring implementation to use ring templates
   (Jerin, Pavan)

Honnappa Nagarahalli (6):
  lib/ring: apis to support configurable element size
  lib/ring: add template to support different element sizes
  tools/checkpatch: relax constraints on __rte_experimental
  lib/ring: add ring APIs to support 32b ring elements
  lib/hash: use ring with 32b element size to save memory
  lib/eventdev: use ring templates for event rings

 devtools/checkpatches.sh                  |  11 +-
 lib/librte_eventdev/Makefile              |   2 +
 lib/librte_eventdev/meson.build           |   2 +
 lib/librte_eventdev/rte_event_ring.c      | 146 +---------
 lib/librte_eventdev/rte_event_ring.h      |  41 +--
 lib/librte_eventdev/rte_event_ring_128b.c |  19 ++
 lib/librte_eventdev/rte_event_ring_128b.h |  44 +++
 lib/librte_hash/rte_cuckoo_hash.c         |  55 ++--
 lib/librte_hash/rte_cuckoo_hash.h         |   2 +-
 lib/librte_ring/Makefile                  |   9 +-
 lib/librte_ring/meson.build               |  11 +-
 lib/librte_ring/rte_ring.c                |  34 ++-
 lib/librte_ring/rte_ring.h                |  72 +++++
 lib/librte_ring/rte_ring_32.c             |  19 ++
 lib/librte_ring/rte_ring_32.h             |  36 +++
 lib/librte_ring/rte_ring_template.c       |  46 +++
 lib/librte_ring/rte_ring_template.h       | 330 ++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map      |   4 +
 18 files changed, 660 insertions(+), 223 deletions(-)
 create mode 100644 lib/librte_eventdev/rte_event_ring_128b.c
 create mode 100644 lib/librte_eventdev/rte_event_ring_128b.h
 create mode 100644 lib/librte_ring/rte_ring_32.c
 create mode 100644 lib/librte_ring/rte_ring_32.h
 create mode 100644 lib/librte_ring/rte_ring_template.c
 create mode 100644 lib/librte_ring/rte_ring_template.h

-- 
2.17.1



* [dpdk-dev] [PATCH v2 1/6] lib/ring: apis to support configurable element size
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
@ 2019-09-06 19:05   ` Honnappa Nagarahalli
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 2/6] lib/ring: add template to support different element sizes Honnappa Nagarahalli
                     ` (14 subsequent siblings)
  15 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-09-06 19:05 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch
  Cc: dev, pbhagavatula, jerinj, Honnappa Nagarahalli

Current APIs assume ring elements to be pointers. However, in many
use cases, the element size can be different. The new APIs
rte_ring_get_memsize_elem and rte_ring_create_elem take the element
size as a parameter and help reduce code duplication while creating
rte_ring templates.
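
The size calculation can be sketched outside DPDK as follows. The header
size, cache line size and count limit below are assumed placeholder
values (TOY_* names are hypothetical), standing in for
sizeof(struct rte_ring), RTE_CACHE_LINE_SIZE and RTE_RING_SZ_MASK. Unlike
the patch, this sketch also rejects a count of zero, which is a choice
made here for clarity.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h> /* ssize_t (POSIX) */

#define TOY_CACHE_LINE    64u         /* assumed cache line size */
#define TOY_RING_HDR_SIZE 384u        /* placeholder for the ring header size */
#define TOY_RING_SZ_MASK  0x7fffffffu /* assumed upper limit on count */

/* Bytes needed for a ring of 'count' elements of 'esize' bytes each. */
static inline ssize_t
toy_ring_get_memsize_elem(unsigned int count, size_t esize)
{
	size_t sz;

	/* count must be a non-zero power of 2 within the limit */
	if (count == 0 || (count & (count - 1)) != 0 ||
	    count > TOY_RING_SZ_MASK)
		return -EINVAL;

	sz = TOY_RING_HDR_SIZE + (size_t)count * esize;
	/* round up to a multiple of the cache line size */
	sz = (sz + TOY_CACHE_LINE - 1) & ~(size_t)(TOY_CACHE_LINE - 1);
	return (ssize_t)sz;
}

/* Pointer-sized wrapper, mirroring how the legacy API is kept. */
static inline ssize_t
toy_ring_get_memsize(unsigned int count)
{
	return toy_ring_get_memsize_elem(count, sizeof(void *));
}
```

The wrapper shows the design choice in this patch: the existing
pointer-based entry point becomes a thin call into the element-size
aware one, so callers of the old API see no change.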

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_ring/Makefile             |  2 +-
 lib/librte_ring/meson.build          |  3 ++
 lib/librte_ring/rte_ring.c           | 34 +++++++++----
 lib/librte_ring/rte_ring.h           | 72 ++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |  2 +
 5 files changed, 104 insertions(+), 9 deletions(-)

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 21a36770d..4c8410229 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_ring.a
 
-CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal
 
 EXPORT_MAP := rte_ring_version.map
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index ab8b0b469..74219840a 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -6,3 +6,6 @@ sources = files('rte_ring.c')
 headers = files('rte_ring.h',
 		'rte_ring_c11_mem.h',
 		'rte_ring_generic.h')
+
+# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
+allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d9b308036..879feb9f6 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -46,23 +46,32 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_elem(unsigned count, size_t esize)
 {
 	ssize_t sz;
 
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
 		RTE_LOG(ERR, RING,
-			"Requested size is invalid, must be power of 2, and "
-			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
+			"Requested number of elements is invalid, must be "
+			"power of 2, and do not exceed the limit %u\n",
+			RTE_RING_SZ_MASK);
+
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	sz = sizeof(struct rte_ring) + count * esize;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
 
+/* return the size of memory occupied by a ring */
+ssize_t
+rte_ring_get_memsize(unsigned count)
+{
+	return rte_ring_get_memsize_elem(count, sizeof(void *));
+}
+
 void
 rte_ring_reset(struct rte_ring *r)
 {
@@ -114,10 +123,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	return 0;
 }
 
-/* create the ring */
+/* create the ring for a given element size */
 struct rte_ring *
-rte_ring_create(const char *name, unsigned count, int socket_id,
-		unsigned flags)
+rte_ring_create_elem(const char *name, unsigned count, size_t esize,
+		int socket_id, unsigned flags)
 {
 	char mz_name[RTE_MEMZONE_NAMESIZE];
 	struct rte_ring *r;
@@ -135,7 +144,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize_elem(count, esize);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -182,6 +191,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	return r;
 }
 
+/* create the ring */
+struct rte_ring *
+rte_ring_create(const char *name, unsigned count, int socket_id,
+		unsigned flags)
+{
+	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
+		flags);
+}
+
 /* free the ring */
 void
 rte_ring_free(struct rte_ring *r)
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 2a9f768a1..bbc1202d3 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -122,6 +122,29 @@ struct rte_ring {
 #define __IS_SC 1
 #define __IS_MC 0
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Calculate the memory size needed for a ring with given element size
+ *
+ * This function returns the number of bytes needed for a ring, given
+ * the number of elements in it and the size of the element. This value
+ * is the sum of the size of the structure rte_ring and the size of the
+ * memory needed for storing the elements. The value is aligned to a cache
+ * line size.
+ *
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of elements in the ring (recommended to be a power of 2).
+ * @return
+ *   - The memory size needed for the ring on success.
+ *   - -EINVAL if count is not a power of 2 or exceeds RTE_RING_SZ_MASK.
+ */
+__rte_experimental
+ssize_t rte_ring_get_memsize_elem(unsigned count, size_t esize);
+
 /**
  * Calculate the memory size needed for a ring
  *
@@ -175,6 +198,54 @@ ssize_t rte_ring_get_memsize(unsigned count);
 int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	unsigned flags);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Create a new ring named *name* that stores elements with given size.
+ *
+ * This function uses ``memzone_reserve()`` to allocate memory. Then it
+ * calls rte_ring_init() to initialize an empty ring.
+ *
+ * The new ring size is set to *count*, which must be a power of
+ * two. Water marking is disabled by default. The real usable ring size
+ * is *count-1* instead of *count* to differentiate a full ring from an
+ * empty ring.
+ *
+ * The ring is added in RTE_TAILQ_RING list.
+ *
+ * @param name
+ *   The name of the ring.
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of elements in the ring (recommended to be a power of 2).
+ * @param socket_id
+ *   The *socket_id* argument is the socket identifier in case of
+ *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
+ *   constraint for the reserved zone.
+ * @param flags
+ *   An OR of the following:
+ *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
+ *      is "single-producer". Otherwise, it is "multi-producers".
+ *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
+ *      is "single-consumer". Otherwise, it is "multi-consumers".
+ * @return
+ *   On success, the pointer to the new allocated ring. NULL on error with
+ *    rte_errno set appropriately. Possible errno values include:
+ *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
+ *    - E_RTE_SECONDARY - function was called from a secondary process instance
+ *    - EINVAL - count provided is not a power of 2
+ *    - ENOSPC - the maximum number of memzones has already been allocated
+ *    - EEXIST - a memzone with the same name already exists
+ *    - ENOMEM - no appropriate memory area found in which to create memzone
+ */
+__rte_experimental
+struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
+				size_t esize, int socket_id, unsigned flags);
+
 /**
  * Create a new ring named *name* in memory.
  *
@@ -216,6 +287,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  */
 struct rte_ring *rte_ring_create(const char *name, unsigned count,
 				 int socket_id, unsigned flags);
+
 /**
  * De-allocate all memory used by the ring.
  *
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index 510c1386e..e410a7503 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -21,6 +21,8 @@ DPDK_2.2 {
 EXPERIMENTAL {
 	global:
 
+	rte_ring_create_elem;
+	rte_ring_get_memsize_elem;
 	rte_ring_reset;
 
 };
-- 
2.17.1
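
For readers following the size calculation in
rte_ring_get_memsize_elem() above, here is a minimal standalone sketch
of the same arithmetic. It is not the DPDK code: the header size, the
cache line size and the count limit are assumed placeholder values, and
every sketch_* / SKETCH_* name is hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/types.h>	/* ssize_t */

/* Assumed values for illustration only; the real ones come from DPDK. */
#define SKETCH_CACHE_LINE 64u
#define SKETCH_HDR_SIZE   384u		/* stand-in for sizeof(struct rte_ring) */
#define SKETCH_SZ_MASK    0x7fffffffu	/* stand-in for RTE_RING_SZ_MASK */

/* The sketch treats 0 as invalid. */
static int
sketch_is_pow2(unsigned int n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Header plus count * esize bytes, rounded up to a cache line. */
static ssize_t
sketch_ring_memsize_elem(unsigned int count, size_t esize)
{
	size_t sz;

	assert(esize != 0);	/* sketch-only precondition */

	if (!sketch_is_pow2(count) || count > SKETCH_SZ_MASK)
		return -EINVAL;

	sz = SKETCH_HDR_SIZE + (size_t)count * esize;
	sz = (sz + SKETCH_CACHE_LINE - 1) & ~(size_t)(SKETCH_CACHE_LINE - 1);
	return (ssize_t)sz;
}
```

With these assumed constants, 1024 slots of 4 bytes need 4480 bytes and
1024 slots of 8 bytes need 8576 bytes, so the element area is halved
when 32b indices are stored instead of pointers.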


^ permalink raw reply	[flat|nested] 173+ messages in thread

* [dpdk-dev] [PATCH v2 2/6] lib/ring: add template to support different element sizes
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 1/6] lib/ring: apis to support configurable " Honnappa Nagarahalli
@ 2019-09-06 19:05   ` Honnappa Nagarahalli
  2019-09-08 19:44     ` Stephen Hemminger
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 3/6] tools/checkpatch: relax constraints on __rte_experimental Honnappa Nagarahalli
                     ` (13 subsequent siblings)
  15 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-09-06 19:05 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch
  Cc: dev, pbhagavatula, jerinj, Honnappa Nagarahalli

Add templates to support creating ring APIs with different
ring element sizes.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
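The template relies on two-level token pasting to build the suffixed
API names. The following is a minimal standalone sketch of that
technique, assuming a suffix of 32 and an element type of uint32_t. The
SK_* and sk_* names are hypothetical and only mirror the shape of
__RTE_RING_CONCAT in the patch below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Parameters an instantiating header would define before the template. */
#define SK_TMPLT_API_SUFFIX 32
#define SK_TMPLT_ELEM_TYPE  uint32_t

/*
 * Two levels are needed: the outer macro lets SK_TMPLT_API_SUFFIX expand
 * to 32 before the inner macro pastes it with ##.
 */
#define sk_fuse_(a, b) a##_##b
#define sk_fuse(a, b)  sk_fuse_(a, b)
#define SK_CONCAT(a)   sk_fuse(a, SK_TMPLT_API_SUFFIX)

/* Expands to sk_copy_32(uint32_t *, uint32_t const *, unsigned int). */
static inline void
SK_CONCAT(sk_copy)(SK_TMPLT_ELEM_TYPE *dst, SK_TMPLT_ELEM_TYPE const *src,
		unsigned int n)
{
	unsigned int i;

	assert(dst != NULL && src != NULL);
	for (i = 0; i < n; i++)
		dst[i] = src[i];
}

/* Expands to sk_elem_size_32(void). */
static inline size_t
SK_CONCAT(sk_elem_size)(void)
{
	return sizeof(SK_TMPLT_ELEM_TYPE);
}
```

A caller then uses the generated names directly, for example
sk_copy_32(dst, src, n). With a single-level macro the pasted name
would be sk_copy_SK_TMPLT_API_SUFFIX instead.
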
 lib/librte_ring/Makefile            |   4 +-
 lib/librte_ring/meson.build         |   4 +-
 lib/librte_ring/rte_ring_template.c |  46 ++++
 lib/librte_ring/rte_ring_template.h | 330 ++++++++++++++++++++++++++++
 4 files changed, 382 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_template.c
 create mode 100644 lib/librte_ring/rte_ring_template.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 4c8410229..818898110 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -19,6 +19,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
 					rte_ring_generic.h \
-					rte_ring_c11_mem.h
+					rte_ring_c11_mem.h \
+					rte_ring_template.h \
+					rte_ring_template.c
 
 include $(RTE_SDK)/mk/rte.lib.mk
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index 74219840a..e4e208a7c 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -5,7 +5,9 @@ version = 2
 sources = files('rte_ring.c')
 headers = files('rte_ring.h',
 		'rte_ring_c11_mem.h',
-		'rte_ring_generic.h')
+		'rte_ring_generic.h',
+		'rte_ring_template.h',
+		'rte_ring_template.c')
 
 # rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
 allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring_template.c b/lib/librte_ring/rte_ring_template.c
new file mode 100644
index 000000000..1ca593f95
--- /dev/null
+++ b/lib/librte_ring/rte_ring_template.c
@@ -0,0 +1,46 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#include <stdio.h>
+#include <stdarg.h>
+#include <string.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_memzone.h>
+#include <rte_malloc.h>
+#include <rte_launch.h>
+#include <rte_eal.h>
+#include <rte_eal_memconfig.h>
+#include <rte_atomic.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_branch_prediction.h>
+#include <rte_errno.h>
+#include <rte_string_fns.h>
+#include <rte_spinlock.h>
+#include <rte_tailq.h>
+
+#include "rte_ring.h"
+
+/* return the size of memory occupied by a ring */
+ssize_t
+__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count)
+{
+	return rte_ring_get_memsize_elem(count, RTE_RING_TMPLT_ELEM_SIZE);
+}
+
+/* create the ring */
+struct rte_ring *
+__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
+		int socket_id, unsigned flags)
+{
+	return rte_ring_create_elem(name, count, RTE_RING_TMPLT_ELEM_SIZE,
+		socket_id, flags);
+}
diff --git a/lib/librte_ring/rte_ring_template.h b/lib/librte_ring/rte_ring_template.h
new file mode 100644
index 000000000..5002a7485
--- /dev/null
+++ b/lib/librte_ring/rte_ring_template.h
@@ -0,0 +1,330 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#ifndef _RTE_RING_TEMPLATE_H_
+#define _RTE_RING_TEMPLATE_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+#include <rte_ring.h>
+
+/* Ring API suffix name - used to append to API names */
+#ifndef RTE_RING_TMPLT_API_SUFFIX
+#error RTE_RING_TMPLT_API_SUFFIX not defined
+#endif
+
+/* Ring's element size in bytes, should be a power of 2 */
+#ifndef RTE_RING_TMPLT_ELEM_SIZE
+#error RTE_RING_TMPLT_ELEM_SIZE not defined
+#endif
+
+/* Type of ring elements */
+#ifndef RTE_RING_TMPLT_ELEM_TYPE
+#error RTE_RING_TMPLT_ELEM_TYPE not defined
+#endif
+
+#define _rte_fuse(a, b) a##_##b
+#define __rte_fuse(a, b) _rte_fuse(a, b)
+#define __RTE_RING_CONCAT(a) __rte_fuse(a, RTE_RING_TMPLT_API_SUFFIX)
+
+/* Calculate the memory size needed for a ring */
+RTE_RING_TMPLT_EXPERIMENTAL
+ssize_t __RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count);
+
+/* Create a new ring named *name* in memory. */
+RTE_RING_TMPLT_EXPERIMENTAL
+struct rte_ring *
+__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
+					int socket_id, unsigned flags);
+
+/**
+ * @internal Enqueue several objects on the ring
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(__rte_ring_do_enqueue)(struct rte_ring *r,
+		RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
+		unsigned int *free_space)
+{
+	uint32_t prod_head, prod_next;
+	uint32_t free_entries;
+
+	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
+			&prod_head, &prod_next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n,
+		RTE_RING_TMPLT_ELEM_TYPE);
+
+	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal Dequeue several objects from the ring
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(__rte_ring_do_dequeue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	enum rte_ring_queue_behavior behavior, unsigned int is_sc,
+	unsigned int *available)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
+			&cons_head, &cons_next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n,
+		RTE_RING_TMPLT_ELEM_TYPE);
+
+	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
+	unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
+	unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_enqueue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
+	unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
+}
+
+/**
+ * Enqueue one object on a ring (multi-producers safe).
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_mp_enqueue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const obj)
+{
+	return __RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(r, &obj, 1, NULL) ?
+			0 : -ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring (NOT multi-producers safe).
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_sp_enqueue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const obj)
+{
+	return __RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(r, &obj, 1, NULL) ?
+			0 : -ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring.
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_enqueue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const *obj)
+{
+	return __RTE_RING_CONCAT(rte_ring_enqueue_bulk)(r, obj, 1, NULL) ?
+			0 : -ENOBUFS;
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe).
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe).
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, __IS_SC, available);
+}
+
+/**
+ * Dequeue several objects from a ring.
+ */
+static __rte_always_inline unsigned int
+__RTE_RING_CONCAT(rte_ring_dequeue_bulk)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_FIXED, r->cons.single, available);
+}
+
+/**
+ * Dequeue one object from a ring (multi-consumers safe).
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_mc_dequeue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
+{
+	return __RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(r, obj_p, 1, NULL) ?
+			0 : -ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring (NOT multi-consumers safe).
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_sc_dequeue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
+{
+	return __RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(r, obj_p, 1, NULL) ?
+			0 : -ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring.
+ */
+static __rte_always_inline int
+__RTE_RING_CONCAT(rte_ring_dequeue)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
+{
+	return __RTE_RING_CONCAT(rte_ring_dequeue_bulk)(r, obj_p, 1, NULL) ?
+			0 : -ENOENT;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_mp_enqueue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
+			 unsigned int n, unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_sp_enqueue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
+			 unsigned int n, unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_enqueue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
+	unsigned int *free_space)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe). When more
+ * objects are requested than are available, dequeue only the available
+ * objects.
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_mc_dequeue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe). When more
+ * objects are requested than are available, dequeue only the available
+ * objects.
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_sc_dequeue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+}
+
+/**
+ * Dequeue multiple objects from a ring up to a maximum number.
+ */
+static __rte_always_inline unsigned
+__RTE_RING_CONCAT(rte_ring_dequeue_burst)(struct rte_ring *r,
+	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
+	unsigned int *available)
+{
+	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
+				RTE_RING_QUEUE_VARIABLE,
+				r->cons.single, available);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_RING_TEMPLATE_H_ */
-- 
2.17.1
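
The bulk and burst variants above differ only in the behavior argument
(RTE_RING_QUEUE_FIXED versus RTE_RING_QUEUE_VARIABLE). A minimal
single-threaded sketch of that difference follows. It is a toy ring
with hypothetical sk_* names, no atomics and no head/tail split, so it
shows the counting rules only, not the DPDK synchronisation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the FIXED / VARIABLE queue behaviors. */
enum sk_behavior { SK_QUEUE_FIXED, SK_QUEUE_VARIABLE };

#define SK_COUNT 8u			/* slots, power of 2 */
#define SK_MASK  (SK_COUNT - 1u)	/* also the usable capacity */

/* Indices are free-running and masked on access. */
struct sk_ring {
	uint32_t prod;
	uint32_t cons;
	uint32_t slots[SK_COUNT];
};

static unsigned int
sk_do_enqueue(struct sk_ring *r, const uint32_t *objs, unsigned int n,
		enum sk_behavior behavior, unsigned int *free_space)
{
	uint32_t free_entries = SK_MASK + r->cons - r->prod;
	unsigned int i;

	/* FIXED: all or nothing. VARIABLE: as many as fit. */
	if (n > free_entries)
		n = (behavior == SK_QUEUE_FIXED) ? 0 : free_entries;

	for (i = 0; i < n; i++)
		r->slots[(r->prod + i) & SK_MASK] = objs[i];
	r->prod += n;

	if (free_space != NULL)
		*free_space = free_entries - n;
	return n;
}

static unsigned int
sk_do_dequeue(struct sk_ring *r, uint32_t *objs, unsigned int n,
		enum sk_behavior behavior, unsigned int *available)
{
	uint32_t entries = r->prod - r->cons;
	unsigned int i;

	if (n > entries)
		n = (behavior == SK_QUEUE_FIXED) ? 0 : entries;

	for (i = 0; i < n; i++)
		objs[i] = r->slots[(r->cons + i) & SK_MASK];
	r->cons += n;

	if (available != NULL)
		*available = entries - n;
	return n;
}
```

On an empty 8-slot toy ring, a FIXED enqueue of 10 elements stores
nothing and reports 7 free entries, while a VARIABLE enqueue of the
same 10 stores 7 and reports 0.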



* [dpdk-dev] [PATCH v2 3/6] tools/checkpatch: relax constraints on __rte_experimental
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 1/6] lib/ring: apis to support configurable " Honnappa Nagarahalli
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 2/6] lib/ring: add template to support different element sizes Honnappa Nagarahalli
@ 2019-09-06 19:05   ` Honnappa Nagarahalli
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 4/6] lib/ring: add ring APIs to support 32b ring elements Honnappa Nagarahalli
                     ` (12 subsequent siblings)
  15 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-09-06 19:05 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch
  Cc: dev, pbhagavatula, jerinj, Honnappa Nagarahalli

Relax the constraints on __rte_experimental usage so that a macro can be
defined as __rte_experimental.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 devtools/checkpatches.sh | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/devtools/checkpatches.sh b/devtools/checkpatches.sh
index 560e6ce93..090c9b08a 100755
--- a/devtools/checkpatches.sh
+++ b/devtools/checkpatches.sh
@@ -99,9 +99,14 @@ check_experimental_tags() { # <patch>
 			ret = 1;
 		}
 		if ($1 != "+__rte_experimental" || $2 != "") {
-			print "__rte_experimental must appear alone on the line" \
-				" immediately preceding the return type of a function."
-			ret = 1;
+			# code such as "#define XYZ __rte_experimental" is
+			# allowed
+			if ($1 != "+#define") {
+				print "__rte_experimental must appear alone " \
+				      "on the line immediately preceding the " \
+				      "return type of a function."
+				ret = 1;
+			}
 		}
 	}
 	END {
-- 
2.17.1



* [dpdk-dev] [PATCH v2 4/6] lib/ring: add ring APIs to support 32b ring elements
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (2 preceding siblings ...)
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 3/6] tools/checkpatch: relax constraints on __rte_experimental Honnappa Nagarahalli
@ 2019-09-06 19:05   ` Honnappa Nagarahalli
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 5/6] lib/hash: use ring with 32b element size to save memory Honnappa Nagarahalli
                     ` (11 subsequent siblings)
  15 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-09-06 19:05 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch
  Cc: dev, pbhagavatula, jerinj, Honnappa Nagarahalli

Add ring APIs to support 32b ring elements using templates.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
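For orientation, the single-element calls generated by the template
take the element by value on the mp/sp enqueue side and through a
pointer on the dequeue side, returning -ENOBUFS and -ENOENT on failure.
The sketch below mimics that calling convention on a toy
single-threaded ring. The sk_ring32_* names are hypothetical, and the
toy has none of the multi-producer/consumer logic of the real ring.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define SK32_COUNT 4u			/* slots, power of 2 */
#define SK32_MASK  (SK32_COUNT - 1u)	/* usable capacity is count - 1 */

struct sk_ring32 {
	uint32_t prod;
	uint32_t cons;
	uint32_t slots[SK32_COUNT];
};

/* Enqueue one element by value; 0 on success, -ENOBUFS when full. */
static int
sk_ring32_sp_enqueue(struct sk_ring32 *r, uint32_t obj)
{
	assert(r != NULL);
	if (r->prod - r->cons == SK32_MASK)
		return -ENOBUFS;
	r->slots[r->prod & SK32_MASK] = obj;
	r->prod++;
	return 0;
}

/* Dequeue one element via a pointer; 0 on success, -ENOENT when empty. */
static int
sk_ring32_sc_dequeue(struct sk_ring32 *r, uint32_t *obj_p)
{
	assert(r != NULL && obj_p != NULL);
	if (r->prod == r->cons)
		return -ENOENT;
	*obj_p = r->slots[r->cons & SK32_MASK];
	r->cons++;
	return 0;
}
```

With 4 slots the toy ring accepts 3 elements before the enqueue
reports -ENOBUFS, matching the count-1 usable size documented for
rte_ring_create_elem().
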
 lib/librte_ring/Makefile             |  3 ++-
 lib/librte_ring/meson.build          |  4 +++-
 lib/librte_ring/rte_ring_32.c        | 19 +++++++++++++++
 lib/librte_ring/rte_ring_32.h        | 36 ++++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |  2 ++
 5 files changed, 62 insertions(+), 2 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_32.c
 create mode 100644 lib/librte_ring/rte_ring_32.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 818898110..3102bb64d 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -14,10 +14,11 @@ EXPORT_MAP := rte_ring_version.map
 LIBABIVER := 2
 
 # all source are stored in SRCS-y
-SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
+SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c rte_ring_32.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+					rte_ring_32.h \
 					rte_ring_generic.h \
 					rte_ring_c11_mem.h \
 					rte_ring_template.h \
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index e4e208a7c..81ea53ed7 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -2,8 +2,10 @@
 # Copyright(c) 2017 Intel Corporation
 
 version = 2
-sources = files('rte_ring.c')
+sources = files('rte_ring.c',
+		'rte_ring_32.c')
 headers = files('rte_ring.h',
+		'rte_ring_32.h',
 		'rte_ring_c11_mem.h',
 		'rte_ring_generic.h',
 		'rte_ring_template.h',
diff --git a/lib/librte_ring/rte_ring_32.c b/lib/librte_ring/rte_ring_32.c
new file mode 100644
index 000000000..09e90cec1
--- /dev/null
+++ b/lib/librte_ring/rte_ring_32.c
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#include <rte_ring_32.h>
+#include <rte_ring_template.c>
diff --git a/lib/librte_ring/rte_ring_32.h b/lib/librte_ring/rte_ring_32.h
new file mode 100644
index 000000000..5270a9bc7
--- /dev/null
+++ b/lib/librte_ring/rte_ring_32.h
@@ -0,0 +1,36 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#ifndef _RTE_RING_32_H_
+#define _RTE_RING_32_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#define RTE_RING_TMPLT_API_SUFFIX 32
+#define RTE_RING_TMPLT_ELEM_SIZE sizeof(uint32_t)
+#define RTE_RING_TMPLT_ELEM_TYPE uint32_t
+#define RTE_RING_TMPLT_EXPERIMENTAL __rte_experimental
+
+#include <rte_ring_template.h>
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_RING_32_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index e410a7503..9efba91bb 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -21,7 +21,9 @@ DPDK_2.2 {
 EXPERIMENTAL {
 	global:
 
+	rte_ring_create_32;
 	rte_ring_create_elem;
+	rte_ring_get_memsize_32;
 	rte_ring_get_memsize_elem;
 	rte_ring_reset;
 
-- 
2.17.1



* [dpdk-dev] [PATCH v2 5/6] lib/hash: use ring with 32b element size to save memory
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (3 preceding siblings ...)
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 4/6] lib/ring: add ring APIs to support 32b ring elements Honnappa Nagarahalli
@ 2019-09-06 19:05   ` Honnappa Nagarahalli
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 6/6] lib/eventdev: use ring templates for event rings Honnappa Nagarahalli
                     ` (10 subsequent siblings)
  15 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-09-06 19:05 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch
  Cc: dev, pbhagavatula, jerinj, Honnappa Nagarahalli

The freelist and external bucket indices are 32b. Using rings
that use 32b element sizes will save memory.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
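A minimal sketch of what changes for the hash library's free lists: the
old code carried each 32b index in a pointer-sized slot through
uintptr_t casts, while the new code stores the uint32_t directly. The
sk_* helpers are hypothetical. The last two helpers illustrate one
thing to watch once the cast is gone: the key offset multiply is only
as wide as its operands. Whether that matters here depends on the type
of h->key_entry_size, which this patch does not show; the sketch
assumes 32 bits.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Old scheme: a 32b index carried in a pointer-sized ring slot. */
static inline void *
sk_idx_to_slot(uint32_t idx)
{
	return (void *)((uintptr_t)idx);
}

static inline uint32_t
sk_slot_to_idx(void *slot)
{
	return (uint32_t)((uintptr_t)slot);
}

/* Bytes of element storage for count slots of a given element size. */
static inline size_t
sk_slot_bytes(size_t count, size_t esize)
{
	return count * esize;
}

/* Offset computed in 32 bits: wraps once it exceeds UINT32_MAX. */
static inline uint32_t
sk_key_offset_32(uint32_t slot_id, uint32_t key_entry_size)
{
	return slot_id * key_entry_size;
}

/* Offset widened before the multiply, so it does not wrap. */
static inline uint64_t
sk_key_offset_64(uint32_t slot_id, uint32_t key_entry_size)
{
	return (uint64_t)slot_id * key_entry_size;
}
```

For 1024 slots the element storage is 4096 bytes with uint32_t
elements, against 1024 * sizeof(void *) with pointer elements.
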
 lib/librte_hash/rte_cuckoo_hash.c | 55 ++++++++++++++-----------------
 lib/librte_hash/rte_cuckoo_hash.h |  2 +-
 2 files changed, 26 insertions(+), 31 deletions(-)

diff --git a/lib/librte_hash/rte_cuckoo_hash.c b/lib/librte_hash/rte_cuckoo_hash.c
index 87a4c01f2..a0cd3360a 100644
--- a/lib/librte_hash/rte_cuckoo_hash.c
+++ b/lib/librte_hash/rte_cuckoo_hash.c
@@ -24,7 +24,7 @@
 #include <rte_cpuflags.h>
 #include <rte_rwlock.h>
 #include <rte_spinlock.h>
-#include <rte_ring.h>
+#include <rte_ring_32.h>
 #include <rte_compat.h>
 #include <rte_vect.h>
 #include <rte_tailq.h>
@@ -213,7 +213,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 
 	snprintf(ring_name, sizeof(ring_name), "HT_%s", params->name);
 	/* Create ring (Dummy slot index is not enqueued) */
-	r = rte_ring_create(ring_name, rte_align32pow2(num_key_slots),
+	r = rte_ring_create_32(ring_name, rte_align32pow2(num_key_slots),
 			params->socket_id, 0);
 	if (r == NULL) {
 		RTE_LOG(ERR, HASH, "memory allocation failed\n");
@@ -227,7 +227,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 	if (ext_table_support) {
 		snprintf(ext_ring_name, sizeof(ext_ring_name), "HT_EXT_%s",
 								params->name);
-		r_ext = rte_ring_create(ext_ring_name,
+		r_ext = rte_ring_create_32(ext_ring_name,
 				rte_align32pow2(num_buckets + 1),
 				params->socket_id, 0);
 
@@ -295,7 +295,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 		 * for next bucket
 		 */
 		for (i = 1; i <= num_buckets; i++)
-			rte_ring_sp_enqueue(r_ext, (void *)((uintptr_t) i));
+			rte_ring_sp_enqueue_32(r_ext, i);
 
 		if (readwrite_concur_lf_support) {
 			ext_bkt_to_free = rte_zmalloc(NULL, sizeof(uint32_t) *
@@ -434,7 +434,7 @@ rte_hash_create(const struct rte_hash_parameters *params)
 
 	/* Populate free slots ring. Entry zero is reserved for key misses. */
 	for (i = 1; i < num_key_slots; i++)
-		rte_ring_sp_enqueue(r, (void *)((uintptr_t) i));
+		rte_ring_sp_enqueue_32(r, i);
 
 	te->data = (void *) h;
 	TAILQ_INSERT_TAIL(hash_list, te, next);
@@ -598,13 +598,12 @@ rte_hash_reset(struct rte_hash *h)
 		tot_ring_cnt = h->entries;
 
 	for (i = 1; i < tot_ring_cnt + 1; i++)
-		rte_ring_sp_enqueue(h->free_slots, (void *)((uintptr_t) i));
+		rte_ring_sp_enqueue_32(h->free_slots, i);
 
 	/* Repopulate the free ext bkt ring. */
 	if (h->ext_table_support) {
 		for (i = 1; i <= h->num_buckets; i++)
-			rte_ring_sp_enqueue(h->free_ext_bkts,
-						(void *)((uintptr_t) i));
+			rte_ring_sp_enqueue_32(h->free_ext_bkts, i);
 	}
 
 	if (h->use_local_cache) {
@@ -623,13 +622,13 @@ rte_hash_reset(struct rte_hash *h)
 static inline void
 enqueue_slot_back(const struct rte_hash *h,
 		struct lcore_cache *cached_free_slots,
-		void *slot_id)
+		uint32_t slot_id)
 {
 	if (h->use_local_cache) {
 		cached_free_slots->objs[cached_free_slots->len] = slot_id;
 		cached_free_slots->len++;
 	} else
-		rte_ring_sp_enqueue(h->free_slots, slot_id);
+		rte_ring_sp_enqueue_32(h->free_slots, slot_id);
 }
 
 /* Search a key from bucket and update its data.
@@ -923,8 +922,8 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 	uint32_t prim_bucket_idx, sec_bucket_idx;
 	struct rte_hash_bucket *prim_bkt, *sec_bkt, *cur_bkt;
 	struct rte_hash_key *new_k, *keys = h->key_store;
-	void *slot_id = NULL;
-	void *ext_bkt_id = NULL;
+	uint32_t slot_id = 0;
+	uint32_t ext_bkt_id = 0;
 	uint32_t new_idx, bkt_id;
 	int ret;
 	unsigned n_slots;
@@ -968,7 +967,7 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		/* Try to get a free slot from the local cache */
 		if (cached_free_slots->len == 0) {
 			/* Need to get another burst of free slots from global ring */
-			n_slots = rte_ring_mc_dequeue_burst(h->free_slots,
+			n_slots = rte_ring_mc_dequeue_burst_32(h->free_slots,
 					cached_free_slots->objs,
 					LCORE_CACHE_SIZE, NULL);
 			if (n_slots == 0) {
@@ -982,13 +981,12 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 		cached_free_slots->len--;
 		slot_id = cached_free_slots->objs[cached_free_slots->len];
 	} else {
-		if (rte_ring_sc_dequeue(h->free_slots, &slot_id) != 0) {
+		if (rte_ring_sc_dequeue_32(h->free_slots, &slot_id) != 0)
 			return -ENOSPC;
-		}
 	}
 
-	new_k = RTE_PTR_ADD(keys, (uintptr_t)slot_id * h->key_entry_size);
-	new_idx = (uint32_t)((uintptr_t) slot_id);
+	new_k = RTE_PTR_ADD(keys, (uintptr_t)slot_id * h->key_entry_size);
+	new_idx = slot_id;
 	/* The store to application data (by the application) at *data should
 	 * not leak after the store of pdata in the key store. i.e. pdata is
 	 * the guard variable. Release the application data to the readers.
@@ -1078,12 +1076,12 @@ __rte_hash_add_key_with_hash(const struct rte_hash *h, const void *key,
 	/* Failed to get an empty entry from extendable buckets. Link a new
 	 * extendable bucket. We first get a free bucket from ring.
 	 */
-	if (rte_ring_sc_dequeue(h->free_ext_bkts, &ext_bkt_id) != 0) {
+	if (rte_ring_sc_dequeue_32(h->free_ext_bkts, &ext_bkt_id) != 0) {
 		ret = -ENOSPC;
 		goto failure;
 	}
 
-	bkt_id = (uint32_t)((uintptr_t)ext_bkt_id) - 1;
+	bkt_id = ext_bkt_id - 1;
 	/* Use the first location of the new bucket */
 	(h->buckets_ext[bkt_id]).sig_current[0] = short_sig;
 	/* Store to signature and key should not leak after
@@ -1373,7 +1371,7 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 		/* Cache full, need to free it. */
 		if (cached_free_slots->len == LCORE_CACHE_SIZE) {
 			/* Need to enqueue the free slots in global ring. */
-			n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
+			n_slots = rte_ring_mp_enqueue_burst_32(h->free_slots,
 						cached_free_slots->objs,
 						LCORE_CACHE_SIZE, NULL);
 			ERR_IF_TRUE((n_slots == 0),
@@ -1383,11 +1381,10 @@ remove_entry(const struct rte_hash *h, struct rte_hash_bucket *bkt, unsigned i)
 		}
 		/* Put index of new free slot in cache. */
 		cached_free_slots->objs[cached_free_slots->len] =
-				(void *)((uintptr_t)bkt->key_idx[i]);
+				bkt->key_idx[i];
 		cached_free_slots->len++;
 	} else {
-		rte_ring_sp_enqueue(h->free_slots,
-				(void *)((uintptr_t)bkt->key_idx[i]));
+		rte_ring_sp_enqueue_32(h->free_slots, bkt->key_idx[i]);
 	}
 }
 
@@ -1551,7 +1548,7 @@ __rte_hash_del_key_with_hash(const struct rte_hash *h, const void *key,
 			 */
 			h->ext_bkt_to_free[ret] = index;
 		else
-			rte_ring_sp_enqueue(h->free_ext_bkts, (void *)(uintptr_t)index);
+			rte_ring_sp_enqueue_32(h->free_ext_bkts, index);
 	}
 	__hash_rw_writer_unlock(h);
 	return ret;
@@ -1614,7 +1611,7 @@ rte_hash_free_key_with_position(const struct rte_hash *h,
 		uint32_t index = h->ext_bkt_to_free[position];
 		if (index) {
 			/* Recycle empty ext bkt to free list. */
-			rte_ring_sp_enqueue(h->free_ext_bkts, (void *)(uintptr_t)index);
+			rte_ring_sp_enqueue_32(h->free_ext_bkts, index);
 			h->ext_bkt_to_free[position] = 0;
 		}
 	}
@@ -1625,19 +1622,17 @@ rte_hash_free_key_with_position(const struct rte_hash *h,
 		/* Cache full, need to free it. */
 		if (cached_free_slots->len == LCORE_CACHE_SIZE) {
 			/* Need to enqueue the free slots in global ring. */
-			n_slots = rte_ring_mp_enqueue_burst(h->free_slots,
+			n_slots = rte_ring_mp_enqueue_burst_32(h->free_slots,
 						cached_free_slots->objs,
 						LCORE_CACHE_SIZE, NULL);
 			RETURN_IF_TRUE((n_slots == 0), -EFAULT);
 			cached_free_slots->len -= n_slots;
 		}
 		/* Put index of new free slot in cache. */
-		cached_free_slots->objs[cached_free_slots->len] =
-					(void *)((uintptr_t)key_idx);
+		cached_free_slots->objs[cached_free_slots->len] = key_idx;
 		cached_free_slots->len++;
 	} else {
-		rte_ring_sp_enqueue(h->free_slots,
-				(void *)((uintptr_t)key_idx));
+		rte_ring_sp_enqueue_32(h->free_slots, key_idx);
 	}
 
 	return 0;
diff --git a/lib/librte_hash/rte_cuckoo_hash.h b/lib/librte_hash/rte_cuckoo_hash.h
index fb19bb27d..345de6bf9 100644
--- a/lib/librte_hash/rte_cuckoo_hash.h
+++ b/lib/librte_hash/rte_cuckoo_hash.h
@@ -124,7 +124,7 @@ const rte_hash_cmp_eq_t cmp_jump_table[NUM_KEY_CMP_CASES] = {
 
 struct lcore_cache {
 	unsigned len; /**< Cache len */
-	void *objs[LCORE_CACHE_SIZE]; /**< Cache objects */
+	uint32_t objs[LCORE_CACHE_SIZE]; /**< Cache objects */
 } __rte_cache_aligned;
 
 /* Structure that stores key-value pair */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* [dpdk-dev] [PATCH v2 6/6] lib/eventdev: use ring templates for event rings
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (4 preceding siblings ...)
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 5/6] lib/hash: use ring with 32b element size to save memory Honnappa Nagarahalli
@ 2019-09-06 19:05   ` Honnappa Nagarahalli
  2019-09-09 13:04   ` [dpdk-dev] [PATCH v2 0/6] lib/ring: templates to support custom element size Aaron Conole
                     ` (9 subsequent siblings)
  15 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-09-06 19:05 UTC (permalink / raw)
  To: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch
  Cc: dev, pbhagavatula, jerinj, Honnappa Nagarahalli

Use rte_ring templates to define ring APIs for the 128b ring element
type. However, generic 128b ring APIs are not defined, since using
them would require changing 'struct rte_event', which would result in
API changes.

Suggested-by: Jerin Jacob Kollanukkaran <jerinj@marvell.com>
Suggested-by: Pavan Nikhilesh Bhagavatula <pbhagavatula@marvell.com>
Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
---
 lib/librte_eventdev/Makefile              |   2 +
 lib/librte_eventdev/meson.build           |   2 +
 lib/librte_eventdev/rte_event_ring.c      | 146 +---------------------
 lib/librte_eventdev/rte_event_ring.h      |  41 +-----
 lib/librte_eventdev/rte_event_ring_128b.c |  19 +++
 lib/librte_eventdev/rte_event_ring_128b.h |  44 +++++++
 6 files changed, 78 insertions(+), 176 deletions(-)
 create mode 100644 lib/librte_eventdev/rte_event_ring_128b.c
 create mode 100644 lib/librte_eventdev/rte_event_ring_128b.h

diff --git a/lib/librte_eventdev/Makefile b/lib/librte_eventdev/Makefile
index cd3ff8040..4c76bbdf3 100644
--- a/lib/librte_eventdev/Makefile
+++ b/lib/librte_eventdev/Makefile
@@ -24,6 +24,7 @@ LDLIBS += -lrte_mbuf -lrte_cryptodev -lpthread
 
 # library source files
 SRCS-y += rte_eventdev.c
+SRCS-y += rte_event_ring_128b.c
 SRCS-y += rte_event_ring.c
 SRCS-y += rte_event_eth_rx_adapter.c
 SRCS-y += rte_event_timer_adapter.c
@@ -35,6 +36,7 @@ SYMLINK-y-include += rte_eventdev.h
 SYMLINK-y-include += rte_eventdev_pmd.h
 SYMLINK-y-include += rte_eventdev_pmd_pci.h
 SYMLINK-y-include += rte_eventdev_pmd_vdev.h
+SYMLINK-y-include += rte_event_ring_128b.h
 SYMLINK-y-include += rte_event_ring.h
 SYMLINK-y-include += rte_event_eth_rx_adapter.h
 SYMLINK-y-include += rte_event_timer_adapter.h
diff --git a/lib/librte_eventdev/meson.build b/lib/librte_eventdev/meson.build
index 19541f23f..8a0fd7332 100644
--- a/lib/librte_eventdev/meson.build
+++ b/lib/librte_eventdev/meson.build
@@ -11,6 +11,7 @@ else
 endif
 
 sources = files('rte_eventdev.c',
+		'rte_event_ring_128b.c',
 		'rte_event_ring.c',
 		'rte_event_eth_rx_adapter.c',
 		'rte_event_timer_adapter.c',
@@ -20,6 +21,7 @@ headers = files('rte_eventdev.h',
 		'rte_eventdev_pmd.h',
 		'rte_eventdev_pmd_pci.h',
 		'rte_eventdev_pmd_vdev.h',
+		'rte_event_ring_128b.h',
 		'rte_event_ring.h',
 		'rte_event_eth_rx_adapter.h',
 		'rte_event_timer_adapter.h',
diff --git a/lib/librte_eventdev/rte_event_ring.c b/lib/librte_eventdev/rte_event_ring.c
index 50190de01..479db53ea 100644
--- a/lib/librte_eventdev/rte_event_ring.c
+++ b/lib/librte_eventdev/rte_event_ring.c
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2017 Intel Corporation
+ * Copyright(c) 2019 Arm Limited
  */
 
 #include <sys/queue.h>
@@ -11,13 +12,6 @@
 #include <rte_eal_memconfig.h>
 #include "rte_event_ring.h"
 
-TAILQ_HEAD(rte_event_ring_list, rte_tailq_entry);
-
-static struct rte_tailq_elem rte_event_ring_tailq = {
-	.name = RTE_TAILQ_EVENT_RING_NAME,
-};
-EAL_REGISTER_TAILQ(rte_event_ring_tailq)
-
 int
 rte_event_ring_init(struct rte_event_ring *r, const char *name,
 	unsigned int count, unsigned int flags)
@@ -35,150 +29,20 @@ struct rte_event_ring *
 rte_event_ring_create(const char *name, unsigned int count, int socket_id,
 		unsigned int flags)
 {
-	char mz_name[RTE_MEMZONE_NAMESIZE];
-	struct rte_event_ring *r;
-	struct rte_tailq_entry *te;
-	const struct rte_memzone *mz;
-	ssize_t ring_size;
-	int mz_flags = 0;
-	struct rte_event_ring_list *ring_list = NULL;
-	const unsigned int requested_count = count;
-	int ret;
-
-	ring_list = RTE_TAILQ_CAST(rte_event_ring_tailq.head,
-		rte_event_ring_list);
-
-	/* for an exact size ring, round up from count to a power of two */
-	if (flags & RING_F_EXACT_SZ)
-		count = rte_align32pow2(count + 1);
-	else if (!rte_is_power_of_2(count)) {
-		rte_errno = EINVAL;
-		return NULL;
-	}
-
-	ring_size = sizeof(*r) + (count * sizeof(struct rte_event));
-
-	ret = snprintf(mz_name, sizeof(mz_name), "%s%s",
-		RTE_RING_MZ_PREFIX, name);
-	if (ret < 0 || ret >= (int)sizeof(mz_name)) {
-		rte_errno = ENAMETOOLONG;
-		return NULL;
-	}
-
-	te = rte_zmalloc("RING_TAILQ_ENTRY", sizeof(*te), 0);
-	if (te == NULL) {
-		RTE_LOG(ERR, RING, "Cannot reserve memory for tailq\n");
-		rte_errno = ENOMEM;
-		return NULL;
-	}
-
-	rte_mcfg_tailq_write_lock();
-
-	/*
-	 * reserve a memory zone for this ring. If we can't get rte_config or
-	 * we are secondary process, the memzone_reserve function will set
-	 * rte_errno for us appropriately - hence no check in this this function
-	 */
-	mz = rte_memzone_reserve(mz_name, ring_size, socket_id, mz_flags);
-	if (mz != NULL) {
-		r = mz->addr;
-		/* Check return value in case rte_ring_init() fails on size */
-		int err = rte_event_ring_init(r, name, requested_count, flags);
-		if (err) {
-			RTE_LOG(ERR, RING, "Ring init failed\n");
-			if (rte_memzone_free(mz) != 0)
-				RTE_LOG(ERR, RING, "Cannot free memzone\n");
-			rte_free(te);
-			rte_mcfg_tailq_write_unlock();
-			return NULL;
-		}
-
-		te->data = (void *) r;
-		r->r.memzone = mz;
-
-		TAILQ_INSERT_TAIL(ring_list, te, next);
-	} else {
-		r = NULL;
-		RTE_LOG(ERR, RING, "Cannot reserve memory\n");
-		rte_free(te);
-	}
-	rte_mcfg_tailq_write_unlock();
-
-	return r;
+	return (struct rte_event_ring *)rte_ring_create_event_128b(name, count,
+						socket_id, flags);
 }
 
 
 struct rte_event_ring *
 rte_event_ring_lookup(const char *name)
 {
-	struct rte_tailq_entry *te;
-	struct rte_event_ring *r = NULL;
-	struct rte_event_ring_list *ring_list;
-
-	ring_list = RTE_TAILQ_CAST(rte_event_ring_tailq.head,
-			rte_event_ring_list);
-
-	rte_mcfg_tailq_read_lock();
-
-	TAILQ_FOREACH(te, ring_list, next) {
-		r = (struct rte_event_ring *) te->data;
-		if (strncmp(name, r->r.name, RTE_RING_NAMESIZE) == 0)
-			break;
-	}
-
-	rte_mcfg_tailq_read_unlock();
-
-	if (te == NULL) {
-		rte_errno = ENOENT;
-		return NULL;
-	}
-
-	return r;
+	return (struct rte_event_ring *)rte_ring_lookup(name);
 }
 
 /* free the ring */
 void
 rte_event_ring_free(struct rte_event_ring *r)
 {
-	struct rte_event_ring_list *ring_list = NULL;
-	struct rte_tailq_entry *te;
-
-	if (r == NULL)
-		return;
-
-	/*
-	 * Ring was not created with rte_event_ring_create,
-	 * therefore, there is no memzone to free.
-	 */
-	if (r->r.memzone == NULL) {
-		RTE_LOG(ERR, RING,
-			"Cannot free ring (not created with rte_event_ring_create()");
-		return;
-	}
-
-	if (rte_memzone_free(r->r.memzone) != 0) {
-		RTE_LOG(ERR, RING, "Cannot free memory\n");
-		return;
-	}
-
-	ring_list = RTE_TAILQ_CAST(rte_event_ring_tailq.head,
-			rte_event_ring_list);
-	rte_mcfg_tailq_write_lock();
-
-	/* find out tailq entry */
-	TAILQ_FOREACH(te, ring_list, next) {
-		if (te->data == (void *) r)
-			break;
-	}
-
-	if (te == NULL) {
-		rte_mcfg_tailq_write_unlock();
-		return;
-	}
-
-	TAILQ_REMOVE(ring_list, te, next);
-
-	rte_mcfg_tailq_write_unlock();
-
-	rte_free(te);
+	rte_ring_free(&r->r);
 }
diff --git a/lib/librte_eventdev/rte_event_ring.h b/lib/librte_eventdev/rte_event_ring.h
index 827a3209e..4553c0076 100644
--- a/lib/librte_eventdev/rte_event_ring.h
+++ b/lib/librte_eventdev/rte_event_ring.h
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: BSD-3-Clause
  * Copyright(c) 2016-2017 Intel Corporation
+ * Copyright(c) 2019 Arm Limited
  */
 
 /**
@@ -20,8 +21,7 @@
 #include <rte_malloc.h>
 #include <rte_ring.h>
 #include "rte_eventdev.h"
-
-#define RTE_TAILQ_EVENT_RING_NAME "RTE_EVENT_RING"
+#include "rte_event_ring_128b.h"
 
 /**
  * Generic ring structure for passing rte_event objects from core to core.
@@ -88,22 +88,8 @@ rte_event_ring_enqueue_burst(struct rte_event_ring *r,
 		const struct rte_event *events,
 		unsigned int n, uint16_t *free_space)
 {
-	uint32_t prod_head, prod_next;
-	uint32_t free_entries;
-
-	n = __rte_ring_move_prod_head(&r->r, r->r.prod.single, n,
-			RTE_RING_QUEUE_VARIABLE,
-			&prod_head, &prod_next, &free_entries);
-	if (n == 0)
-		goto end;
-
-	ENQUEUE_PTRS(&r->r, &r[1], prod_head, events, n, struct rte_event);
-
-	update_tail(&r->r.prod, prod_head, prod_next, r->r.prod.single, 1);
-end:
-	if (free_space != NULL)
-		*free_space = free_entries - n;
-	return n;
+	return rte_ring_enqueue_burst_event_128b(&r->r, events, n,
+							(uint32_t *)free_space);
 }
 
 /**
@@ -129,23 +115,8 @@ rte_event_ring_dequeue_burst(struct rte_event_ring *r,
 		struct rte_event *events,
 		unsigned int n, uint16_t *available)
 {
-	uint32_t cons_head, cons_next;
-	uint32_t entries;
-
-	n = __rte_ring_move_cons_head(&r->r, r->r.cons.single, n,
-			RTE_RING_QUEUE_VARIABLE,
-			&cons_head, &cons_next, &entries);
-	if (n == 0)
-		goto end;
-
-	DEQUEUE_PTRS(&r->r, &r[1], cons_head, events, n, struct rte_event);
-
-	update_tail(&r->r.cons, cons_head, cons_next, r->r.cons.single, 0);
-
-end:
-	if (available != NULL)
-		*available = entries - n;
-	return n;
+	return rte_ring_dequeue_burst_event_128b(&r->r, events, n,
+							(uint32_t *)available);
 }
 
 /*
diff --git a/lib/librte_eventdev/rte_event_ring_128b.c b/lib/librte_eventdev/rte_event_ring_128b.c
new file mode 100644
index 000000000..5e4105a2f
--- /dev/null
+++ b/lib/librte_eventdev/rte_event_ring_128b.c
@@ -0,0 +1,19 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#include <rte_event_ring_128b.h>
+#include <rte_ring_template.c>
diff --git a/lib/librte_eventdev/rte_event_ring_128b.h b/lib/librte_eventdev/rte_event_ring_128b.h
new file mode 100644
index 000000000..3079d7b49
--- /dev/null
+++ b/lib/librte_eventdev/rte_event_ring_128b.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright (c) 2019 Arm Limited
+ */
+
+#ifndef _RTE_EVENT_RING_128_H_
+#define _RTE_EVENT_RING_128_H_
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+#include "rte_eventdev.h"
+
+/* Event ring will use its own template. Otherwise, the 'struct rte_event'
+ * needs to change to 'union rte_event' to include a standard 128b data type
+ * such as __int128_t which results in API changes.
+ *
+ * The RTE_RING_TMPLT_API_SUFFIX cannot be just '128b' as that will be
+ * used for standard 128b element type APIs defined by the rte_ring library.
+ */
+#define RTE_RING_TMPLT_API_SUFFIX event_128b
+#define RTE_RING_TMPLT_ELEM_SIZE sizeof(struct rte_event)
+#define RTE_RING_TMPLT_ELEM_TYPE struct rte_event
+#define RTE_RING_TMPLT_EXPERIMENTAL
+
+#include <rte_ring_template.h>
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_EVENT_RING_128_H_ */
-- 
2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] lib/ring: add template to support different element sizes
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 2/6] lib/ring: add template to support different element sizes Honnappa Nagarahalli
@ 2019-09-08 19:44     ` Stephen Hemminger
  2019-09-09  9:01       ` Bruce Richardson
  0 siblings, 1 reply; 173+ messages in thread
From: Stephen Hemminger @ 2019-09-08 19:44 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch, dev, pbhagavatula, jerinj

On Fri,  6 Sep 2019 14:05:06 -0500
Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:

> Add templates to support creating ring APIs with different
> ring element sizes.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>

Understand the desire for generic code, but macros are much harder to maintain
and debug. Would it be possible to use inline code taking a size argument
and let compiler optimizations with constant folding do the same thing?

Sorry, I vote NO for large-scale use of macros.
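
For illustration, the "inline code taking a size argument" idea could look roughly like the sketch below. It is standalone C, not DPDK code: ring_copy_in() and ring_copy_in_u32() are hypothetical names, and the ring is assumed to be a flat, power-of-two sized array. Because the wrapper passes sizeof(uint32_t) as a compile-time constant, an optimizing compiler can usually fold the memcpy() into a plain 4-byte store, though that is not guaranteed.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not a DPDK API): copy n elements of esize bytes
 * each into a power-of-two sized ring, starting at slot idx and wrapping
 * around using mask (ring size minus one).
 */
static inline void
ring_copy_in(void *ring_base, uint32_t mask, uint32_t idx,
	     const void *objs, uint32_t n, size_t esize)
{
	uint8_t *base = ring_base;
	const uint8_t *src = objs;
	uint32_t i;

	for (i = 0; i < n; i++)
		memcpy(base + (size_t)((idx + i) & mask) * esize,
		       src + (size_t)i * esize, esize);
}

/* Thin typed wrapper: esize is a constant at this call site, so after
 * inlining the compiler can specialise the generic loop for 4-byte
 * elements.
 */
static inline void
ring_copy_in_u32(uint32_t *ring_base, uint32_t mask, uint32_t idx,
		 const uint32_t *objs, uint32_t n)
{
	ring_copy_in(ring_base, mask, idx, objs, n, sizeof(uint32_t));
}
```

With this shape, supporting a new element size needs only another one-line wrapper rather than a new macro expansion.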

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] lib/ring: add template to support different element sizes
  2019-09-08 19:44     ` Stephen Hemminger
@ 2019-09-09  9:01       ` Bruce Richardson
  2019-09-09 22:33         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: Bruce Richardson @ 2019-09-09  9:01 UTC (permalink / raw)
  To: Stephen Hemminger
  Cc: Honnappa Nagarahalli, olivier.matz, yipeng1.wang, sameh.gobriel,
	pablo.de.lara.guarch, dev, pbhagavatula, jerinj

On Sun, Sep 08, 2019 at 08:44:36PM +0100, Stephen Hemminger wrote:
> On Fri,  6 Sep 2019 14:05:06 -0500
> Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
> 
> > Add templates to support creating ring APIs with different
> > ring element sizes.
> > 
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> 
> Understand the desire for generic code, but macros are much harder to maintain
> and debug. Would it be possible to use inline code taking a size argument
> and let compiler optimizations with constant folding do the same thing?
> 
> Sorry, I vote NO for large-scale use of macros.

I would tend to agree. This use of macros makes the code very awkward to
read and understand.

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/6] lib/ring: templates to support custom element size
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (5 preceding siblings ...)
  2019-09-06 19:05   ` [dpdk-dev] [PATCH v2 6/6] lib/eventdev: use ring templates for event rings Honnappa Nagarahalli
@ 2019-09-09 13:04   ` Aaron Conole
  2019-10-07 13:49   ` David Marchand
                     ` (8 subsequent siblings)
  15 siblings, 0 replies; 173+ messages in thread
From: Aaron Conole @ 2019-09-09 13:04 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: olivier.matz, yipeng1.wang, sameh.gobriel, bruce.richardson,
	pablo.de.lara.guarch, dev, pbhagavatula, jerinj

Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> writes:

> The current rte_ring hard-codes the type of the ring element to 'void *',
> hence the size of the element is hard-coded to 32b/64b. Since the ring
> element type is not an input to rte_ring APIs, it results in couple
> of issues:
>
> 1) If an application requires to store an element which is not 64b, it
>    needs to write its own ring APIs similar to rte_event_ring APIs. This
>    creates additional burden on the programmers, who end up making
>    work-arounds and often waste memory.
> 2) If there are multiple libraries that store elements of the same
>    type, currently they would have to write their own rte_ring APIs. This
>    results in code duplication.
>
> This patch consists of several parts:
> 1) New APIs to support configurable ring element size
>    These will help reduce code duplication in the templates. I think these
>    can be made internal (do not expose to DPDK applications, but expose to
>    DPDK libraries), feedback needed.
>
> 2) rte_ring templates
>    The templates provide an easy way to add new APIs for different ring
>    element types/sizes which can be used by multiple libraries. These
>    also allow for creating APIs to store elements of custom types
>    (for ex: a structure)
>
>    The template needs 4 parameters:
>    a) RTE_RING_TMPLT_API_SUFFIX - This is used as a suffix to the
>       rte_ring APIs.
>       For ex: if RTE_RING_TMPLT_API_SUFFIX is '32b', the API name will be
>       rte_ring_create_32b
>    b) RTE_RING_TMPLT_ELEM_SIZE - Size of the ring element in bytes.
>       For ex: sizeof(uint32_t)
>    c) RTE_RING_TMPLT_ELEM_TYPE - Type of the ring element.
>       For ex: uint32_t. If a common ring library does not use a standard
>       data type, it should create its own type by defining a structure
>       with standard data type. For ex: for an element size of 96b, one
>       could define a structure
>
>       struct s_96b {
>           uint32_t a[3];
>       };
>       The common library can use this structure to define
>       RTE_RING_TMPLT_ELEM_TYPE.
>
>       The application using this common ring library should define its
>       element type as a union with the above structure.
>
>       union app_element_type {
>           struct s_96b v;
>           struct app_element {
>               uint16_t a;
>               uint16_t b;
>               uint32_t c;
>               uint32_t d;
>           }
>       }
>    d) RTE_RING_TMPLT_EXPERIMENTAL - Indicates if the new APIs being defined
>       are experimental. Should be set to empty to remove the experimental
>       tag.
>
>    The ring library consists of some APIs that are defined as inline
>    functions and some APIs that are non-inline functions. The non-inline
>    functions are in rte_ring_template.c. However, this file needs to be
>    included in other .c files. Any feedback on how to handle this is
>    appreciated.
>
>    Note that the templates help create the APIs that are dependent on the
>    element size (for ex: rte_ring_create, enqueue/dequeue etc). Other APIs
>    that do NOT depend on the element size do not need to be part of the
>    template (for ex: rte_ring_dump, rte_ring_count, rte_ring_free_count
>    etc).
>
> 3) APIs for 32b ring element size
>    This uses the templates to create APIs to enqueue/dequeue elements of
>    size 32b.
>
> 4) rte_hash library is changed to use 32b ring APIs
>    The 32b APIs are used in rte_hash library to store the free slot index
>    and free bucket index.
>
> 5) Event Dev changed to use ring templates
>    Event Dev defines its own 128b ring APIs using the templates. This helps
>    in keeping the 'struct rte_event' as is. If Event Dev has to use generic
>    128b ring APIs, it requires 'struct rte_event' to change to
>    'union rte_event' to include a generic data type such as '__int128_t'.
>    This breaks the API compatibility and results in large number of
>    changes.
>    With this change, the event rings are stored on rte_ring's tailq.
>    Event Dev specific ring list is NOT available. IMO, this does not have
>    any impact to the user.
>
> This patch results in following checkpatch issue:
> WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'
>
> However, this patch is following the rules in the existing code. Please
> let me know if this needs to be fixed.
>
> v2
>  - Change Event Ring implementation to use ring templates
>    (Jerin, Pavan)

Since you'll likely be spinning a v3 (to switch off the macroization),
this series seems to have some unit test failures:

   24/82 DPDK:fast-tests / event_ring_autotest   FAIL     0.12 s (exit status 255 or signal 127 SIGinvalid)
   --- command ---
   DPDK_TEST='event_ring_autotest' /home/travis/build/ovsrobot/dpdk/build/app/test/dpdk-test -l 0-1 --file-prefix=event_ring_autotest
   --- stdout ---
   EAL: Probing VFIO support...
   APP: HPET is not enabled, using TSC as default timer
   RTE>>event_ring_autotest
   RING: Requested number of elements is invalid, must be power of 2, and do not exceed the limit 2147483647
   Test detected odd count
   Test detected NULL ring lookup
   RING: Requested number of elements is invalid, must be power of 2, and do not exceed the limit 2147483647
   RING: Requested number of elements is invalid, must be power of 2, and do not exceed the limit 2147483647
   Error, status after enqueue is unexpected
   Test Failed
   RTE>>
   --- stderr ---
   EAL: Detected 2 lcore(s)
   EAL: Detected 1 NUMA nodes
   EAL: Multi-process socket /var/run/dpdk/event_ring_autotest/mp_socket
   EAL: Selected IOVA mode 'PA'
   EAL: No available hugepages reported in hugepages-1048576kB
   -------

Please double check.  Seems to only happen with clang/llvm.
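
One thing that may be worth checking, as an assumption only since the cause of the failure is not identified here: the event ring wrappers in patch 6/6 pass their 'uint16_t *' out-parameters to the generated 128b APIs as '(uint32_t *)'. A 32-bit store through such a pointer writes past the 16-bit object. The standalone sketch below uses hypothetical names (burst_enqueue_u32, burst_enqueue_u16) and a made-up capacity to show a wrapper that narrows through a local 32-bit temporary instead of casting the pointer.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a generated burst API that reports the
 * remaining free space through a 32-bit out-parameter.
 */
static unsigned int
burst_enqueue_u32(unsigned int n, uint32_t *free_space)
{
	const uint32_t capacity = 8; /* illustrative value only */

	if (n > capacity)
		n = capacity;
	if (free_space != NULL)
		*free_space = capacity - n;
	return n;
}

/* Wrapper that keeps a uint16_t out-parameter. The 32-bit result lands
 * in a local variable and is narrowed explicitly, so nothing is written
 * beyond the caller's 16-bit object.
 */
static unsigned int
burst_enqueue_u16(unsigned int n, uint16_t *free_space)
{
	uint32_t space32 = 0;
	unsigned int done = burst_enqueue_u32(n, &space32);

	if (free_space != NULL)
		*free_space = (uint16_t)space32;
	return done;
}
```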

> Honnappa Nagarahalli (6):
>   lib/ring: apis to support configurable element size
>   lib/ring: add template to support different element sizes
>   tools/checkpatch: relax constraints on __rte_experimental
>   lib/ring: add ring APIs to support 32b ring elements
>   lib/hash: use ring with 32b element size to save memory
>   lib/eventdev: use ring templates for event rings
>
>  devtools/checkpatches.sh                  |  11 +-
>  lib/librte_eventdev/Makefile              |   2 +
>  lib/librte_eventdev/meson.build           |   2 +
>  lib/librte_eventdev/rte_event_ring.c      | 146 +---------
>  lib/librte_eventdev/rte_event_ring.h      |  41 +--
>  lib/librte_eventdev/rte_event_ring_128b.c |  19 ++
>  lib/librte_eventdev/rte_event_ring_128b.h |  44 +++
>  lib/librte_hash/rte_cuckoo_hash.c         |  55 ++--
>  lib/librte_hash/rte_cuckoo_hash.h         |   2 +-
>  lib/librte_ring/Makefile                  |   9 +-
>  lib/librte_ring/meson.build               |  11 +-
>  lib/librte_ring/rte_ring.c                |  34 ++-
>  lib/librte_ring/rte_ring.h                |  72 +++++
>  lib/librte_ring/rte_ring_32.c             |  19 ++
>  lib/librte_ring/rte_ring_32.h             |  36 +++
>  lib/librte_ring/rte_ring_template.c       |  46 +++
>  lib/librte_ring/rte_ring_template.h       | 330 ++++++++++++++++++++++
>  lib/librte_ring/rte_ring_version.map      |   4 +
>  18 files changed, 660 insertions(+), 223 deletions(-)
>  create mode 100644 lib/librte_eventdev/rte_event_ring_128b.c
>  create mode 100644 lib/librte_eventdev/rte_event_ring_128b.h
>  create mode 100644 lib/librte_ring/rte_ring_32.c
>  create mode 100644 lib/librte_ring/rte_ring_32.h
>  create mode 100644 lib/librte_ring/rte_ring_template.c
>  create mode 100644 lib/librte_ring/rte_ring_template.h

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v2 2/6] lib/ring: add template to support different element sizes
  2019-09-09  9:01       ` Bruce Richardson
@ 2019-09-09 22:33         ` Honnappa Nagarahalli
  0 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-09-09 22:33 UTC (permalink / raw)
  To: Bruce Richardson, Stephen Hemminger
  Cc: olivier.matz, yipeng1.wang, sameh.gobriel, pablo.de.lara.guarch,
	dev, pbhagavatula, jerinj, Honnappa Nagarahalli, nd, nd


> -----Original Message-----
> From: Bruce Richardson <bruce.richardson@intel.com>
> Sent: Monday, September 9, 2019 4:01 AM
> To: Stephen Hemminger <stephen@networkplumber.org>
> Cc: Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>;
> olivier.matz@6wind.com; yipeng1.wang@intel.com;
> sameh.gobriel@intel.com; pablo.de.lara.guarch@intel.com; dev@dpdk.org;
> pbhagavatula@marvell.com; jerinj@marvell.com
> Subject: Re: [dpdk-dev] [PATCH v2 2/6] lib/ring: add template to support
> different element sizes
> 
> On Sun, Sep 08, 2019 at 08:44:36PM +0100, Stephen Hemminger wrote:
> > On Fri,  6 Sep 2019 14:05:06 -0500
> > Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:
> >
> > > Add templates to support creating ring APIs with different ring
> > > element sizes.
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> >
> > Understand the desire for generic code, but macros are much harder to
> > maintain and debug. Would it be possible to use inline code taking a
> > size argument and let compiler optimizations with constant folding do the
> > same thing?
> >
> > Sorry, I vote NO for large-scale use of macros.
> 
> I would tend to agree. This use of macros makes the code very awkward to
> read and understand.
Stephen, Bruce, thank you for your feedback. It looks like we at least agree on the problem definition; hopefully we can find a solution. I will try to rework this and get back with solutions/problems.

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes
  2019-08-28 14:46 ` [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes Honnappa Nagarahalli
@ 2019-10-01 11:47   ` Ananyev, Konstantin
  2019-10-02  4:21     ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-01 11:47 UTC (permalink / raw)
  To: Honnappa Nagarahalli, olivier.matz, Wang, Yipeng1, Gobriel,
	Sameh, Richardson, Bruce, De Lara Guarch, Pablo
  Cc: dev, dharmik.thakkar, gavin.hu, ruifeng.wang, nd



> 
> 
> Add templates to support creating ring APIs with different
> ring element sizes.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  lib/librte_ring/Makefile            |   4 +-
>  lib/librte_ring/meson.build         |   4 +-
>  lib/librte_ring/rte_ring_template.c |  46 ++++
>  lib/librte_ring/rte_ring_template.h | 330 ++++++++++++++++++++++++++++
>  4 files changed, 382 insertions(+), 2 deletions(-)
>  create mode 100644 lib/librte_ring/rte_ring_template.c
>  create mode 100644 lib/librte_ring/rte_ring_template.h
> 
> diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> index 4c8410229..818898110 100644
> --- a/lib/librte_ring/Makefile
> +++ b/lib/librte_ring/Makefile
> @@ -19,6 +19,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
>  # install includes
>  SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
>  					rte_ring_generic.h \
> -					rte_ring_c11_mem.h
> +					rte_ring_c11_mem.h \
> +					rte_ring_template.h \
> +					rte_ring_template.c
> 
>  include $(RTE_SDK)/mk/rte.lib.mk
> diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
> index 74219840a..e4e208a7c 100644
> --- a/lib/librte_ring/meson.build
> +++ b/lib/librte_ring/meson.build
> @@ -5,7 +5,9 @@ version = 2
>  sources = files('rte_ring.c')
>  headers = files('rte_ring.h',
>  		'rte_ring_c11_mem.h',
> -		'rte_ring_generic.h')
> +		'rte_ring_generic.h',
> +		'rte_ring_template.h',
> +		'rte_ring_template.c')
> 
>  # rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
>  allow_experimental_apis = true
> diff --git a/lib/librte_ring/rte_ring_template.c b/lib/librte_ring/rte_ring_template.c
> new file mode 100644
> index 000000000..1ca593f95
> --- /dev/null
> +++ b/lib/librte_ring/rte_ring_template.c
> @@ -0,0 +1,46 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2019 Arm Limited
> + */
> +
> +#include <stdio.h>
> +#include <stdarg.h>
> +#include <string.h>
> +#include <stdint.h>
> +#include <inttypes.h>
> +#include <errno.h>
> +#include <sys/queue.h>
> +
> +#include <rte_common.h>
> +#include <rte_log.h>
> +#include <rte_memory.h>
> +#include <rte_memzone.h>
> +#include <rte_malloc.h>
> +#include <rte_launch.h>
> +#include <rte_eal.h>
> +#include <rte_eal_memconfig.h>
> +#include <rte_atomic.h>
> +#include <rte_per_lcore.h>
> +#include <rte_lcore.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_errno.h>
> +#include <rte_string_fns.h>
> +#include <rte_spinlock.h>
> +#include <rte_tailq.h>
> +
> +#include "rte_ring.h"
> +
> +/* return the size of memory occupied by a ring */
> +ssize_t
> +__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count)
> +{
> +	return rte_ring_get_memsize_elem(count, RTE_RING_TMPLT_ELEM_SIZE);
> +}
> +
> +/* create the ring */
> +struct rte_ring *
> +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> +		int socket_id, unsigned flags)
> +{
> +	return rte_ring_create_elem(name, count, RTE_RING_TMPLT_ELEM_SIZE,
> +		socket_id, flags);
> +}
> diff --git a/lib/librte_ring/rte_ring_template.h b/lib/librte_ring/rte_ring_template.h
> new file mode 100644
> index 000000000..b9b14dfbb
> --- /dev/null
> +++ b/lib/librte_ring/rte_ring_template.h
> @@ -0,0 +1,330 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright (c) 2019 Arm Limited
> + */
> +
> +#ifndef _RTE_RING_TEMPLATE_H_
> +#define _RTE_RING_TEMPLATE_H_
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stdio.h>
> +#include <stdint.h>
> +#include <sys/queue.h>
> +#include <errno.h>
> +#include <rte_common.h>
> +#include <rte_config.h>
> +#include <rte_memory.h>
> +#include <rte_lcore.h>
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_memzone.h>
> +#include <rte_pause.h>
> +#include <rte_ring.h>
> +
> +/* Ring API suffix name - used to append to API names */
> +#ifndef RTE_RING_TMPLT_API_SUFFIX
> +#error RTE_RING_TMPLT_API_SUFFIX not defined
> +#endif
> +
> +/* Ring's element size in bits, should be a power of 2 */
> +#ifndef RTE_RING_TMPLT_ELEM_SIZE
> +#error RTE_RING_TMPLT_ELEM_SIZE not defined
> +#endif
> +
> +/* Type of ring elements */
> +#ifndef RTE_RING_TMPLT_ELEM_TYPE
> +#error RTE_RING_TMPLT_ELEM_TYPE not defined
> +#endif
> +
> +#define _rte_fuse(a, b) a##_##b
> +#define __rte_fuse(a, b) _rte_fuse(a, b)
> +#define __RTE_RING_CONCAT(a) __rte_fuse(a, RTE_RING_TMPLT_API_SUFFIX)
> +
> +/* Calculate the memory size needed for a ring */
> +RTE_RING_TMPLT_EXPERIMENTAL
> +ssize_t __RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count);
> +
> +/* Create a new ring named *name* in memory. */
> +RTE_RING_TMPLT_EXPERIMENTAL
> +struct rte_ring *
> +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> +					int socket_id, unsigned flags);


Just an idea: the same thing can probably be achieved in a different way.
Instead of all these defines, replace the ENQUEUE_PTRS/DEQUEUE_PTRS macros
with static inline functions, and then make all internal functions, e.g. __rte_ring_do_dequeue(),
accept an enqueue/dequeue function pointer as a parameter.
Then, say, the default rte_ring_mc_dequeue_bulk would do:

rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
                unsigned int n, unsigned int *available)
{
        return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
                        __IS_MC, available, dequeue_ptr_default);
}

Then, if someone would like to define ring functions for elt_size==X, all they would need to do is:
1. Define their own enqueue/dequeue functions.
2. Do something like:
rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
                unsigned int n, unsigned int *available)
{
        return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
                        __IS_MC, available, dequeue_X);
}

Konstantin


> +
> +/**
> + * @internal Enqueue several objects on the ring
> + */
> +static __rte_always_inline unsigned int
> +__RTE_RING_CONCAT(__rte_ring_do_enqueue)(struct rte_ring *r,
> +		RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> +		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> +		unsigned int *free_space)
> +{
> +	uint32_t prod_head, prod_next;
> +	uint32_t free_entries;
> +
> +	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
> +			&prod_head, &prod_next, &free_entries);
> +	if (n == 0)
> +		goto end;
> +
> +	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n,
> +		RTE_RING_TMPLT_ELEM_TYPE);
> +
> +	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> +end:
> +	if (free_space != NULL)
> +		*free_space = free_entries - n;
> +	return n;
> +}
> +
> +/**
> + * @internal Dequeue several objects from the ring
> + */
> +static __rte_always_inline unsigned int
> +__RTE_RING_CONCAT(__rte_ring_do_dequeue)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> +	enum rte_ring_queue_behavior behavior, unsigned int is_sc,
> +	unsigned int *available)
> +{
> +	uint32_t cons_head, cons_next;
> +	uint32_t entries;
> +
> +	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
> +			&cons_head, &cons_next, &entries);
> +	if (n == 0)
> +		goto end;
> +
> +	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n,
> +		RTE_RING_TMPLT_ELEM_TYPE);
> +
> +	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
> +
> +end:
> +	if (available != NULL)
> +		*available = entries - n;
> +	return n;
> +}
> +
> +
> +/**
> + * Enqueue several objects on the ring (multi-producers safe).
> + */
> +static __rte_always_inline unsigned int
> +__RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> +	unsigned int *free_space)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> +			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
> +}
> +
> +/**
> + * Enqueue several objects on a ring (NOT multi-producers safe).
> + */
> +static __rte_always_inline unsigned int
> +__RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> +	unsigned int *free_space)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> +			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
> +}
> +
> +/**
> + * Enqueue several objects on a ring.
> + */
> +static __rte_always_inline unsigned int
> +__RTE_RING_CONCAT(rte_ring_enqueue_bulk)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> +	unsigned int *free_space)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> +			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
> +}
> +
> +/**
> + * Enqueue one object on a ring (multi-producers safe).
> + */
> +static __rte_always_inline int
> +__RTE_RING_CONCAT(rte_ring_mp_enqueue)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE obj)
> +{
> +	return __RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(r, &obj, 1, NULL) ?
> +			0 : -ENOBUFS;
> +}
> +
> +/**
> + * Enqueue one object on a ring (NOT multi-producers safe).
> + */
> +static __rte_always_inline int
> +__RTE_RING_CONCAT(rte_ring_sp_enqueue)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE obj)
> +{
> +	return __RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(r, &obj, 1, NULL) ?
> +			0 : -ENOBUFS;
> +}
> +
> +/**
> + * Enqueue one object on a ring.
> + */
> +static __rte_always_inline int
> +__RTE_RING_CONCAT(rte_ring_enqueue)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj)
> +{
> +	return __RTE_RING_CONCAT(rte_ring_enqueue_bulk)(r, obj, 1, NULL) ?
> +			0 : -ENOBUFS;
> +}
> +
> +/**
> + * Dequeue several objects from a ring (multi-consumers safe).
> + */
> +static __rte_always_inline unsigned int
> +__RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> +	unsigned int *available)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> +			RTE_RING_QUEUE_FIXED, __IS_MC, available);
> +}
> +
> +/**
> + * Dequeue several objects from a ring (NOT multi-consumers safe).
> + */
> +static __rte_always_inline unsigned int
> +__RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> +	unsigned int *available)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> +			RTE_RING_QUEUE_FIXED, __IS_SC, available);
> +}
> +
> +/**
> + * Dequeue several objects from a ring.
> + */
> +static __rte_always_inline unsigned int
> +__RTE_RING_CONCAT(rte_ring_dequeue_bulk)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> +	unsigned int *available)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> +			RTE_RING_QUEUE_FIXED, r->cons.single, available);
> +}
> +
> +/**
> + * Dequeue one object from a ring (multi-consumers safe).
> + */
> +static __rte_always_inline int
> +__RTE_RING_CONCAT(rte_ring_mc_dequeue)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> +{
> +	return __RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(r, obj_p, 1, NULL) ?
> +			0 : -ENOENT;
> +}
> +
> +/**
> + * Dequeue one object from a ring (NOT multi-consumers safe).
> + */
> +static __rte_always_inline int
> +__RTE_RING_CONCAT(rte_ring_sc_dequeue)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> +{
> +	return __RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(r, obj_p, 1, NULL) ?
> +			0 : -ENOENT;
> +}
> +
> +/**
> + * Dequeue one object from a ring.
> + */
> +static __rte_always_inline int
> +__RTE_RING_CONCAT(rte_ring_dequeue)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> +{
> +	return __RTE_RING_CONCAT(rte_ring_dequeue_bulk)(r, obj_p, 1, NULL) ?
> +			0 : -ENOENT;
> +}
> +
> +/**
> + * Enqueue several objects on the ring (multi-producers safe).
> + */
> +static __rte_always_inline unsigned
> +__RTE_RING_CONCAT(rte_ring_mp_enqueue_burst)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
> +			 unsigned int n, unsigned int *free_space)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> +			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
> +}
> +
> +/**
> + * Enqueue several objects on a ring (NOT multi-producers safe).
> + */
> +static __rte_always_inline unsigned
> +__RTE_RING_CONCAT(rte_ring_sp_enqueue_burst)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
> +			 unsigned int n, unsigned int *free_space)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> +			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
> +}
> +
> +/**
> + * Enqueue several objects on a ring.
> + */
> +static __rte_always_inline unsigned
> +__RTE_RING_CONCAT(rte_ring_enqueue_burst)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> +	unsigned int *free_space)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> +			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
> +}
> +
> +/**
> + * Dequeue several objects from a ring (multi-consumers safe). When the request
> + * objects are more than the available objects, only dequeue the actual number
> + * of objects
> + */
> +static __rte_always_inline unsigned
> +__RTE_RING_CONCAT(rte_ring_mc_dequeue_burst)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> +	unsigned int *available)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> +			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
> +}
> +
> +/**
> + * Dequeue several objects from a ring (NOT multi-consumers safe).When the
> + * request objects are more than the available objects, only dequeue the
> + * actual number of objects
> + */
> +static __rte_always_inline unsigned
> +__RTE_RING_CONCAT(rte_ring_sc_dequeue_burst)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> +	unsigned int *available)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> +			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
> +}
> +
> +/**
> + * Dequeue multiple objects from a ring up to a maximum number.
> + */
> +static __rte_always_inline unsigned
> +__RTE_RING_CONCAT(rte_ring_dequeue_burst)(struct rte_ring *r,
> +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> +	unsigned int *available)
> +{
> +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> +				RTE_RING_QUEUE_VARIABLE,
> +				r->cons.single, available);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_RING_TEMPLATE_H_ */
> --
> 2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes
  2019-10-01 11:47   ` Ananyev, Konstantin
@ 2019-10-02  4:21     ` Honnappa Nagarahalli
  2019-10-02  8:39       ` Ananyev, Konstantin
  0 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-02  4:21 UTC (permalink / raw)
  To: Ananyev, Konstantin, olivier.matz, Wang, Yipeng1, Gobriel, Sameh,
	Richardson, Bruce, De Lara Guarch, Pablo
  Cc: dev, Dharmik Thakkar, Gavin Hu (Arm Technology China),
	Ruifeng Wang (Arm Technology China),
	Honnappa Nagarahalli, nd, nd

> > Add templates to support creating ring APIs with different ring
> > element sizes.
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  lib/librte_ring/Makefile            |   4 +-
> >  lib/librte_ring/meson.build         |   4 +-
> >  lib/librte_ring/rte_ring_template.c |  46 ++++
> >  lib/librte_ring/rte_ring_template.h | 330 ++++++++++++++++++++++++++++
> >  4 files changed, 382 insertions(+), 2 deletions(-)
> >  create mode 100644 lib/librte_ring/rte_ring_template.c
> >  create mode 100644 lib/librte_ring/rte_ring_template.h
> >
> > diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> > index 4c8410229..818898110 100644
> > --- a/lib/librte_ring/Makefile
> > +++ b/lib/librte_ring/Makefile
> > @@ -19,6 +19,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> >  # install includes
> >  SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> >  					rte_ring_generic.h \
> > -					rte_ring_c11_mem.h
> > +					rte_ring_c11_mem.h \
> > +					rte_ring_template.h \
> > +					rte_ring_template.c
> >
> >  include $(RTE_SDK)/mk/rte.lib.mk
> > diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
> > index 74219840a..e4e208a7c 100644
> > --- a/lib/librte_ring/meson.build
> > +++ b/lib/librte_ring/meson.build
> > @@ -5,7 +5,9 @@ version = 2
> >  sources = files('rte_ring.c')
> >  headers = files('rte_ring.h',
> >  		'rte_ring_c11_mem.h',
> > -		'rte_ring_generic.h')
> > +		'rte_ring_generic.h',
> > +		'rte_ring_template.h',
> > +		'rte_ring_template.c')
> >
> >  # rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
> >  allow_experimental_apis = true
> > diff --git a/lib/librte_ring/rte_ring_template.c b/lib/librte_ring/rte_ring_template.c
> > new file mode 100644
> > index 000000000..1ca593f95
> > --- /dev/null
> > +++ b/lib/librte_ring/rte_ring_template.c
> > @@ -0,0 +1,46 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright (c) 2019 Arm Limited
> > + */
> > +
> > +#include <stdio.h>
> > +#include <stdarg.h>
> > +#include <string.h>
> > +#include <stdint.h>
> > +#include <inttypes.h>
> > +#include <errno.h>
> > +#include <sys/queue.h>
> > +
> > +#include <rte_common.h>
> > +#include <rte_log.h>
> > +#include <rte_memory.h>
> > +#include <rte_memzone.h>
> > +#include <rte_malloc.h>
> > +#include <rte_launch.h>
> > +#include <rte_eal.h>
> > +#include <rte_eal_memconfig.h>
> > +#include <rte_atomic.h>
> > +#include <rte_per_lcore.h>
> > +#include <rte_lcore.h>
> > +#include <rte_branch_prediction.h>
> > +#include <rte_errno.h>
> > +#include <rte_string_fns.h>
> > +#include <rte_spinlock.h>
> > +#include <rte_tailq.h>
> > +
> > +#include "rte_ring.h"
> > +
> > +/* return the size of memory occupied by a ring */
> > +ssize_t
> > +__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count)
> > +{
> > +	return rte_ring_get_memsize_elem(count, RTE_RING_TMPLT_ELEM_SIZE);
> > +}
> > +
> > +/* create the ring */
> > +struct rte_ring *
> > +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> > +		int socket_id, unsigned flags)
> > +{
> > +	return rte_ring_create_elem(name, count, RTE_RING_TMPLT_ELEM_SIZE,
> > +		socket_id, flags);
> > +}
> > diff --git a/lib/librte_ring/rte_ring_template.h b/lib/librte_ring/rte_ring_template.h
> > new file mode 100644
> > index 000000000..b9b14dfbb
> > --- /dev/null
> > +++ b/lib/librte_ring/rte_ring_template.h
> > @@ -0,0 +1,330 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + * Copyright (c) 2019 Arm Limited
> > + */
> > +
> > +#ifndef _RTE_RING_TEMPLATE_H_
> > +#define _RTE_RING_TEMPLATE_H_
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#include <stdio.h>
> > +#include <stdint.h>
> > +#include <sys/queue.h>
> > +#include <errno.h>
> > +#include <rte_common.h>
> > +#include <rte_config.h>
> > +#include <rte_memory.h>
> > +#include <rte_lcore.h>
> > +#include <rte_atomic.h>
> > +#include <rte_branch_prediction.h>
> > +#include <rte_memzone.h>
> > +#include <rte_pause.h>
> > +#include <rte_ring.h>
> > +
> > +/* Ring API suffix name - used to append to API names */
> > +#ifndef RTE_RING_TMPLT_API_SUFFIX
> > +#error RTE_RING_TMPLT_API_SUFFIX not defined
> > +#endif
> > +
> > +/* Ring's element size in bits, should be a power of 2 */
> > +#ifndef RTE_RING_TMPLT_ELEM_SIZE
> > +#error RTE_RING_TMPLT_ELEM_SIZE not defined
> > +#endif
> > +
> > +/* Type of ring elements */
> > +#ifndef RTE_RING_TMPLT_ELEM_TYPE
> > +#error RTE_RING_TMPLT_ELEM_TYPE not defined
> > +#endif
> > +
> > +#define _rte_fuse(a, b) a##_##b
> > +#define __rte_fuse(a, b) _rte_fuse(a, b)
> > +#define __RTE_RING_CONCAT(a) __rte_fuse(a, RTE_RING_TMPLT_API_SUFFIX)
> > +
> > +/* Calculate the memory size needed for a ring */
> > +RTE_RING_TMPLT_EXPERIMENTAL
> > +ssize_t __RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count);
> > +
> > +/* Create a new ring named *name* in memory. */
> > +RTE_RING_TMPLT_EXPERIMENTAL
> > +struct rte_ring *
> > +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> > +					int socket_id, unsigned flags);
> 
> 
> Just an idea: the same thing can probably be achieved in a different way.
> Instead of all these defines, replace the ENQUEUE_PTRS/DEQUEUE_PTRS macros
> with static inline functions, and then make all internal functions, e.g.
> __rte_ring_do_dequeue(),
> accept an enqueue/dequeue function pointer as a parameter.
> Then, say, the default rte_ring_mc_dequeue_bulk would do:
> 
> rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
>                 unsigned int n, unsigned int *available)
> {
>         return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
>                         __IS_MC, available, dequeue_ptr_default);
> }
> 
> Then, if someone would like to define ring functions for elt_size==X, all they
> would need to do is:
> 1. Define their own enqueue/dequeue functions.
> 2. Do something like:
> rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
>                 unsigned int n, unsigned int *available)
> {
>         return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
>                         __IS_MC, available, dequeue_X);
> }
> 
> Konstantin
Thanks for the feedback/idea. The goal of this patch was to make it simple to define APIs that store elements of any size without code duplication. With this patch, the user has to write ~4 lines of code to get APIs for any element size. I would like to keep that goal.

If we have to avoid the macro-fest, the main problem that needs to be addressed is how to represent different element sizes in a generic way. IMO, we can do this by requiring the element size to be a multiple of uint32_t (I do not think we need to go down to uint16_t).

For example:
rte_ring_mp_enqueue_bulk_objs(struct rte_ring *r,
                uint32_t *obj_table, unsigned int num_objs,
                unsigned int n,
                enum rte_ring_queue_behavior behavior, unsigned int is_sp,
                unsigned int *free_space)
{
}

This approach would give us APIs that are generic enough to be used for elements of any size. The only restriction is that the element size must be a multiple of 32b, which I think should not be a concern.

The API suffix definitely needs to be better; any suggestions?

> 
> 
> > +
> > +/**
> > + * @internal Enqueue several objects on the ring
> > + */
> > +static __rte_always_inline unsigned int
> > +__RTE_RING_CONCAT(__rte_ring_do_enqueue)(struct rte_ring *r,
> > +		RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > +		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > +		unsigned int *free_space)
> > +{
> > +	uint32_t prod_head, prod_next;
> > +	uint32_t free_entries;
> > +
> > +	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
> > +			&prod_head, &prod_next, &free_entries);
> > +	if (n == 0)
> > +		goto end;
> > +
> > +	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n,
> > +		RTE_RING_TMPLT_ELEM_TYPE);
> > +
> > +	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> > +end:
> > +	if (free_space != NULL)
> > +		*free_space = free_entries - n;
> > +	return n;
> > +}
> > +
> > +/**
> > + * @internal Dequeue several objects from the ring
> > + */
> > +static __rte_always_inline unsigned int
> > +__RTE_RING_CONCAT(__rte_ring_do_dequeue)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > +	enum rte_ring_queue_behavior behavior, unsigned int is_sc,
> > +	unsigned int *available)
> > +{
> > +	uint32_t cons_head, cons_next;
> > +	uint32_t entries;
> > +
> > +	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
> > +			&cons_head, &cons_next, &entries);
> > +	if (n == 0)
> > +		goto end;
> > +
> > +	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n,
> > +		RTE_RING_TMPLT_ELEM_TYPE);
> > +
> > +	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
> > +
> > +end:
> > +	if (available != NULL)
> > +		*available = entries - n;
> > +	return n;
> > +}
> > +
> > +
> > +/**
> > + * Enqueue several objects on the ring (multi-producers safe).
> > + */
> > +static __rte_always_inline unsigned int
> > +__RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > +	unsigned int *free_space)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
> > +}
> > +
> > +/**
> > + * Enqueue several objects on a ring (NOT multi-producers safe).
> > + */
> > +static __rte_always_inline unsigned int
> > +__RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > +	unsigned int *free_space)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
> > +}
> > +
> > +/**
> > + * Enqueue several objects on a ring.
> > + */
> > +static __rte_always_inline unsigned int
> > +__RTE_RING_CONCAT(rte_ring_enqueue_bulk)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > +	unsigned int *free_space)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
> > +}
> > +
> > +/**
> > + * Enqueue one object on a ring (multi-producers safe).
> > + */
> > +static __rte_always_inline int
> > +__RTE_RING_CONCAT(rte_ring_mp_enqueue)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE obj)
> > +{
> > +	return __RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(r, &obj, 1, NULL) ?
> > +			0 : -ENOBUFS;
> > +}
> > +
> > +/**
> > + * Enqueue one object on a ring (NOT multi-producers safe).
> > + */
> > +static __rte_always_inline int
> > +__RTE_RING_CONCAT(rte_ring_sp_enqueue)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE obj)
> > +{
> > +	return __RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(r, &obj, 1, NULL) ?
> > +			0 : -ENOBUFS;
> > +}
> > +
> > +/**
> > + * Enqueue one object on a ring.
> > + */
> > +static __rte_always_inline int
> > +__RTE_RING_CONCAT(rte_ring_enqueue)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj)
> > +{
> > +	return __RTE_RING_CONCAT(rte_ring_enqueue_bulk)(r, obj, 1, NULL) ?
> > +			0 : -ENOBUFS;
> > +}
> > +
> > +/**
> > + * Dequeue several objects from a ring (multi-consumers safe).
> > + */
> > +static __rte_always_inline unsigned int
> > +__RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > +	unsigned int *available)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_FIXED, __IS_MC, available);
> > +}
> > +
> > +/**
> > + * Dequeue several objects from a ring (NOT multi-consumers safe).
> > + */
> > +static __rte_always_inline unsigned int
> > +__RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > +	unsigned int *available)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_FIXED, __IS_SC, available);
> > +}
> > +
> > +/**
> > + * Dequeue several objects from a ring.
> > + */
> > +static __rte_always_inline unsigned int
> > +__RTE_RING_CONCAT(rte_ring_dequeue_bulk)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > +	unsigned int *available)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_FIXED, r->cons.single, available);
> > +}
> > +
> > +/**
> > + * Dequeue one object from a ring (multi-consumers safe).
> > + */
> > +static __rte_always_inline int
> > +__RTE_RING_CONCAT(rte_ring_mc_dequeue)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> > +{
> > +	return __RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(r, obj_p, 1, NULL) ?
> > +			0 : -ENOENT;
> > +}
> > +
> > +/**
> > + * Dequeue one object from a ring (NOT multi-consumers safe).
> > + */
> > +static __rte_always_inline int
> > +__RTE_RING_CONCAT(rte_ring_sc_dequeue)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> > +{
> > +	return __RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(r, obj_p, 1, NULL) ?
> > +			0 : -ENOENT;
> > +}
> > +
> > +/**
> > + * Dequeue one object from a ring.
> > + */
> > +static __rte_always_inline int
> > +__RTE_RING_CONCAT(rte_ring_dequeue)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> > +{
> > +	return __RTE_RING_CONCAT(rte_ring_dequeue_bulk)(r, obj_p, 1, NULL) ?
> > +			0 : -ENOENT;
> > +}
> > +
> > +/**
> > + * Enqueue several objects on the ring (multi-producers safe).
> > + */
> > +static __rte_always_inline unsigned
> > +__RTE_RING_CONCAT(rte_ring_mp_enqueue_burst)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
> > +			 unsigned int n, unsigned int *free_space)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
> > +}
> > +
> > +/**
> > + * Enqueue several objects on a ring (NOT multi-producers safe).
> > + */
> > +static __rte_always_inline unsigned
> > +__RTE_RING_CONCAT(rte_ring_sp_enqueue_burst)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
> > +			 unsigned int n, unsigned int *free_space)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
> > +}
> > +
> > +/**
> > + * Enqueue several objects on a ring.
> > + */
> > +static __rte_always_inline unsigned
> > +__RTE_RING_CONCAT(rte_ring_enqueue_burst)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > +	unsigned int *free_space)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
> > +}
> > +
> > +/**
> > + * Dequeue several objects from a ring (multi-consumers safe). When the request
> > + * objects are more than the available objects, only dequeue the actual number
> > + * of objects
> > + */
> > +static __rte_always_inline unsigned
> > +__RTE_RING_CONCAT(rte_ring_mc_dequeue_burst)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > +	unsigned int *available)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
> > +}
> > +
> > +/**
> > + * Dequeue several objects from a ring (NOT multi-consumers safe).When the
> > + * request objects are more than the available objects, only dequeue the
> > + * actual number of objects
> > + */
> > +static __rte_always_inline unsigned
> > +__RTE_RING_CONCAT(rte_ring_sc_dequeue_burst)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > +	unsigned int *available)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > +			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
> > +}
> > +
> > +/**
> > + * Dequeue multiple objects from a ring up to a maximum number.
> > + */
> > +static __rte_always_inline unsigned
> > +__RTE_RING_CONCAT(rte_ring_dequeue_burst)(struct rte_ring *r,
> > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > +	unsigned int *available)
> > +{
> > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > +				RTE_RING_QUEUE_VARIABLE,
> > +				r->cons.single, available);
> > +}
> > +
> > +#ifdef __cplusplus
> > +}
> > +#endif
> > +
> > +#endif /* _RTE_RING_TEMPLATE_H_ */
> > --
> > 2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes
  2019-10-02  4:21     ` Honnappa Nagarahalli
@ 2019-10-02  8:39       ` Ananyev, Konstantin
  2019-10-03  3:33         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-02  8:39 UTC (permalink / raw)
  To: Honnappa Nagarahalli, olivier.matz, Wang, Yipeng1, Gobriel,
	Sameh, Richardson, Bruce, De Lara Guarch, Pablo
  Cc: dev, Dharmik Thakkar, Gavin Hu (Arm Technology China),
	Ruifeng Wang (Arm Technology China),
	nd, nd



> -----Original Message-----
> From: Honnappa Nagarahalli [mailto:Honnappa.Nagarahalli@arm.com]
> Sent: Wednesday, October 2, 2019 5:22 AM
> To: Ananyev, Konstantin <konstantin.ananyev@intel.com>; olivier.matz@6wind.com; Wang, Yipeng1 <yipeng1.wang@intel.com>; Gobriel,
> Sameh <sameh.gobriel@intel.com>; Richardson, Bruce <bruce.richardson@intel.com>; De Lara Guarch, Pablo
> <pablo.de.lara.guarch@intel.com>
> Cc: dev@dpdk.org; Dharmik Thakkar <Dharmik.Thakkar@arm.com>; Gavin Hu (Arm Technology China) <Gavin.Hu@arm.com>; Ruifeng
> Wang (Arm Technology China) <Ruifeng.Wang@arm.com>; Honnappa Nagarahalli <Honnappa.Nagarahalli@arm.com>; nd
> <nd@arm.com>; nd <nd@arm.com>
> Subject: RE: [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes
> 
> > > Add templates to support creating ring APIs with different ring
> > > element sizes.
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > >  lib/librte_ring/Makefile            |   4 +-
> > >  lib/librte_ring/meson.build         |   4 +-
> > >  lib/librte_ring/rte_ring_template.c |  46 ++++
> > >  lib/librte_ring/rte_ring_template.h | 330 ++++++++++++++++++++++++++++
> > >  4 files changed, 382 insertions(+), 2 deletions(-)
> > >  create mode 100644 lib/librte_ring/rte_ring_template.c
> > >  create mode 100644 lib/librte_ring/rte_ring_template.h
> > >
> > > diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> > > index 4c8410229..818898110 100644
> > > --- a/lib/librte_ring/Makefile
> > > +++ b/lib/librte_ring/Makefile
> > > @@ -19,6 +19,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> > >  # install includes
> > >  SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> > >  					rte_ring_generic.h \
> > > -					rte_ring_c11_mem.h
> > > +					rte_ring_c11_mem.h \
> > > +					rte_ring_template.h \
> > > +					rte_ring_template.c
> > >
> > >  include $(RTE_SDK)/mk/rte.lib.mk
> > > diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
> > > index 74219840a..e4e208a7c 100644
> > > --- a/lib/librte_ring/meson.build
> > > +++ b/lib/librte_ring/meson.build
> > > @@ -5,7 +5,9 @@ version = 2
> > >  sources = files('rte_ring.c')
> > >  headers = files('rte_ring.h',
> > >  		'rte_ring_c11_mem.h',
> > > -		'rte_ring_generic.h')
> > > +		'rte_ring_generic.h',
> > > +		'rte_ring_template.h',
> > > +		'rte_ring_template.c')
> > >
> > >  # rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
> > >  allow_experimental_apis = true
> > > diff --git a/lib/librte_ring/rte_ring_template.c b/lib/librte_ring/rte_ring_template.c
> > > new file mode 100644
> > > index 000000000..1ca593f95
> > > --- /dev/null
> > > +++ b/lib/librte_ring/rte_ring_template.c
> > > @@ -0,0 +1,46 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright (c) 2019 Arm Limited
> > > + */
> > > +
> > > +#include <stdio.h>
> > > +#include <stdarg.h>
> > > +#include <string.h>
> > > +#include <stdint.h>
> > > +#include <inttypes.h>
> > > +#include <errno.h>
> > > +#include <sys/queue.h>
> > > +
> > > +#include <rte_common.h>
> > > +#include <rte_log.h>
> > > +#include <rte_memory.h>
> > > +#include <rte_memzone.h>
> > > +#include <rte_malloc.h>
> > > +#include <rte_launch.h>
> > > +#include <rte_eal.h>
> > > +#include <rte_eal_memconfig.h>
> > > +#include <rte_atomic.h>
> > > +#include <rte_per_lcore.h>
> > > +#include <rte_lcore.h>
> > > +#include <rte_branch_prediction.h>
> > > +#include <rte_errno.h>
> > > +#include <rte_string_fns.h>
> > > +#include <rte_spinlock.h>
> > > +#include <rte_tailq.h>
> > > +
> > > +#include "rte_ring.h"
> > > +
> > > +/* return the size of memory occupied by a ring */
> > > +ssize_t
> > > +__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count)
> > > +{
> > > +	return rte_ring_get_memsize_elem(count, RTE_RING_TMPLT_ELEM_SIZE);
> > > +}
> > > +
> > > +/* create the ring */
> > > +struct rte_ring *
> > > +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> > > +		int socket_id, unsigned flags)
> > > +{
> > > +	return rte_ring_create_elem(name, count, RTE_RING_TMPLT_ELEM_SIZE,
> > > +		socket_id, flags);
> > > +}
> > > diff --git a/lib/librte_ring/rte_ring_template.h b/lib/librte_ring/rte_ring_template.h
> > > new file mode 100644
> > > index 000000000..b9b14dfbb
> > > --- /dev/null
> > > +++ b/lib/librte_ring/rte_ring_template.h
> > > @@ -0,0 +1,330 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + * Copyright (c) 2019 Arm Limited
> > > + */
> > > +
> > > +#ifndef _RTE_RING_TEMPLATE_H_
> > > +#define _RTE_RING_TEMPLATE_H_
> > > +
> > > +#ifdef __cplusplus
> > > +extern "C" {
> > > +#endif
> > > +
> > > +#include <stdio.h>
> > > +#include <stdint.h>
> > > +#include <sys/queue.h>
> > > +#include <errno.h>
> > > +#include <rte_common.h>
> > > +#include <rte_config.h>
> > > +#include <rte_memory.h>
> > > +#include <rte_lcore.h>
> > > +#include <rte_atomic.h>
> > > +#include <rte_branch_prediction.h>
> > > +#include <rte_memzone.h>
> > > +#include <rte_pause.h>
> > > +#include <rte_ring.h>
> > > +
> > > +/* Ring API suffix name - used to append to API names */
> > > +#ifndef RTE_RING_TMPLT_API_SUFFIX
> > > +#error RTE_RING_TMPLT_API_SUFFIX not defined
> > > +#endif
> > > +
> > > +/* Ring's element size in bits, should be a power of 2 */
> > > +#ifndef RTE_RING_TMPLT_ELEM_SIZE
> > > +#error RTE_RING_TMPLT_ELEM_SIZE not defined
> > > +#endif
> > > +
> > > +/* Type of ring elements */
> > > +#ifndef RTE_RING_TMPLT_ELEM_TYPE
> > > +#error RTE_RING_TMPLT_ELEM_TYPE not defined
> > > +#endif
> > > +
> > > +#define _rte_fuse(a, b) a##_##b
> > > +#define __rte_fuse(a, b) _rte_fuse(a, b)
> > > +#define __RTE_RING_CONCAT(a) __rte_fuse(a, RTE_RING_TMPLT_API_SUFFIX)
> > > +
> > > +/* Calculate the memory size needed for a ring */
> > > +RTE_RING_TMPLT_EXPERIMENTAL ssize_t
> > > +__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count);
> > > +
> > > +/* Create a new ring named *name* in memory. */
> > > +RTE_RING_TMPLT_EXPERIMENTAL struct rte_ring *
> > > +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> > > +					int socket_id, unsigned flags);
> >
> >
> > Just an idea - probably the same thing can be achieved in a different way.
> > Instead of all these defines - replace ENQUEUE_PTRS/DEQUEUE_PTRS macros
> > with static inline functions and then make all internal functions,
> > i.e. __rte_ring_do_dequeue(),
> > accept an enqueue/dequeue function pointer as a parameter.
> > Then, let's say, the default rte_ring_mc_dequeue_bulk will do:
> >
> > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> >                 unsigned int n, unsigned int *available)
> > {
> >         return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> >                         __IS_MC, available, dequeue_ptr_default);
> > }
> >
> > Then if someone would like to define ring functions for elt_size==X, all he would
> > need to do is:
> > 1. define his own enqueue/dequeue functions.
> > 2. do something like:
> > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> >                 unsigned int n, unsigned int *available)
> > {
> >         return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> >                         __IS_MC, available, dequeue_X);
> > }
> >
> > Konstantin
> Thanks for the feedback/idea. The goal of this patch was to make it simple enough to define APIs to store any element size without code
> duplication. 

Well, then if we store elt_size inside the ring, it should be easy enough
to add generic functions to the API that would use memcpy (or rte_memcpy) for enqueue/dequeue.
Yes, it might be slower than the existing code (8B per elem), but it might still be acceptable.
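
A minimal sketch of the stored-element-size idea, for illustration only. It is a single-threaded toy ring, not the rte_ring layout or API: the names toy_ring, toy_ring_create, toy_ring_free, toy_ring_enqueue_burst and toy_ring_dequeue_burst are hypothetical, and plain memcpy stands in for rte_memcpy. It only shows that once elt_size lives in the ring header, one pair of functions can serve any element size:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical single-threaded ring; NOT the rte_ring layout. */
struct toy_ring {
	uint32_t size;     /* number of slots, power of 2 */
	uint32_t mask;     /* size - 1 */
	uint32_t elt_size; /* bytes per element, fixed at create time */
	uint32_t head;     /* free-running producer index */
	uint32_t tail;     /* free-running consumer index */
	uint8_t *slots;    /* size * elt_size bytes of storage */
};

static struct toy_ring *
toy_ring_create(uint32_t size, uint32_t elt_size)
{
	struct toy_ring *r;

	/* size must be a non-zero power of 2 so that index masking works */
	if (size == 0 || (size & (size - 1)) != 0 || elt_size == 0)
		return NULL;
	r = calloc(1, sizeof(*r));
	if (r == NULL)
		return NULL;
	r->slots = calloc(size, elt_size);
	if (r->slots == NULL) {
		free(r);
		return NULL;
	}
	r->size = size;
	r->mask = size - 1;
	r->elt_size = elt_size;
	return r;
}

static void
toy_ring_free(struct toy_ring *r)
{
	if (r != NULL) {
		free(r->slots);
		free(r);
	}
}

/* Copy up to n elements into the ring; returns the number enqueued. */
static unsigned int
toy_ring_enqueue_burst(struct toy_ring *r, const void *objs, unsigned int n)
{
	uint32_t free_entries;
	const uint8_t *src = objs;
	unsigned int i;

	assert(r != NULL && objs != NULL);
	free_entries = r->size - (r->head - r->tail);
	if (n > free_entries)
		n = free_entries;
	/* the copy length comes from the ring, not from the function name */
	for (i = 0; i < n; i++)
		memcpy(r->slots + (size_t)((r->head + i) & r->mask) * r->elt_size,
		       src + (size_t)i * r->elt_size, r->elt_size);
	r->head += n;
	return n;
}

/* Copy up to n elements out of the ring; returns the number dequeued. */
static unsigned int
toy_ring_dequeue_burst(struct toy_ring *r, void *objs, unsigned int n)
{
	uint32_t entries;
	uint8_t *dst = objs;
	unsigned int i;

	assert(r != NULL && objs != NULL);
	entries = r->head - r->tail;
	if (n > entries)
		n = entries;
	for (i = 0; i < n; i++)
		memcpy(dst + (size_t)i * r->elt_size,
		       r->slots + (size_t)((r->tail + i) & r->mask) * r->elt_size,
		       r->elt_size);
	r->tail += n;
	return n;
}
```

The copy length is read from r->elt_size on every call, which is the run-time cost being weighed in this discussion.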

> With this patch, the user has to write ~4 lines of code to get APIs for any element size. I would like to keep
> that goal.
> 
> If we have to avoid the macro-fest, the main problem that needs to be addressed is - how to represent different sizes of element types in a
> generic way? IMO, we can do this by defining the element type to be a multiple of uint32_t (I do not think we need to go to uint16_t).
> 
> For ex:
> rte_ring_mp_enqueue_bulk_objs(struct rte_ring *r,
>                 uint32_t *obj_table, unsigned int num_objs,
>                 unsigned int n,
>                 enum rte_ring_queue_behavior behavior, unsigned int is_sp,
>                 unsigned int *free_space)
> {
> }
> 
> This approach would ensure that we have generic enough APIs and they can be used for elements of any size. But the element itself needs
> to be a multiple of 32b - I think this should not be a concern.
> 
> The API suffix definitely needs to be better, any suggestions?

> 
> >
> >
> > > +
> > > +/**
> > > + * @internal Enqueue several objects on the ring
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__RTE_RING_CONCAT(__rte_ring_do_enqueue)(struct rte_ring *r,
> > > +		RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > > +		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > > +		unsigned int *free_space)
> > > +{
> > > +	uint32_t prod_head, prod_next;
> > > +	uint32_t free_entries;
> > > +
> > > +	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
> > > +			&prod_head, &prod_next, &free_entries);
> > > +	if (n == 0)
> > > +		goto end;
> > > +
> > > +	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n,
> > > +		RTE_RING_TMPLT_ELEM_TYPE);
> > > +
> > > +	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> > > +end:
> > > +	if (free_space != NULL)
> > > +		*free_space = free_entries - n;
> > > +	return n;
> > > +}
> > > +
> > > +/**
> > > + * @internal Dequeue several objects from the ring
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__RTE_RING_CONCAT(__rte_ring_do_dequeue)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > +	enum rte_ring_queue_behavior behavior, unsigned int is_sc,
> > > +	unsigned int *available)
> > > +{
> > > +	uint32_t cons_head, cons_next;
> > > +	uint32_t entries;
> > > +
> > > +	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
> > > +			&cons_head, &cons_next, &entries);
> > > +	if (n == 0)
> > > +		goto end;
> > > +
> > > +	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n,
> > > +		RTE_RING_TMPLT_ELEM_TYPE);
> > > +
> > > +	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
> > > +
> > > +end:
> > > +	if (available != NULL)
> > > +		*available = entries - n;
> > > +	return n;
> > > +}
> > > +
> > > +
> > > +/**
> > > + * Enqueue several objects on the ring (multi-producers safe).
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > > +	unsigned int *free_space)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
> > > +}
> > > +
> > > +/**
> > > + * Enqueue several objects on a ring (NOT multi-producers safe).
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > > +	unsigned int *free_space)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
> > > +}
> > > +
> > > +/**
> > > + * Enqueue several objects on a ring.
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__RTE_RING_CONCAT(rte_ring_enqueue_bulk)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > > +	unsigned int *free_space)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
> > > +}
> > > +
> > > +/**
> > > + * Enqueue one object on a ring (multi-producers safe).
> > > + */
> > > +static __rte_always_inline int
> > > +__RTE_RING_CONCAT(rte_ring_mp_enqueue)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE obj)
> > > +{
> > > +	return __RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(r, &obj, 1, NULL) ?
> > > +			0 : -ENOBUFS;
> > > +}
> > > +
> > > +/**
> > > + * Enqueue one object on a ring (NOT multi-producers safe).
> > > + */
> > > +static __rte_always_inline int
> > > +__RTE_RING_CONCAT(rte_ring_sp_enqueue)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE obj)
> > > +{
> > > +	return __RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(r, &obj, 1, NULL) ?
> > > +			0 : -ENOBUFS;
> > > +}
> > > +
> > > +/**
> > > + * Enqueue one object on a ring.
> > > + */
> > > +static __rte_always_inline int
> > > +__RTE_RING_CONCAT(rte_ring_enqueue)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj)
> > > +{
> > > +	return __RTE_RING_CONCAT(rte_ring_enqueue_bulk)(r, obj, 1, NULL) ?
> > > +			0 : -ENOBUFS;
> > > +}
> > > +
> > > +/**
> > > + * Dequeue several objects from a ring (multi-consumers safe).
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > +	unsigned int *available)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_FIXED, __IS_MC, available);
> > > +}
> > > +
> > > +/**
> > > + * Dequeue several objects from a ring (NOT multi-consumers safe).
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > +	unsigned int *available)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_FIXED, __IS_SC, available);
> > > +}
> > > +
> > > +/**
> > > + * Dequeue several objects from a ring.
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__RTE_RING_CONCAT(rte_ring_dequeue_bulk)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > +	unsigned int *available)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_FIXED, r->cons.single, available);
> > > +}
> > > +
> > > +/**
> > > + * Dequeue one object from a ring (multi-consumers safe).
> > > + */
> > > +static __rte_always_inline int
> > > +__RTE_RING_CONCAT(rte_ring_mc_dequeue)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> > > +{
> > > +	return __RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(r, obj_p, 1, NULL) ?
> > > +			0 : -ENOENT;
> > > +}
> > > +
> > > +/**
> > > + * Dequeue one object from a ring (NOT multi-consumers safe).
> > > + */
> > > +static __rte_always_inline int
> > > +__RTE_RING_CONCAT(rte_ring_sc_dequeue)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> > > +{
> > > +	return __RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(r, obj_p, 1, NULL) ?
> > > +			0 : -ENOENT;
> > > +}
> > > +
> > > +/**
> > > + * Dequeue one object from a ring.
> > > + */
> > > +static __rte_always_inline int
> > > +__RTE_RING_CONCAT(rte_ring_dequeue)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> > > +{
> > > +	return __RTE_RING_CONCAT(rte_ring_dequeue_bulk)(r, obj_p, 1, NULL) ?
> > > +			0 : -ENOENT;
> > > +}
> > > +
> > > +/**
> > > + * Enqueue several objects on the ring (multi-producers safe).
> > > + */
> > > +static __rte_always_inline unsigned
> > > +__RTE_RING_CONCAT(rte_ring_mp_enqueue_burst)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
> > > +			 unsigned int n, unsigned int *free_space)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
> > > +}
> > > +
> > > +/**
> > > + * Enqueue several objects on a ring (NOT multi-producers safe).
> > > + */
> > > +static __rte_always_inline unsigned
> > > +__RTE_RING_CONCAT(rte_ring_sp_enqueue_burst)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
> > > +			 unsigned int n, unsigned int *free_space)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
> > > +}
> > > +
> > > +/**
> > > + * Enqueue several objects on a ring.
> > > + */
> > > +static __rte_always_inline unsigned
> > > +__RTE_RING_CONCAT(rte_ring_enqueue_burst)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > +	unsigned int *free_space)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
> > > +}
> > > +
> > > +/**
> > > + * Dequeue several objects from a ring (multi-consumers safe). When the request
> > > + * objects are more than the available objects, only dequeue the actual number
> > > + * of objects
> > > + */
> > > +static __rte_always_inline unsigned
> > > +__RTE_RING_CONCAT(rte_ring_mc_dequeue_burst)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > +	unsigned int *available)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
> > > +}
> > > +
> > > +/**
> > > + * Dequeue several objects from a ring (NOT multi-consumers safe).When the
> > > + * request objects are more than the available objects, only dequeue the
> > > + * actual number of objects
> > > + */
> > > +static __rte_always_inline unsigned
> > > +__RTE_RING_CONCAT(rte_ring_sc_dequeue_burst)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > +	unsigned int *available)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > +			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
> > > +}
> > > +
> > > +/**
> > > + * Dequeue multiple objects from a ring up to a maximum number.
> > > + */
> > > +static __rte_always_inline unsigned
> > > +__RTE_RING_CONCAT(rte_ring_dequeue_burst)(struct rte_ring *r,
> > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > +	unsigned int *available)
> > > +{
> > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > +				RTE_RING_QUEUE_VARIABLE,
> > > +				r->cons.single, available);
> > > +}
> > > +
> > > +#ifdef __cplusplus
> > > +}
> > > +#endif
> > > +
> > > +#endif /* _RTE_RING_TEMPLATE_H_ */
> > > --
> > > 2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes
  2019-10-02  8:39       ` Ananyev, Konstantin
@ 2019-10-03  3:33         ` Honnappa Nagarahalli
  2019-10-03 11:51           ` Ananyev, Konstantin
  0 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-03  3:33 UTC (permalink / raw)
  To: Ananyev, Konstantin, olivier.matz, Wang, Yipeng1, Gobriel, Sameh,
	Richardson, Bruce, De Lara Guarch, Pablo
  Cc: dev, Dharmik Thakkar, Gavin Hu (Arm Technology China),
	Ruifeng Wang (Arm Technology China),
	nd, nd, nd

<snip>

> >
> > > > Add templates to support creating ring APIs with different ring
> > > > element sizes.
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > ---
> > > >  lib/librte_ring/Makefile            |   4 +-
> > > >  lib/librte_ring/meson.build         |   4 +-
> > > >  lib/librte_ring/rte_ring_template.c |  46 ++++
> > > >  lib/librte_ring/rte_ring_template.h | 330 ++++++++++++++++++++++++++++
> > > >  4 files changed, 382 insertions(+), 2 deletions(-)
> > > >  create mode 100644 lib/librte_ring/rte_ring_template.c
> > > >  create mode 100644 lib/librte_ring/rte_ring_template.h
> > > >
> > > > diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> > > > index 4c8410229..818898110 100644
> > > > --- a/lib/librte_ring/Makefile
> > > > +++ b/lib/librte_ring/Makefile
> > > > @@ -19,6 +19,8 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> > > >  # install includes
> > > >  SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> > > >  					rte_ring_generic.h \
> > > > -					rte_ring_c11_mem.h
> > > > +					rte_ring_c11_mem.h \
> > > > +					rte_ring_template.h \
> > > > +					rte_ring_template.c
> > > >
> > > >  include $(RTE_SDK)/mk/rte.lib.mk
> > > > diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
> > > > index 74219840a..e4e208a7c 100644
> > > > --- a/lib/librte_ring/meson.build
> > > > +++ b/lib/librte_ring/meson.build
> > > > @@ -5,7 +5,9 @@ version = 2
> > > >  sources = files('rte_ring.c')
> > > >  headers = files('rte_ring.h',
> > > >  		'rte_ring_c11_mem.h',
> > > > -		'rte_ring_generic.h')
> > > > +		'rte_ring_generic.h',
> > > > +		'rte_ring_template.h',
> > > > +		'rte_ring_template.c')
> > > >
> > > >  # rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
> > > >  allow_experimental_apis = true
> > > > diff --git a/lib/librte_ring/rte_ring_template.c b/lib/librte_ring/rte_ring_template.c
> > > > new file mode 100644
> > > > index 000000000..1ca593f95
> > > > --- /dev/null
> > > > +++ b/lib/librte_ring/rte_ring_template.c
> > > > @@ -0,0 +1,46 @@
> > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > + * Copyright (c) 2019 Arm Limited
> > > > + */
> > > > +
> > > > +#include <stdio.h>
> > > > +#include <stdarg.h>
> > > > +#include <string.h>
> > > > +#include <stdint.h>
> > > > +#include <inttypes.h>
> > > > +#include <errno.h>
> > > > +#include <sys/queue.h>
> > > > +
> > > > +#include <rte_common.h>
> > > > +#include <rte_log.h>
> > > > +#include <rte_memory.h>
> > > > +#include <rte_memzone.h>
> > > > +#include <rte_malloc.h>
> > > > +#include <rte_launch.h>
> > > > +#include <rte_eal.h>
> > > > +#include <rte_eal_memconfig.h>
> > > > +#include <rte_atomic.h>
> > > > +#include <rte_per_lcore.h>
> > > > +#include <rte_lcore.h>
> > > > +#include <rte_branch_prediction.h>
> > > > +#include <rte_errno.h>
> > > > +#include <rte_string_fns.h>
> > > > +#include <rte_spinlock.h>
> > > > +#include <rte_tailq.h>
> > > > +
> > > > +#include "rte_ring.h"
> > > > +
> > > > +/* return the size of memory occupied by a ring */
> > > > +ssize_t
> > > > +__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count)
> > > > +{
> > > > +	return rte_ring_get_memsize_elem(count, RTE_RING_TMPLT_ELEM_SIZE);
> > > > +}
> > > > +
> > > > +/* create the ring */
> > > > +struct rte_ring *
> > > > +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> > > > +		int socket_id, unsigned flags)
> > > > +{
> > > > +	return rte_ring_create_elem(name, count, RTE_RING_TMPLT_ELEM_SIZE,
> > > > +		socket_id, flags);
> > > > +}
> > > > diff --git a/lib/librte_ring/rte_ring_template.h
> > > > b/lib/librte_ring/rte_ring_template.h
> > > > new file mode 100644
> > > > index 000000000..b9b14dfbb
> > > > --- /dev/null
> > > > +++ b/lib/librte_ring/rte_ring_template.h
> > > > @@ -0,0 +1,330 @@
> > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > + * Copyright (c) 2019 Arm Limited
> > > > + */
> > > > +
> > > > +#ifndef _RTE_RING_TEMPLATE_H_
> > > > +#define _RTE_RING_TEMPLATE_H_
> > > > +
> > > > +#ifdef __cplusplus
> > > > +extern "C" {
> > > > +#endif
> > > > +
> > > > +#include <stdio.h>
> > > > +#include <stdint.h>
> > > > +#include <sys/queue.h>
> > > > +#include <errno.h>
> > > > +#include <rte_common.h>
> > > > +#include <rte_config.h>
> > > > +#include <rte_memory.h>
> > > > +#include <rte_lcore.h>
> > > > +#include <rte_atomic.h>
> > > > +#include <rte_branch_prediction.h>
> > > > +#include <rte_memzone.h>
> > > > +#include <rte_pause.h>
> > > > +#include <rte_ring.h>
> > > > +
> > > > +/* Ring API suffix name - used to append to API names */
> > > > +#ifndef RTE_RING_TMPLT_API_SUFFIX
> > > > +#error RTE_RING_TMPLT_API_SUFFIX not defined
> > > > +#endif
> > > > +
> > > > +/* Ring's element size in bits, should be a power of 2 */
> > > > +#ifndef RTE_RING_TMPLT_ELEM_SIZE
> > > > +#error RTE_RING_TMPLT_ELEM_SIZE not defined
> > > > +#endif
> > > > +
> > > > +/* Type of ring elements */
> > > > +#ifndef RTE_RING_TMPLT_ELEM_TYPE
> > > > +#error RTE_RING_TMPLT_ELEM_TYPE not defined
> > > > +#endif
> > > > +
> > > > +#define _rte_fuse(a, b) a##_##b
> > > > +#define __rte_fuse(a, b) _rte_fuse(a, b)
> > > > +#define __RTE_RING_CONCAT(a) __rte_fuse(a, RTE_RING_TMPLT_API_SUFFIX)
> > > > +
> > > > +/* Calculate the memory size needed for a ring */
> > > > +RTE_RING_TMPLT_EXPERIMENTAL ssize_t
> > > > +__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count);
> > > > +
> > > > +/* Create a new ring named *name* in memory. */
> > > > +RTE_RING_TMPLT_EXPERIMENTAL struct rte_ring *
> > > > +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> > > > +					int socket_id, unsigned flags);
> > >
> > >
> > > Just an idea - probably the same thing can be achieved in a different way.
> > > Instead of all these defines - replace ENQUEUE_PTRS/DEQUEUE_PTRS
> > > macros with static inline functions and then make all internal functions,
> > > i.e. __rte_ring_do_dequeue(),
> > > accept an enqueue/dequeue function pointer as a parameter.
> > > Then, let's say, the default rte_ring_mc_dequeue_bulk will do:
> > >
> > > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> > >                 unsigned int n, unsigned int *available)
> > > {
> > >         return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> > >                         __IS_MC, available, dequeue_ptr_default);
> > > }
> > >
> > > Then if someone would like to define ring functions for elt_size==X,
> > > all he would need to do is:
> > > 1. define his own enqueue/dequeue functions.
> > > 2. do something like:
> > > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> > >                 unsigned int n, unsigned int *available)
> > > {
> > >         return __rte_ring_do_dequeue(r, obj_table, n, RTE_RING_QUEUE_FIXED,
> > >                         __IS_MC, available, dequeue_X);
> > > }
> > >
> > > Konstantin
> > Thanks for the feedback/idea. The goal of this patch was to make it
> > simple enough to define APIs to store any element size without code
> > duplication.
> 
> Well, then if we store elt_size inside the ring, it should be easy enough to add
> to the API generic functions that would use memcpy(or rte_memcpy) for
> enqueue/dequeue.
> Yes, it might be slower than existing (8B per elem), but might be still
> acceptable.
The element size will be a constant in most use cases. If we keep the element size as a parameter, the compiler can apply loop unrolling and auto-vectorization optimizations to the copy.
Storing the element size in the ring will result in an additional memory access.
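
To illustrate the point about constant element sizes, here is a small sketch under stated assumptions: copy_elems, copy_elems_16B, sized_ring and copy_elems_stored are hypothetical names, the element size is expressed in 32-bit words, and always_inline is written in its GCC/Clang attribute form. When the size reaches the copy loop as a literal, the compiler is free to unroll or vectorize the inner loop; when it is loaded from the ring, it has to be treated as a run-time value. Whether a given compiler actually performs these optimizations is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

/* Copy n elements, each esize_words 32-bit words long. */
static inline __attribute__((always_inline)) void
copy_elems(uint32_t *dst, const uint32_t *src, unsigned int n,
		unsigned int esize_words)
{
	unsigned int i, j;

	for (i = 0; i < n; i++)
		for (j = 0; j < esize_words; j++)
			dst[i * esize_words + j] = src[i * esize_words + j];
}

/*
 * Element size passed as a literal: after inlining, the inner loop has a
 * known trip count of 4, so the compiler may unroll or vectorize it.
 */
static inline void
copy_elems_16B(uint32_t *dst, const uint32_t *src, unsigned int n)
{
	copy_elems(dst, src, n, 4);
}

/* Hypothetical ring header that stores the element size. */
struct sized_ring {
	unsigned int esize_words;
};

/*
 * Element size loaded from the ring: one extra memory read, and the inner
 * loop trip count is only known at run time.
 */
static inline void
copy_elems_stored(const struct sized_ring *r, uint32_t *dst,
		const uint32_t *src, unsigned int n)
{
	copy_elems(dst, src, n, r->esize_words);
}
```

Both variants copy the same bytes; the difference is only in what the compiler knows about esize_words at the call site.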

> 
> > With this patch, the user has to write ~4 lines of code to get APIs for
> > any element size. I would like to keep that goal.
> >
> > If we have to avoid the macro-fest, the main problem that needs to be
> > addressed is - how to represent different sizes of element types in a generic
> > way? IMO, we can do this by defining the element type to be a multiple of
> > uint32_t (I do not think we need to go to uint16_t).
> >
> > For ex:
> > rte_ring_mp_enqueue_bulk_objs(struct rte_ring *r,
> >                 uint32_t *obj_table, unsigned int num_objs,
> >                 unsigned int n,
> >                 enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> >                 unsigned int *free_space)
> > {
> > }
> >
> > This approach would ensure that we have generic enough APIs and they
> > can be used for elements of any size. But the element itself needs to be a
> > multiple of 32b - I think this should not be a concern.
> >
> > The API suffix definitely needs to be better, any suggestions?
> 
> >
> > >
> > >
> > > > +
> > > > +/**
> > > > + * @internal Enqueue several objects on the ring
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__RTE_RING_CONCAT(__rte_ring_do_enqueue)(struct rte_ring *r,
> > > > +		RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > > > +		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > > > +		unsigned int *free_space)
> > > > +{
> > > > +	uint32_t prod_head, prod_next;
> > > > +	uint32_t free_entries;
> > > > +
> > > > +	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
> > > > +			&prod_head, &prod_next, &free_entries);
> > > > +	if (n == 0)
> > > > +		goto end;
> > > > +
> > > > +	ENQUEUE_PTRS(r, &r[1], prod_head, obj_table, n,
> > > > +		RTE_RING_TMPLT_ELEM_TYPE);
> > > > +
> > > > +	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> > > > +end:
> > > > +	if (free_space != NULL)
> > > > +		*free_space = free_entries - n;
> > > > +	return n;
> > > > +}
> > > > +
> > > > +/**
> > > > + * @internal Dequeue several objects from the ring
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__RTE_RING_CONCAT(__rte_ring_do_dequeue)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > > +	enum rte_ring_queue_behavior behavior, unsigned int is_sc,
> > > > +	unsigned int *available)
> > > > +{
> > > > +	uint32_t cons_head, cons_next;
> > > > +	uint32_t entries;
> > > > +
> > > > +	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
> > > > +			&cons_head, &cons_next, &entries);
> > > > +	if (n == 0)
> > > > +		goto end;
> > > > +
> > > > +	DEQUEUE_PTRS(r, &r[1], cons_head, obj_table, n,
> > > > +		RTE_RING_TMPLT_ELEM_TYPE);
> > > > +
> > > > +	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
> > > > +
> > > > +end:
> > > > +	if (available != NULL)
> > > > +		*available = entries - n;
> > > > +	return n;
> > > > +}
> > > > +
> > > > +
> > > > +/**
> > > > + * Enqueue several objects on the ring (multi-producers safe).
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > > > +	unsigned int *free_space)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Enqueue several objects on a ring (NOT multi-producers safe).
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > > > +	unsigned int *free_space)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Enqueue several objects on a ring.
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__RTE_RING_CONCAT(rte_ring_enqueue_bulk)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE const *obj_table, unsigned int n,
> > > > +	unsigned int *free_space)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Enqueue one object on a ring (multi-producers safe).
> > > > + */
> > > > +static __rte_always_inline int
> > > > +__RTE_RING_CONCAT(rte_ring_mp_enqueue)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE obj)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(rte_ring_mp_enqueue_bulk)(r, &obj, 1, NULL) ?
> > > > +			0 : -ENOBUFS;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Enqueue one object on a ring (NOT multi-producers safe).
> > > > + */
> > > > +static __rte_always_inline int
> > > > +__RTE_RING_CONCAT(rte_ring_sp_enqueue)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE obj)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(rte_ring_sp_enqueue_bulk)(r, &obj, 1, NULL) ?
> > > > +			0 : -ENOBUFS;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Enqueue one object on a ring.
> > > > + */
> > > > +static __rte_always_inline int
> > > > +__RTE_RING_CONCAT(rte_ring_enqueue)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(rte_ring_enqueue_bulk)(r, obj, 1, NULL) ?
> > > > +			0 : -ENOBUFS;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Dequeue several objects from a ring (multi-consumers safe).
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > > +	unsigned int *available)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_FIXED, __IS_MC, available);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Dequeue several objects from a ring (NOT multi-consumers safe).
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > > +	unsigned int *available)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_FIXED, __IS_SC, available);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Dequeue several objects from a ring.
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__RTE_RING_CONCAT(rte_ring_dequeue_bulk)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > > +	unsigned int *available)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_FIXED, r->cons.single, available);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Dequeue one object from a ring (multi-consumers safe).
> > > > + */
> > > > +static __rte_always_inline int
> > > > +__RTE_RING_CONCAT(rte_ring_mc_dequeue)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(rte_ring_mc_dequeue_bulk)(r, obj_p, 1, NULL) ?
> > > > +			0 : -ENOENT;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Dequeue one object from a ring (NOT multi-consumers safe).
> > > > + */
> > > > +static __rte_always_inline int
> > > > +__RTE_RING_CONCAT(rte_ring_sc_dequeue)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(rte_ring_sc_dequeue_bulk)(r, obj_p, 1, NULL) ?
> > > > +			0 : -ENOENT;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Dequeue one object from a ring.
> > > > + */
> > > > +static __rte_always_inline int
> > > > +__RTE_RING_CONCAT(rte_ring_dequeue)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_p)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(rte_ring_dequeue_bulk)(r, obj_p, 1, NULL) ?
> > > > +			0 : -ENOENT;
> > > > +}
> > > > +
> > > > +/**
> > > > + * Enqueue several objects on the ring (multi-producers safe).
> > > > + */
> > > > +static __rte_always_inline unsigned
> > > > +__RTE_RING_CONCAT(rte_ring_mp_enqueue_burst)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
> > > > +			 unsigned int n, unsigned int *free_space)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Enqueue several objects on a ring (NOT multi-producers safe).
> > > > + */
> > > > +static __rte_always_inline unsigned
> > > > +__RTE_RING_CONCAT(rte_ring_sp_enqueue_burst)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table,
> > > > +			 unsigned int n, unsigned int *free_space) {
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space); }
> > > > +
> > > > +/**
> > > > + * Enqueue several objects on a ring.
> > > > + */
> > > > +static __rte_always_inline unsigned
> > > > +__RTE_RING_CONCAT(rte_ring_enqueue_burst)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > > +	unsigned int *free_space)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_enqueue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
> > > > +}
> > > > +
> > > > +/**
> > > > + * Dequeue several objects from a ring (multi-consumers safe). When the
> > > > + * request objects are more than the available objects, only dequeue the
> > > > + * actual number of objects
> > > > + */
> > > > +static __rte_always_inline unsigned
> > > > +__RTE_RING_CONCAT(rte_ring_mc_dequeue_burst)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > > +	unsigned int *available)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_VARIABLE, __IS_MC, available); }
> > > > +
> > > > +/**
> > > > + * Dequeue several objects from a ring (NOT multi-consumers safe). When the
> > > > + * request objects are more than the available objects, only dequeue the
> > > > + * actual number of objects
> > > > + */
> > > > +static __rte_always_inline unsigned
> > > > +__RTE_RING_CONCAT(rte_ring_sc_dequeue_burst)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > > +	unsigned int *available)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > > +			RTE_RING_QUEUE_VARIABLE, __IS_SC, available); }
> > > > +
> > > > +/**
> > > > + * Dequeue multiple objects from a ring up to a maximum number.
> > > > + */
> > > > +static __rte_always_inline unsigned
> > > > +__RTE_RING_CONCAT(rte_ring_dequeue_burst)(struct rte_ring *r,
> > > > +	RTE_RING_TMPLT_ELEM_TYPE *obj_table, unsigned int n,
> > > > +	unsigned int *available)
> > > > +{
> > > > +	return __RTE_RING_CONCAT(__rte_ring_do_dequeue)(r, obj_table, n,
> > > > +				RTE_RING_QUEUE_VARIABLE,
> > > > +				r->cons.single, available);
> > > > +}
> > > > +
> > > > +#ifdef __cplusplus
> > > > +}
> > > > +#endif
> > > > +
> > > > +#endif /* _RTE_RING_TEMPLATE_H_ */
> > > > --
> > > > 2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes
  2019-10-03  3:33         ` Honnappa Nagarahalli
@ 2019-10-03 11:51           ` Ananyev, Konstantin
  2019-10-03 12:27             ` Ananyev, Konstantin
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-03 11:51 UTC (permalink / raw)
  To: Honnappa Nagarahalli, olivier.matz, Wang, Yipeng1, Gobriel,
	Sameh, Richardson, Bruce, De Lara Guarch, Pablo
  Cc: dev, Dharmik Thakkar, Gavin Hu (Arm Technology China),
	Ruifeng Wang (Arm Technology China),
	nd, nd, nd



> > > > > +++ b/lib/librte_ring/rte_ring_template.h
> > > > > @@ -0,0 +1,330 @@
> > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > + * Copyright (c) 2019 Arm Limited  */
> > > > > +
> > > > > +#ifndef _RTE_RING_TEMPLATE_H_
> > > > > +#define _RTE_RING_TEMPLATE_H_
> > > > > +
> > > > > +#ifdef __cplusplus
> > > > > +extern "C" {
> > > > > +#endif
> > > > > +
> > > > > +#include <stdio.h>
> > > > > +#include <stdint.h>
> > > > > +#include <sys/queue.h>
> > > > > +#include <errno.h>
> > > > > +#include <rte_common.h>
> > > > > +#include <rte_config.h>
> > > > > +#include <rte_memory.h>
> > > > > +#include <rte_lcore.h>
> > > > > +#include <rte_atomic.h>
> > > > > +#include <rte_branch_prediction.h>
> > > > > +#include <rte_memzone.h>
> > > > > +#include <rte_pause.h>
> > > > > +#include <rte_ring.h>
> > > > > +
> > > > > +/* Ring API suffix name - used to append to API names */
> > > > > +#ifndef RTE_RING_TMPLT_API_SUFFIX
> > > > > +#error RTE_RING_TMPLT_API_SUFFIX not defined
> > > > > +#endif
> > > > > +
> > > > > +/* Ring's element size in bits, should be a power of 2 */
> > > > > +#ifndef RTE_RING_TMPLT_ELEM_SIZE
> > > > > +#error RTE_RING_TMPLT_ELEM_SIZE not defined
> > > > > +#endif
> > > > > +
> > > > > +/* Type of ring elements */
> > > > > +#ifndef RTE_RING_TMPLT_ELEM_TYPE
> > > > > +#error RTE_RING_TMPLT_ELEM_TYPE not defined
> > > > > +#endif
> > > > > +
> > > > > +#define _rte_fuse(a, b) a##_##b
> > > > > +#define __rte_fuse(a, b) _rte_fuse(a, b)
> > > > > +#define __RTE_RING_CONCAT(a) __rte_fuse(a, RTE_RING_TMPLT_API_SUFFIX)
> > > > > +
> > > > > +/* Calculate the memory size needed for a ring */
> > > > > +RTE_RING_TMPLT_EXPERIMENTAL ssize_t
> > > > > +__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count);
> > > > > +
> > > > > +/* Create a new ring named *name* in memory. */
> > > > > +RTE_RING_TMPLT_EXPERIMENTAL struct rte_ring *
> > > > > +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> > > > > +					int socket_id, unsigned flags);
> > > >
> > > >
> > > > Just an idea - probably same thing can be achieved in a different way.
> > > > Instead of all these defines - replace ENQUEUE_PTRS/DEQUEUE_PTRS
> > > > macros with static inline functions and then make all internal functions,
> > i.e.
> > > > __rte_ring_do_dequeue()
> > > > to accept enqueue/dequeue function pointer as a parameter.
> > > > Then let say default rte_ring_mc_dequeue_bulk will do:
> > > >
> > > > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> > > >                 unsigned int n, unsigned int *available) {
> > > >         return __rte_ring_do_dequeue(r, obj_table, n,
> > RTE_RING_QUEUE_FIXED,
> > > >                         __IS_MC, available, dequeue_ptr_default); }
> > > >
> > > > > Then if someone will like to define ring functions for elt_size==X,
> > > > all he would need to do:
> > > > 1. define his own enqueue/dequeuer functions.
> > > > 2. do something like:
> > > > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> > > >                 unsigned int n, unsigned int *available) {
> > > >         return __rte_ring_do_dequeue(r, obj_table, n,
> > RTE_RING_QUEUE_FIXED,
> > > >                         __IS_MC, available, dequeue_X); }
> > > >
> > > > Konstantin
> > > Thanks for the feedback/idea. The goal of this patch was to make it
> > > simple enough to define APIs to store any element size without code
> > duplication.
> >
> > Well, then if we store elt_size inside the ring, it should be easy enough to add
> > to the API generic functions that would use memcpy(or rte_memcpy) for
> > enqueue/dequeue.
> > Yes, it might be slower than existing (8B per elem), but might be still
> > acceptable.
> The element size will be a constant in most use cases. If we keep the element size as a parameter, it allows the compiler to do any loop
> unrolling and auto-vectorization optimizations on copying.
> Storing the element size will result in additional memory access.

I understand that, but for your case (rcu defer queue) you probably need highest possible performance, right?
I am sure there will be other cases where such slight perf degradation is acceptable.
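
The trade-off discussed above can be illustrated with a minimal, single-threaded sketch. It is not the rte_ring implementation: the names toy_ring, toy_ring_init, toy_enqueue_stored, toy_enqueue_param and toy_dequeue are hypothetical, and there is no multi-producer/consumer synchronization. Option A reads the element size from the ring at run time, which costs an extra memory access; Option B takes the size as a parameter, so a caller passing a constant lets the compiler see a fixed copy length after inlining.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SLOTS     8u   /* number of slots, must be a power of 2 */
#define TOY_MAX_ESIZE 16u  /* largest element size in bytes (assumption) */

/* Hypothetical toy ring; not struct rte_ring. */
struct toy_ring {
	uint32_t esize;  /* element size in bytes, fixed at init */
	uint32_t head;   /* count of elements enqueued so far */
	uint32_t tail;   /* count of elements dequeued so far */
	uint8_t slots[TOY_SLOTS * TOY_MAX_ESIZE];
};

static inline void
toy_ring_init(struct toy_ring *r, uint32_t esize)
{
	assert(esize > 0 && esize <= TOY_MAX_ESIZE);
	memset(r, 0, sizeof(*r));
	r->esize = esize;
}

/* Option A: element size is loaded from the ring (run-time value). */
static inline int
toy_enqueue_stored(struct toy_ring *r, const void *obj)
{
	if (r->head - r->tail == TOY_SLOTS)
		return -1; /* ring is full */
	memcpy(&r->slots[(r->head & (TOY_SLOTS - 1)) * r->esize],
	       obj, r->esize);
	r->head++;
	return 0;
}

/* Option B: element size is a parameter. The caller is responsible
 * for passing the same size that was given to toy_ring_init(). */
static inline int
toy_enqueue_param(struct toy_ring *r, const void *obj, uint32_t esize)
{
	if (r->head - r->tail == TOY_SLOTS)
		return -1; /* ring is full */
	memcpy(&r->slots[(r->head & (TOY_SLOTS - 1)) * esize], obj, esize);
	r->head++;
	return 0;
}

/* Dequeue one element using the stored size; returns -1 when empty. */
static inline int
toy_dequeue(struct toy_ring *r, void *obj)
{
	if (r->head == r->tail)
		return -1; /* ring is empty */
	memcpy(obj, &r->slots[(r->tail & (TOY_SLOTS - 1)) * r->esize],
	       r->esize);
	r->tail++;
	return 0;
}
```

A caller of Option B would typically pass sizeof(uint32_t) or a similar constant, which is the case where the compiler has the most room to optimize the copy.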

> 
> >
> > >With this patch, the user has to write ~4 lines of code to get APIs for
> > >any element size. I would like to keep the goal still the  same.
> > >
> > > If we have to avoid the macro-fest, the main problem that needs to be
> > > addressed is - how to represent different sizes of element types in a generic
> > way? IMO, we can do this by defining the element type to be a multiple of
> > uint32_t (I do not think we need to go to uint16_t).
> > >
> > > For ex:
> > > rte_ring_mp_enqueue_bulk_objs(struct rte_ring *r,
> > >                 uint32_t *obj_table, unsigned int num_objs,
> > >                 unsigned int n,
> > >                 enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > >                 unsigned int *free_space) { }
> > >
> > > This approach would ensure that we have generic enough APIs and they
> > > can be used for elements of any size. But the element itself needs to be a
> > multiple of 32b - I think this should not be a concern.
> > >
> > > The API suffix definitely needs to be better, any suggestions?
> >

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes
  2019-10-03 11:51           ` Ananyev, Konstantin
@ 2019-10-03 12:27             ` Ananyev, Konstantin
  2019-10-03 22:49               ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-03 12:27 UTC (permalink / raw)
  To: Ananyev, Konstantin, Honnappa Nagarahalli, olivier.matz, Wang,
	Yipeng1, Gobriel, Sameh, Richardson, Bruce, De Lara Guarch,
	Pablo
  Cc: dev, Dharmik Thakkar, Gavin Hu (Arm Technology China),
	Ruifeng Wang (Arm Technology China),
	nd, nd, nd


> 
> > > > > > +++ b/lib/librte_ring/rte_ring_template.h
> > > > > > @@ -0,0 +1,330 @@
> > > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > > + * Copyright (c) 2019 Arm Limited  */
> > > > > > +
> > > > > > +#ifndef _RTE_RING_TEMPLATE_H_
> > > > > > +#define _RTE_RING_TEMPLATE_H_
> > > > > > +
> > > > > > +#ifdef __cplusplus
> > > > > > +extern "C" {
> > > > > > +#endif
> > > > > > +
> > > > > > +#include <stdio.h>
> > > > > > +#include <stdint.h>
> > > > > > +#include <sys/queue.h>
> > > > > > +#include <errno.h>
> > > > > > +#include <rte_common.h>
> > > > > > +#include <rte_config.h>
> > > > > > +#include <rte_memory.h>
> > > > > > +#include <rte_lcore.h>
> > > > > > +#include <rte_atomic.h>
> > > > > > +#include <rte_branch_prediction.h>
> > > > > > +#include <rte_memzone.h>
> > > > > > +#include <rte_pause.h>
> > > > > > +#include <rte_ring.h>
> > > > > > +
> > > > > > +/* Ring API suffix name - used to append to API names */
> > > > > > +#ifndef RTE_RING_TMPLT_API_SUFFIX
> > > > > > +#error RTE_RING_TMPLT_API_SUFFIX not defined
> > > > > > +#endif
> > > > > > +
> > > > > > +/* Ring's element size in bits, should be a power of 2 */
> > > > > > +#ifndef RTE_RING_TMPLT_ELEM_SIZE
> > > > > > +#error RTE_RING_TMPLT_ELEM_SIZE not defined
> > > > > > +#endif
> > > > > > +
> > > > > > +/* Type of ring elements */
> > > > > > +#ifndef RTE_RING_TMPLT_ELEM_TYPE
> > > > > > +#error RTE_RING_TMPLT_ELEM_TYPE not defined
> > > > > > +#endif
> > > > > > +
> > > > > > +#define _rte_fuse(a, b) a##_##b
> > > > > > +#define __rte_fuse(a, b) _rte_fuse(a, b)
> > > > > > +#define __RTE_RING_CONCAT(a) __rte_fuse(a, RTE_RING_TMPLT_API_SUFFIX)
> > > > > > +
> > > > > > +/* Calculate the memory size needed for a ring */
> > > > > > +RTE_RING_TMPLT_EXPERIMENTAL ssize_t
> > > > > > +__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count);
> > > > > > +
> > > > > > +/* Create a new ring named *name* in memory. */
> > > > > > +RTE_RING_TMPLT_EXPERIMENTAL struct rte_ring *
> > > > > > +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> > > > > > +					int socket_id, unsigned flags);
> > > > >
> > > > >
> > > > > Just an idea - probably same thing can be achieved in a different way.
> > > > > Instead of all these defines - replace ENQUEUE_PTRS/DEQUEUE_PTRS
> > > > > macros with static inline functions and then make all internal functions,
> > > i.e.
> > > > > __rte_ring_do_dequeue()
> > > > > to accept enqueue/dequeue function pointer as a parameter.
> > > > > Then let say default rte_ring_mc_dequeue_bulk will do:
> > > > >
> > > > > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> > > > >                 unsigned int n, unsigned int *available) {
> > > > >         return __rte_ring_do_dequeue(r, obj_table, n,
> > > RTE_RING_QUEUE_FIXED,
> > > > >                         __IS_MC, available, dequeue_ptr_default); }
> > > > >
> > > > > > Then if someone will like to define ring functions for elt_size==X,
> > > > > all he would need to do:
> > > > > 1. define his own enqueue/dequeuer functions.
> > > > > 2. do something like:
> > > > > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> > > > >                 unsigned int n, unsigned int *available) {
> > > > >         return __rte_ring_do_dequeue(r, obj_table, n,
> > > RTE_RING_QUEUE_FIXED,
> > > > >                         __IS_MC, available, dequeue_X); }
> > > > >
> > > > > Konstantin
> > > > Thanks for the feedback/idea. The goal of this patch was to make it
> > > > simple enough to define APIs to store any element size without code
> > > duplication.
> > >
> > > Well, then if we store elt_size inside the ring, it should be easy enough to add
> > > to the API generic functions that would use memcpy(or rte_memcpy) for
> > > enqueue/dequeue.
> > > Yes, it might be slower than existing (8B per elem), but might be still
> > > acceptable.
> > The element size will be a constant in most use cases. If we keep the element size as a parameter, it allows the compiler to do any loop
> > unrolling and auto-vectorization optimizations on copying.
> > Storing the element size will result in additional memory access.
> 
> I understand that, but for your case (rcu defer queue) you probably need highest possible performance, right?

Meant 'don't need' of course :)

> I am sure there will be other cases where such slight perf degradation is acceptable.
> 
> >
> > >
> > > >With this patch, the user has to write ~4 lines of code to get APIs for
> > > >any element size. I would like to keep the goal still the  same.
> > > >
> > > > If we have to avoid the macro-fest, the main problem that needs to be
> > > > addressed is - how to represent different sizes of element types in a generic
> > > way? IMO, we can do this by defining the element type to be a multiple of
> > > uint32_t (I do not think we need to go to uint16_t).
> > > >
> > > > For ex:
> > > > rte_ring_mp_enqueue_bulk_objs(struct rte_ring *r,
> > > >                 uint32_t *obj_table, unsigned int num_objs,
> > > >                 unsigned int n,
> > > >                 enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > > >                 unsigned int *free_space) { }
> > > >
> > > > This approach would ensure that we have generic enough APIs and they
> > > > can be used for elements of any size. But the element itself needs to be a
> > > multiple of 32b - I think this should not be a concern.
> > > >
> > > > The API suffix definitely needs to be better, any suggestions?
> > >

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH 2/5] lib/ring: add template to support different element sizes
  2019-10-03 12:27             ` Ananyev, Konstantin
@ 2019-10-03 22:49               ` Honnappa Nagarahalli
  0 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-03 22:49 UTC (permalink / raw)
  To: Ananyev, Konstantin, olivier.matz, Wang, Yipeng1, Gobriel, Sameh,
	Richardson, Bruce, De Lara Guarch, Pablo
  Cc: dev, Dharmik Thakkar, Gavin Hu (Arm Technology China),
	Ruifeng Wang (Arm Technology China),
	Honnappa Nagarahalli, nd, nd

<snip>

> > > > > > > +++ b/lib/librte_ring/rte_ring_template.h
> > > > > > > @@ -0,0 +1,330 @@
> > > > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > > > + * Copyright (c) 2019 Arm Limited  */
> > > > > > > +
> > > > > > > +#ifndef _RTE_RING_TEMPLATE_H_
> > > > > > > +#define _RTE_RING_TEMPLATE_H_
> > > > > > > +
> > > > > > > +#ifdef __cplusplus
> > > > > > > +extern "C" {
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +#include <stdio.h>
> > > > > > > +#include <stdint.h>
> > > > > > > +#include <sys/queue.h>
> > > > > > > +#include <errno.h>
> > > > > > > +#include <rte_common.h>
> > > > > > > +#include <rte_config.h>
> > > > > > > +#include <rte_memory.h>
> > > > > > > +#include <rte_lcore.h>
> > > > > > > +#include <rte_atomic.h>
> > > > > > > +#include <rte_branch_prediction.h>
> > > > > > > +#include <rte_memzone.h>
> > > > > > > +#include <rte_pause.h>
> > > > > > > +#include <rte_ring.h>
> > > > > > > +
> > > > > > > +/* Ring API suffix name - used to append to API names */
> > > > > > > +#ifndef RTE_RING_TMPLT_API_SUFFIX
> > > > > > > +#error RTE_RING_TMPLT_API_SUFFIX not defined
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +/* Ring's element size in bits, should be a power of 2 */
> > > > > > > +#ifndef RTE_RING_TMPLT_ELEM_SIZE
> > > > > > > +#error RTE_RING_TMPLT_ELEM_SIZE not defined
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +/* Type of ring elements */
> > > > > > > +#ifndef RTE_RING_TMPLT_ELEM_TYPE
> > > > > > > +#error RTE_RING_TMPLT_ELEM_TYPE not defined
> > > > > > > +#endif
> > > > > > > +
> > > > > > > +#define _rte_fuse(a, b) a##_##b
> > > > > > > +#define __rte_fuse(a, b) _rte_fuse(a, b)
> > > > > > > +#define __RTE_RING_CONCAT(a) __rte_fuse(a, RTE_RING_TMPLT_API_SUFFIX)
> > > > > > > +
> > > > > > > +/* Calculate the memory size needed for a ring */
> > > > > > > +RTE_RING_TMPLT_EXPERIMENTAL ssize_t
> > > > > > > +__RTE_RING_CONCAT(rte_ring_get_memsize)(unsigned count);
> > > > > > > +
> > > > > > > +/* Create a new ring named *name* in memory. */
> > > > > > > +RTE_RING_TMPLT_EXPERIMENTAL struct rte_ring *
> > > > > > > +__RTE_RING_CONCAT(rte_ring_create)(const char *name, unsigned count,
> > > > > > > +					int socket_id, unsigned flags);
> > > > > >
> > > > > >
> > > > > > Just an idea - probably same thing can be achieved in a different
> way.
> > > > > > Instead of all these defines - replace
> > > > > > ENQUEUE_PTRS/DEQUEUE_PTRS macros with static inline functions
> > > > > > and then make all internal functions,
> > > > i.e.
> > > > > > __rte_ring_do_dequeue()
> > > > > > to accept enqueue/dequeue function pointer as a parameter.
> > > > > > Then let say default rte_ring_mc_dequeue_bulk will do:
> > > > > >
> > > > > > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> > > > > >                 unsigned int n, unsigned int *available) {
> > > > > >         return __rte_ring_do_dequeue(r, obj_table, n,
> > > > RTE_RING_QUEUE_FIXED,
> > > > > >                         __IS_MC, available,
> > > > > > dequeue_ptr_default); }
> > > > > >
> > > > > > Then if someone will like to define ring functions
> > > > > > > for elt_size==X, all he would need to do:
> > > > > > 1. define his own enqueue/dequeuer functions.
> > > > > > 2. do something like:
> > > > > > rte_ring_mc_dequeue_bulk(struct rte_ring *r, void **obj_table,
> > > > > >                 unsigned int n, unsigned int *available) {
> > > > > >         return __rte_ring_do_dequeue(r, obj_table, n,
> > > > RTE_RING_QUEUE_FIXED,
> > > > > >                         __IS_MC, available, dequeue_X); }
> > > > > >
> > > > > > Konstantin
> > > > > Thanks for the feedback/idea. The goal of this patch was to make
> > > > > it simple enough to define APIs to store any element size
> > > > > without code
> > > > duplication.
> > > >
> > > > Well, then if we store elt_size inside the ring, it should be easy
> > > > enough to add to the API generic functions that would use
> > > > memcpy(or rte_memcpy) for enqueue/dequeue.
> > > > Yes, it might be slower than existing (8B per elem), but might be
> > > > still acceptable.
> > > The element size will be a constant in most use cases. If we keep
> > > the element size as a parameter, it allows the compiler to do any loop
> unrolling and auto-vectorization optimizations on copying.
> > > Storing the element size will result in additional memory access.
> >
> > I understand that, but for your case (rcu defer queue) you probably need
> highest possible performance, right?
> 
> Meant 'don't need' of course :)
😊 Understood. That is just one use case. It actually started as an option to reduce memory usage in different places. You can look at the rte_hash changes in this patch. I also have plans for further changes.

> 
> > I am sure there will be other cases where such slight perf degradation is
> acceptable.
> >
> > >
> > > >
> > > > >With this patch, the user has to write ~4 lines of code to get
> > > > >APIs for any element size. I would like to keep the goal still the  same.
> > > > >
> > > > > If we have to avoid the macro-fest, the main problem that needs
> > > > > to be addressed is - how to represent different sizes of element
> > > > > types in a generic
> > > > way? IMO, we can do this by defining the element type to be a
> > > > multiple of uint32_t (I do not think we need to go to uint16_t).
> > > > >
> > > > > For ex:
> > > > > rte_ring_mp_enqueue_bulk_objs(struct rte_ring *r,
> > > > >                 uint32_t *obj_table, unsigned int num_objs,
> > > > >                 unsigned int n,
> > > > >                 enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > > > >                 unsigned int *free_space) { }
> > > > >
> > > > > This approach would ensure that we have generic enough APIs and
> > > > > they can be used for elements of any size. But the element
> > > > > itself needs to be a
> > > > multiple of 32b - I think this should not be a concern.
> > > > >
> > > > > The API suffix definitely needs to be better, any suggestions?
> > > >

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v2 0/6] lib/ring: templates to support custom element size
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (6 preceding siblings ...)
  2019-09-09 13:04   ` [dpdk-dev] [PATCH v2 0/6] lib/ring: templates to support custom element size Aaron Conole
@ 2019-10-07 13:49   ` David Marchand
  2019-10-08 19:19   ` [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs " Honnappa Nagarahalli
                     ` (7 subsequent siblings)
  15 siblings, 0 replies; 173+ messages in thread
From: David Marchand @ 2019-10-07 13:49 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: Olivier Matz, Wang, Yipeng1, Gobriel, Sameh, Bruce Richardson,
	Pablo de Lara, dev, pbhagavatula, Jerin Jacob Kollanukkaran

On Fri, Sep 6, 2019 at 9:05 PM Honnappa Nagarahalli
<honnappa.nagarahalli@arm.com> wrote:
>
> The current rte_ring hard-codes the type of the ring element to 'void *',
> hence the size of the element is hard-coded to 32b/64b. Since the ring
> element type is not an input to rte_ring APIs, it results in couple
> of issues:
>
> 1) If an application requires to store an element which is not 64b, it
>    needs to write its own ring APIs similar to rte_event_ring APIs. This
>    creates additional burden on the programmers, who end up making
>    work-arounds and often waste memory.
> 2) If there are multiple libraries that store elements of the same
>    type, currently they would have to write their own rte_ring APIs. This
>    results in code duplication.
>
> This patch consists of several parts:
> 1) New APIs to support configurable ring element size
>    These will help reduce code duplication in the templates. I think these
>    can be made internal (do not expose to DPDK applications, but expose to
>    DPDK libraries), feedback needed.
>
> 2) rte_ring templates
>    The templates provide an easy way to add new APIs for different ring
>    element types/sizes which can be used by multiple libraries. These
>    also allow for creating APIs to store elements of custom types
>    (for ex: a structure)
>
>    The template needs 4 parameters:
>    a) RTE_RING_TMPLT_API_SUFFIX - This is used as a suffix to the
>       rte_ring APIs.
>       For ex: if RTE_RING_TMPLT_API_SUFFIX is '32b', the API name will be
>       rte_ring_create_32b
>    b) RTE_RING_TMPLT_ELEM_SIZE - Size of the ring element in bytes.
>       For ex: sizeof(uint32_t)
>    c) RTE_RING_TMPLT_ELEM_TYPE - Type of the ring element.
>       For ex: uint32_t. If a common ring library does not use a standard
>       data type, it should create its own type by defining a structure
>       with standard data type. For ex: for an element size of 96b, one
>       could define a structure
>
>       struct s_96b {
>           uint32_t a[3];
>       }
>       The common library can use this structure to define
>       RTE_RING_TMPLT_ELEM_TYPE.
>
>       The application using this common ring library should define its
>       element type as a union with the above structure.
>
>       union app_element_type {
>           struct s_96b v;
>           struct app_element {
>               uint16_t a;
>               uint16_t b;
>               uint32_t c;
>               uint32_t d;
>           }
>       }
>    d) RTE_RING_TMPLT_EXPERIMENTAL - Indicates if the new APIs being defined
>       are experimental. Should be set to empty to remove the experimental
>       tag.
>
>    The ring library consists of some APIs that are defined as inline
>    functions and some APIs that are non-inline functions. The non-inline
>    functions are in rte_ring_template.c. However, this file needs to be
>    included in other .c files. Any feedback on how to handle this is
>    appreciated.
>
>    Note that the templates help create the APIs that are dependent on the
>    element size (for ex: rte_ring_create, enqueue/dequeue etc). Other APIs
>    that do NOT depend on the element size do not need to be part of the
>    template (for ex: rte_ring_dump, rte_ring_count, rte_ring_free_count
>    etc).
>
> 3) APIs for 32b ring element size
>    This uses the templates to create APIs to enqueue/dequeue elements of
>    size 32b.
>
> 4) rte_hash library is changed to use 32b ring APIs
>    The 32b APIs are used in rte_hash library to store the free slot index
>    and free bucket index.
>
> 5) Event Dev changed to use ring templates
>    Event Dev defines its own 128b ring APIs using the templates. This helps
>    in keeping the 'struct rte_event' as is. If Event Dev has to use generic
>    128b ring APIs, it requires 'struct rte_event' to change to
>    'union rte_event' to include a generic data type such as '__int128_t'.
>    This breaks the API compatibility and results in large number of
>    changes.
>    With this change, the event rings are stored on rte_ring's tailq.
>    Event Dev specific ring list is NOT available. IMO, this does not have
>    any impact to the user.
>
> This patch results in following checkpatch issue:
> WARNING:UNSPECIFIED_INT: Prefer 'unsigned int' to bare use of 'unsigned'
>
> However, this patch is following the rules in the existing code. Please
> let me know if this needs to be fixed.
>
> v2
>  - Change Event Ring implementation to use ring templates
>    (Jerin, Pavan)
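
The 96b element layout described in item 2c of the quoted cover letter can be sketched as compilable C. This is only an illustration under stated assumptions: the quoted text leaves the inner structure unnamed as a member, so the member name e is invented here, and the helpers app_to_ring and ring_to_app are hypothetical. static_assert requires C11.

```c
#include <assert.h>
#include <stdint.h>

/* Generic 96b carrier type that a common ring library could use
 * as RTE_RING_TMPLT_ELEM_TYPE (taken from the quoted text). */
struct s_96b {
	uint32_t a[3];
};

/* Application view of the same 96 bits. The member name "e" is an
 * assumption; the quoted text does not name it. */
union app_element_type {
	struct s_96b v;
	struct app_element {
		uint16_t a;
		uint16_t b;
		uint32_t c;
		uint32_t d;
	} e;
};

/* Both views must occupy exactly the same storage. */
static_assert(sizeof(union app_element_type) == sizeof(struct s_96b),
	      "application element must fit the 96b carrier exactly");

/* Hypothetical helper: pack application fields into the carrier. */
static inline struct s_96b
app_to_ring(uint16_t a, uint16_t b, uint32_t c, uint32_t d)
{
	union app_element_type u;

	u.e.a = a;
	u.e.b = b;
	u.e.c = c;
	u.e.d = d;
	return u.v;
}

/* Hypothetical helper: recover application fields from the carrier. */
static inline struct app_element
ring_to_app(struct s_96b v)
{
	union app_element_type u;

	u.v = v;
	return u.e;
}
```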

I expect a v3 on this series:
- Bruce/Stephen were not happy with using macros,
- Aaron caught test issues,
- from my side, if patch 3 still applies after your changes, I prefer
we drop this patch on the check script, we can live with these
warnings,


Thanks.

-- 
David Marchand


^ permalink raw reply	[flat|nested] 173+ messages in thread

* [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs to support custom element size
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (7 preceding siblings ...)
  2019-10-07 13:49   ` David Marchand
@ 2019-10-08 19:19   ` Honnappa Nagarahalli
  2019-10-08 19:19     ` [dpdk-dev] [PATCH v3 1/2] lib/ring: apis to support configurable " Honnappa Nagarahalli
  2019-10-08 19:19     ` [dpdk-dev] [PATCH v3 2/2] test/ring: add test cases for configurable element size ring Honnappa Nagarahalli
  2019-10-09  2:47   ` [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs to support custom element size Honnappa Nagarahalli
                     ` (6 subsequent siblings)
  15 siblings, 2 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-08 19:19 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

The current rte_ring hard-codes the type of the ring element to 'void *',
hence the size of the element is hard-coded to 32b/64b. Since the ring
element type is not an input to rte_ring APIs, it results in a couple
of issues:

1) If an application has to store an element which is not 64b, it
   needs to write its own ring APIs similar to rte_event_ring APIs. This
   creates additional burden on the programmers, who end up making
   work-arounds and often waste memory.
2) If there are multiple libraries that store elements of the same
   type, currently they would have to write their own rte_ring APIs. This
   results in code duplication.

This patch adds new APIs to support configurable ring element size.
The APIs support custom element sizes by allowing the ring element to
be defined as a multiple of 32b.
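
To illustrate what "a multiple of 32b" means for the copy path, here is a minimal sketch. It is not the code in rte_ring_elem.h: the function name toy_enqueue_elems and its parameters are hypothetical, and head/tail management and synchronization are left out. It treats every element as esize / 4 consecutive uint32_t words and wraps the slot index with a power-of-2 mask.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Copy n elements of esize bytes each from obj_table into ring
 * storage, starting at slot idx and wrapping at size slots.
 * esize must be a multiple of 4; size must be a power of 2.
 */
static inline void
toy_enqueue_elems(uint32_t *ring, uint32_t size, uint32_t idx,
		  const uint32_t *obj_table, uint32_t esize, uint32_t n)
{
	uint32_t words = esize / sizeof(uint32_t); /* 32b words per element */
	uint32_t i, w;

	assert(esize % sizeof(uint32_t) == 0);
	assert(size != 0 && (size & (size - 1)) == 0);

	for (i = 0; i < n; i++) {
		uint32_t slot = (idx + i) & (size - 1); /* wrap around */

		for (w = 0; w < words; w++)
			ring[slot * words + w] = obj_table[i * words + w];
	}
}
```

When esize is a compile-time constant at the call site, the inner loop has a fixed trip count, which is what gives the compiler room to unroll or vectorize the copy.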

The aim is to achieve the same performance as the existing ring
implementation. The patch adds the same performance tests that are run
for the existing APIs. This allows for performance comparison.

I also tested with memcpy. x86 shows significant improvements on bulk
and burst tests. On the Arm platform I used, there is a drop of
4% to 6% in a few tests. Maybe this is something that we can explore
later.

Note that this version skips changes to other libraries as I would
like to get an agreement on the implementation from the community.
They will be added once there is agreement on the rte_ring changes.

v3
 - Removed macro-fest and used inline functions
   (Stephen, Bruce)

v2
 - Change Event Ring implementation to use ring templates
   (Jerin, Pavan)

Honnappa Nagarahalli (2):
  lib/ring: apis to support configurable element size
  test/ring: add test cases for configurable element size ring

 app/test/Makefile                    |   1 +
 app/test/meson.build                 |   1 +
 app/test/test_ring_perf_elem.c       | 419 ++++++++++++
 lib/librte_ring/Makefile             |   3 +-
 lib/librte_ring/meson.build          |   3 +
 lib/librte_ring/rte_ring.c           |  45 +-
 lib/librte_ring/rte_ring.h           |   1 +
 lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   2 +
 9 files changed, 1412 insertions(+), 9 deletions(-)
 create mode 100644 app/test/test_ring_perf_elem.c
 create mode 100644 lib/librte_ring/rte_ring_elem.h

-- 
2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* [dpdk-dev] [PATCH v3 1/2] lib/ring: apis to support configurable element size
  2019-10-08 19:19   ` [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs " Honnappa Nagarahalli
@ 2019-10-08 19:19     ` Honnappa Nagarahalli
  2019-10-08 19:19     ` [dpdk-dev] [PATCH v3 2/2] test/ring: add test cases for configurable element size ring Honnappa Nagarahalli
  1 sibling, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-08 19:19 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Current APIs assume ring elements to be pointers. However, in many
use cases, the size can be different. Add new APIs to support
configurable ring element sizes.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_ring/Makefile             |   3 +-
 lib/librte_ring/meson.build          |   3 +
 lib/librte_ring/rte_ring.c           |  45 +-
 lib/librte_ring/rte_ring.h           |   1 +
 lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   2 +
 6 files changed, 991 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_elem.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 21a36770d..515a967bb 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_ring.a
 
-CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal
 
 EXPORT_MAP := rte_ring_version.map
@@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+					rte_ring_elem.h \
 					rte_ring_generic.h \
 					rte_ring_c11_mem.h
 
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index ab8b0b469..74219840a 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -6,3 +6,6 @@ sources = files('rte_ring.c')
 headers = files('rte_ring.h',
 		'rte_ring_c11_mem.h',
 		'rte_ring_generic.h')
+
+# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
+allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d9b308036..6fed3648b 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -33,6 +33,7 @@
 #include <rte_tailq.h>
 
 #include "rte_ring.h"
+#include "rte_ring_elem.h"
 
 TAILQ_HEAD(rte_ring_list, rte_tailq_entry);
 
@@ -46,23 +47,42 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_elem(unsigned count, unsigned esize)
 {
 	ssize_t sz;
 
+	/* Supported esize values are 4/8/16.
+	 * Others can be added as needed.
+	 */
+	if ((esize != 4) && (esize != 8) && (esize != 16)) {
+		RTE_LOG(ERR, RING,
+			"Unsupported esize value. Supported values are 4, 8 and 16\n");
+
+		return -EINVAL;
+	}
+
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
 		RTE_LOG(ERR, RING,
-			"Requested size is invalid, must be power of 2, and "
-			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
+			"Requested number of elements is invalid, must be "
+			"power of 2, and do not exceed the limit %u\n",
+			RTE_RING_SZ_MASK);
+
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	sz = sizeof(struct rte_ring) + count * esize;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
 
+/* return the size of memory occupied by a ring */
+ssize_t
+rte_ring_get_memsize(unsigned count)
+{
+	return rte_ring_get_memsize_elem(count, sizeof(void *));
+}
+
 void
 rte_ring_reset(struct rte_ring *r)
 {
@@ -114,10 +134,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	return 0;
 }
 
-/* create the ring */
+/* create the ring for a given element size */
 struct rte_ring *
-rte_ring_create(const char *name, unsigned count, int socket_id,
-		unsigned flags)
+rte_ring_create_elem(const char *name, unsigned count, unsigned esize,
+		int socket_id, unsigned flags)
 {
 	char mz_name[RTE_MEMZONE_NAMESIZE];
 	struct rte_ring *r;
@@ -135,7 +155,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize_elem(count, esize);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -182,6 +202,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	return r;
 }
 
+/* create the ring */
+struct rte_ring *
+rte_ring_create(const char *name, unsigned count, int socket_id,
+		unsigned flags)
+{
+	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
+		flags);
+}
+
 /* free the ring */
 void
 rte_ring_free(struct rte_ring *r)
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 2a9f768a1..18fc5d845 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -216,6 +216,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  */
 struct rte_ring *rte_ring_create(const char *name, unsigned count,
 				 int socket_id, unsigned flags);
+
 /**
  * De-allocate all memory used by the ring.
  *
diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
new file mode 100644
index 000000000..d395229f1
--- /dev/null
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -0,0 +1,946 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 2019 Arm Limited
+ * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * All rights reserved.
+ * Derived from FreeBSD's bufring.h
+ * Used as BSD-3 Licensed with permission from Kip Macy.
+ */
+
+#ifndef _RTE_RING_ELEM_H_
+#define _RTE_RING_ELEM_H_
+
+/**
+ * @file
+ * RTE Ring with flexible element size
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#include "rte_ring.h"
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Calculate the memory size needed for a ring with given element size
+ *
+ * This function returns the number of bytes needed for a ring, given
+ * the number of elements in it and the size of the element. This value
+ * is the sum of the size of the structure rte_ring and the size of the
+ * memory needed for storing the elements. The value is aligned to a cache
+ * line size.
+ *
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported.
+ * @return
+ *   - The memory size needed for the ring on success.
+ *   - -EINVAL if count is not a power of 2 or esize is not supported.
+ */
+__rte_experimental
+ssize_t rte_ring_get_memsize_elem(unsigned count, unsigned esize);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Create a new ring named *name* that stores elements with given size.
+ *
+ * This function uses ``memzone_reserve()`` to allocate memory. Then it
+ * calls rte_ring_init() to initialize an empty ring.
+ *
+ * The new ring size is set to *count*, which must be a power of
+ * two. Water marking is disabled by default. The real usable ring size
+ * is *count-1* instead of *count* to differentiate a full ring from an
+ * empty ring.
+ *
+ * The ring is added in RTE_TAILQ_RING list.
+ *
+ * @param name
+ *   The name of the ring.
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported.
+ * @param socket_id
+ *   The *socket_id* argument is the socket identifier in case of
+ *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
+ *   constraint for the reserved zone.
+ * @param flags
+ *   An OR of the following:
+ *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
+ *      is "single-producer". Otherwise, it is "multi-producers".
+ *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
+ *      is "single-consumer". Otherwise, it is "multi-consumers".
+ * @return
+ *   On success, the pointer to the newly allocated ring. NULL on error with
+ *    rte_errno set appropriately. Possible errno values include:
+ *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
+ *    - E_RTE_SECONDARY - function was called from a secondary process instance
+ *    - EINVAL - count is not a power of 2 or esize is not supported
+ *    - ENOSPC - the maximum number of memzones has already been allocated
+ *    - EEXIST - a memzone with the same name already exists
+ *    - ENOMEM - no appropriate memory area found in which to create memzone
+ */
+__rte_experimental
+struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
+				unsigned esize, int socket_id, unsigned flags);
+
+/* the actual enqueue of elements on the ring.
+ * Placed here since identical code needed in both
+ * single and multi producer enqueue functions.
+ */
+#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
+	if (esize == 4) \
+		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
+	else if (esize == 8) \
+		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
+	else if (esize == 16) \
+		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
+} while (0)
+
+#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	uint32_t *ring = (uint32_t *)ring_start; \
+	uint32_t *obj = (uint32_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+			ring[idx + 2] = obj[i + 2]; \
+			ring[idx + 3] = obj[i + 3]; \
+			ring[idx + 4] = obj[i + 4]; \
+			ring[idx + 5] = obj[i + 5]; \
+			ring[idx + 6] = obj[i + 6]; \
+			ring[idx + 7] = obj[i + 7]; \
+		} \
+		switch (n & 0x7) { \
+		case 7: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 6: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 5: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 4: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 3: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	uint64_t *ring = (uint64_t *)ring_start; \
+	uint64_t *obj = (uint64_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+			ring[idx + 2] = obj[i + 2]; \
+			ring[idx + 3] = obj[i + 3]; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	__uint128_t *ring = (__uint128_t *)ring_start; \
+	__uint128_t *obj = (__uint128_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+		} \
+		switch (n & 0x1) { \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+/* the actual copy of elements on the ring to obj_table.
+ * Placed here since identical code needed in both
+ * single and multi consumer dequeue functions.
+ */
+#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
+	if (esize == 4) \
+		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
+	else if (esize == 8) \
+		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
+	else if (esize == 16) \
+		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
+} while (0)
+
+#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	uint32_t *ring = (uint32_t *)ring_start; \
+	uint32_t *obj = (uint32_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8) {\
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+			obj[i + 2] = ring[idx + 2]; \
+			obj[i + 3] = ring[idx + 3]; \
+			obj[i + 4] = ring[idx + 4]; \
+			obj[i + 5] = ring[idx + 5]; \
+			obj[i + 6] = ring[idx + 6]; \
+			obj[i + 7] = ring[idx + 7]; \
+		} \
+		switch (n & 0x7) { \
+		case 7: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 6: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 5: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 4: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 3: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 2: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 1: \
+			obj[i++] = ring[idx++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	uint64_t *ring = (uint64_t *)ring_start; \
+	uint64_t *obj = (uint64_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+			obj[i + 2] = ring[idx + 2]; \
+			obj[i + 3] = ring[idx + 3]; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 2: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 1: \
+			obj[i++] = ring[idx++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	__uint128_t *ring = (__uint128_t *)ring_start; \
+	__uint128_t *obj = (__uint128_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+		} \
+		switch (n & 0x1) { \
+		case 1: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+/* Between two loads, the CPU may reorder accesses on weakly ordered
+ * architectures (powerpc/arm).
+ * There are 2 choices for the users:
+ * 1. use the rmb() memory barrier
+ * 2. use one-direction load_acquire/store_release barriers, enabled by
+ * CONFIG_RTE_USE_C11_MEM_MODEL=y
+ * The choice depends on performance test results.
+ * By default, the common functions are taken from rte_ring_generic.h
+ */
+#ifdef RTE_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
+#include "rte_ring_generic.h"
+#endif
+
+/**
+ * @internal Enqueue several objects on the ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items on the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible on the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
+		unsigned int *free_space)
+{
+	uint32_t prod_head, prod_next;
+	uint32_t free_entries;
+
+	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
+			&prod_head, &prod_next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
+
+	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal Dequeue several objects from the ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from the ring
+ * @param is_sc
+ *   Indicates whether to use single consumer or multi-consumer head update
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sc,
+		unsigned int *available)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
+			&cons_head, &cons_next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS_ELEM(r, &r[1], cons_head, obj_table, esize, n);
+
+	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_mp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_sp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version depending on the default behavior that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
+}
+
+/**
+ * Enqueue one object on a ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_mp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_mp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_sp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_sp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_mc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+				RTE_RING_QUEUE_FIXED, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table,
+ *   must be strictly positive.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_sc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_SC, available);
+}
+
+/**
+ * Dequeue several objects from a ring.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, r->cons.single, available);
+}
+
+/**
+ * Dequeue one object from a ring (multi-consumers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to the object that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue; no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_mc_dequeue_elem(struct rte_ring *r, void *obj_p,
+				unsigned int esize)
+{
+	return rte_ring_mc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL)  ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring (NOT multi-consumers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to the object that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_sc_dequeue_elem(struct rte_ring *r, void *obj_p,
+				unsigned int esize)
+{
+	return rte_ring_sc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to the object that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success, objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_dequeue_elem(struct rte_ring *r, void *obj_p, unsigned int esize)
+{
+	return rte_ring_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_mp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_sp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version depending on the default behavior that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes in size.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe). When more
+ * objects are requested than are available, only the available objects
+ * are dequeued.
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of esize-byte objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - n: Actual number of objects dequeued, 0 if ring is empty
+ */
+static __rte_always_inline unsigned
+rte_ring_mc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe). When more
+ * objects are requested than are available, only the available objects
+ * are dequeued.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of esize-byte objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - n: Actual number of objects dequeued, 0 if ring is empty
+ */
+static __rte_always_inline unsigned
+rte_ring_sc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+}
+
+/**
+ * Dequeue multiple objects from a ring up to a maximum number.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of esize-byte objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - Number of objects dequeued
+ */
+static __rte_always_inline unsigned
+rte_ring_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+				RTE_RING_QUEUE_VARIABLE,
+				r->cons.single, available);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_RING_ELEM_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index 510c1386e..e410a7503 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -21,6 +21,8 @@ DPDK_2.2 {
 EXPERIMENTAL {
 	global:
 
+	rte_ring_create_elem;
+	rte_ring_get_memsize_elem;
 	rte_ring_reset;
 
 };
-- 
2.17.1



* [dpdk-dev] [PATCH v3 2/2] test/ring: add test cases for configurable element size ring
  2019-10-08 19:19   ` [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs " Honnappa Nagarahalli
  2019-10-08 19:19     ` [dpdk-dev] [PATCH v3 1/2] lib/ring: apis to support configurable " Honnappa Nagarahalli
@ 2019-10-08 19:19     ` Honnappa Nagarahalli
  1 sibling, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-08 19:19 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Add test cases to test APIs for configurable element size ring.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 app/test/Makefile              |   1 +
 app/test/meson.build           |   1 +
 app/test/test_ring_perf_elem.c | 419 +++++++++++++++++++++++++++++++++
 3 files changed, 421 insertions(+)
 create mode 100644 app/test/test_ring_perf_elem.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 26ba6fe2b..e5cb27b75 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -78,6 +78,7 @@ SRCS-y += test_rand_perf.c
 
 SRCS-y += test_ring.c
 SRCS-y += test_ring_perf.c
+SRCS-y += test_ring_perf_elem.c
 SRCS-y += test_pmd_perf.c
 
 ifeq ($(CONFIG_RTE_LIBRTE_TABLE),y)
diff --git a/app/test/meson.build b/app/test/meson.build
index ec40943bd..995ee9bc7 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ test_sources = files('commands.c',
 	'test_reorder.c',
 	'test_ring.c',
 	'test_ring_perf.c',
+	'test_ring_perf_elem.c',
 	'test_rwlock.c',
 	'test_sched.c',
 	'test_service_cores.c',
diff --git a/app/test/test_ring_perf_elem.c b/app/test/test_ring_perf_elem.c
new file mode 100644
index 000000000..fc5b82d71
--- /dev/null
+++ b/app/test/test_ring_perf_elem.c
@@ -0,0 +1,419 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+
+#include <stdio.h>
+#include <inttypes.h>
+#include <rte_ring.h>
+#include <rte_ring_elem.h>
+#include <rte_cycles.h>
+#include <rte_launch.h>
+#include <rte_pause.h>
+
+#include "test.h"
+
+/*
+ * Ring
+ * ====
+ *
+ * Measures performance of various operations using rdtsc
+ *  * Empty ring dequeue
+ *  * Enqueue/dequeue of bursts in 1 thread
+ *  * Enqueue/dequeue of bursts in 2 threads
+ */
+
+#define RING_NAME "RING_PERF"
+#define RING_SIZE 4096
+#define MAX_BURST 64
+
+/*
+ * the sizes to enqueue and dequeue in testing
+ * (marked volatile so they won't be seen as compile-time constants)
+ */
+static const volatile unsigned bulk_sizes[] = { 8, 32 };
+
+struct lcore_pair {
+	unsigned c1, c2;
+};
+
+static volatile unsigned lcore_count;
+
+/**** Functions to analyse our core mask to get cores for different tests ***/
+
+static int
+get_two_hyperthreads(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned c1, c2, s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		/* inner loop just re-reads all id's. We could skip the
+		 * first few elements, but since number of cores is small
+		 * there is little point
+		 */
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+
+			c1 = rte_lcore_to_cpu_id(id1);
+			c2 = rte_lcore_to_cpu_id(id2);
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if ((c1 == c2) && (s1 == s2)) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+static int
+get_two_cores(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned c1, c2, s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+
+			c1 = rte_lcore_to_cpu_id(id1);
+			c2 = rte_lcore_to_cpu_id(id2);
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if ((c1 != c2) && (s1 == s2)) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+static int
+get_two_sockets(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if (s1 != s2) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+/* Get cycle counts for dequeuing from an empty ring. Should be 2 or 3 cycles */
+static void
+test_empty_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 26;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned i = 0;
+	uint32_t burst[MAX_BURST];
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		rte_ring_sc_dequeue_bulk_elem(r, burst, 8, bulk_sizes[0], NULL);
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		rte_ring_mc_dequeue_bulk_elem(r, burst, 8, bulk_sizes[0], NULL);
+	const uint64_t mc_end = rte_rdtsc();
+
+	printf("SC empty dequeue: %.2F\n",
+			(double)(sc_end-sc_start) / iterations);
+	printf("MC empty dequeue: %.2F\n",
+			(double)(mc_end-mc_start) / iterations);
+}
+
+/*
+ * for the separate enqueue and dequeue threads they take in one param
+ * and return two. Input = burst size, output = cycle average for sp/sc & mp/mc
+ */
+struct thread_params {
+	struct rte_ring *r;
+	unsigned size;        /* input value, the burst size */
+	double spsc, mpmc;    /* output value, the single or multi timings */
+};
+
+/*
+ * Function that uses rdtsc to measure timing for ring enqueue. Needs pair
+ * thread running dequeue_bulk function
+ */
+static int
+enqueue_bulk(void *p)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	struct thread_params *params = p;
+	struct rte_ring *r = params->r;
+	const unsigned size = params->size;
+	unsigned i;
+	uint32_t burst[MAX_BURST] = {0};
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
+		while (lcore_count != 2)
+			rte_pause();
+
+	const uint64_t sp_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_sp_enqueue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t sp_end = rte_rdtsc();
+
+	const uint64_t mp_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_mp_enqueue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t mp_end = rte_rdtsc();
+
+	params->spsc = ((double)(sp_end - sp_start))/(iterations*size);
+	params->mpmc = ((double)(mp_end - mp_start))/(iterations*size);
+	return 0;
+}
+
+/*
+ * Function that uses rdtsc to measure timing for ring dequeue. Needs pair
+ * thread running enqueue_bulk function
+ */
+static int
+dequeue_bulk(void *p)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	struct thread_params *params = p;
+	struct rte_ring *r = params->r;
+	const unsigned size = params->size;
+	unsigned i;
+	uint32_t burst[MAX_BURST] = {0};
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
+		while (lcore_count != 2)
+			rte_pause();
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_sc_dequeue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_mc_dequeue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t mc_end = rte_rdtsc();
+
+	params->spsc = ((double)(sc_end - sc_start))/(iterations*size);
+	params->mpmc = ((double)(mc_end - mc_start))/(iterations*size);
+	return 0;
+}
+
+/*
+ * Function that calls the enqueue and dequeue bulk functions on pairs of cores.
+ * used to measure ring perf between hyperthreads, cores and sockets.
+ */
+static void
+run_on_core_pair(struct lcore_pair *cores, struct rte_ring *r,
+		lcore_function_t f1, lcore_function_t f2)
+{
+	struct thread_params param1 = {0}, param2 = {0};
+	unsigned i;
+	for (i = 0; i < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); i++) {
+		lcore_count = 0;
+		param1.size = param2.size = bulk_sizes[i];
+		param1.r = param2.r = r;
+		if (cores->c1 == rte_get_master_lcore()) {
+			rte_eal_remote_launch(f2, &param2, cores->c2);
+			f1(&param1);
+			rte_eal_wait_lcore(cores->c2);
+		} else {
+			rte_eal_remote_launch(f1, &param1, cores->c1);
+			rte_eal_remote_launch(f2, &param2, cores->c2);
+			rte_eal_wait_lcore(cores->c1);
+			rte_eal_wait_lcore(cores->c2);
+		}
+		printf("SP/SC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[i], param1.spsc + param2.spsc);
+		printf("MP/MC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[i], param1.mpmc + param2.mpmc);
+	}
+}
+
+/*
+ * Test function that determines how long an enqueue + dequeue of a single item
+ * takes on a single lcore. Result is for comparison with the bulk enq+deq.
+ */
+static void
+test_single_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 24;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned i = 0;
+	uint32_t burst[2];
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++) {
+		rte_ring_sp_enqueue_elem(r, burst, 8);
+		rte_ring_sc_dequeue_elem(r, burst, 8);
+	}
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++) {
+		rte_ring_mp_enqueue_elem(r, burst, 8);
+		rte_ring_mc_dequeue_elem(r, burst, 8);
+	}
+	const uint64_t mc_end = rte_rdtsc();
+
+	printf("SP/SC single enq/dequeue: %"PRIu64"\n",
+			(sc_end-sc_start) >> iter_shift);
+	printf("MP/MC single enq/dequeue: %"PRIu64"\n",
+			(mc_end-mc_start) >> iter_shift);
+}
+
+/*
+ * Test that does both enqueue and dequeue on a core using the burst() API calls
+ * instead of the bulk() calls used in other tests. Results should be the same
+ * as for the bulk function called on a single lcore.
+ */
+static void
+test_burst_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned sz, i = 0;
+	uint32_t burst[MAX_BURST] = {0};
+
+	for (sz = 0; sz < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); sz++) {
+		const uint64_t sc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_sp_enqueue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_sc_dequeue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t sc_end = rte_rdtsc();
+
+		const uint64_t mc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_mp_enqueue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_mc_dequeue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t mc_end = rte_rdtsc();
+
+		uint64_t mc_avg = ((mc_end-mc_start) >> iter_shift) /
+					bulk_sizes[sz];
+		uint64_t sc_avg = ((sc_end-sc_start) >> iter_shift) /
+					bulk_sizes[sz];
+
+		printf("SP/SC burst enq/dequeue (size: %u): %"PRIu64"\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC burst enq/dequeue (size: %u): %"PRIu64"\n",
+				bulk_sizes[sz], mc_avg);
+	}
+}
+
+/* Times enqueue and dequeue on a single lcore */
+static void
+test_bulk_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned sz, i = 0;
+	uint32_t burst[MAX_BURST] = {0};
+
+	for (sz = 0; sz < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); sz++) {
+		const uint64_t sc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_sp_enqueue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_sc_dequeue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t sc_end = rte_rdtsc();
+
+		const uint64_t mc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_mp_enqueue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_mc_dequeue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t mc_end = rte_rdtsc();
+
+		double sc_avg = ((double)(sc_end-sc_start) /
+				(iterations * bulk_sizes[sz]));
+		double mc_avg = ((double)(mc_end-mc_start) /
+				(iterations * bulk_sizes[sz]));
+
+		printf("SP/SC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], mc_avg);
+	}
+}
+
+static int
+test_ring_perf_elem(void)
+{
+	struct lcore_pair cores;
+	struct rte_ring *r = NULL;
+
+	r = rte_ring_create_elem(RING_NAME, RING_SIZE, 8, rte_socket_id(), 0);
+	if (r == NULL)
+		return -1;
+
+	printf("### Testing single element and burst enq/deq ###\n");
+	test_single_enqueue_dequeue(r);
+	test_burst_enqueue_dequeue(r);
+
+	printf("\n### Testing empty dequeue ###\n");
+	test_empty_dequeue(r);
+
+	printf("\n### Testing using a single lcore ###\n");
+	test_bulk_enqueue_dequeue(r);
+
+	if (get_two_hyperthreads(&cores) == 0) {
+		printf("\n### Testing using two hyperthreads ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	if (get_two_cores(&cores) == 0) {
+		printf("\n### Testing using two physical cores ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	if (get_two_sockets(&cores) == 0) {
+		printf("\n### Testing using two NUMA nodes ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	rte_ring_free(r);
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(ring_perf_elem_autotest, test_ring_perf_elem);
-- 
2.17.1



* [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs to support custom element size
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (8 preceding siblings ...)
  2019-10-08 19:19   ` [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs " Honnappa Nagarahalli
@ 2019-10-09  2:47   ` Honnappa Nagarahalli
  2019-10-09  2:47     ` [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable " Honnappa Nagarahalli
  2019-10-09  2:47     ` [dpdk-dev] [PATCH v4 2/2] test/ring: add test cases for configurable element size ring Honnappa Nagarahalli
  2019-10-17 20:08   ` [dpdk-dev] [PATCH v5 0/3] lib/ring: APIs to support custom element size Honnappa Nagarahalli
                     ` (5 subsequent siblings)
  15 siblings, 2 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-09  2:47 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

The current rte_ring hard-codes the type of the ring element to 'void *',
hence the size of the element is hard-coded to 32b/64b. Since the ring
element type is not an input to rte_ring APIs, it results in a couple
of issues:

1) If an application needs to store an element which is not 64b, it
   needs to write its own ring APIs similar to rte_event_ring APIs. This
   creates an additional burden on the programmers, who end up making
   work-arounds and often waste memory.
2) If there are multiple libraries that store elements of the same
   type, currently they would have to write their own rte_ring APIs. This
   results in code duplication.

This patch adds new APIs to support configurable ring element size.
The APIs support custom element sizes by allowing the ring element
size to be defined as a multiple of 32b (element sizes of 4, 8 and
16 bytes are supported currently).
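
To illustrate how the element size feeds into the ring's memory
footprint, here is a minimal, self-contained sketch of the sizing rule
used by rte_ring_get_memsize_elem() in patch 1/2 (header size plus
count * esize, rounded up to a cache line). It is not the patch code:
the function name, the header size and the cache line size below are
assumptions made for illustration, and the RTE_RING_SZ_MASK upper bound
on count is left out.

```c
#include <stddef.h>

/* Hypothetical stand-ins; the real values come from DPDK headers. */
#define SKETCH_CACHE_LINE 64u  /* assumed cache line size */
#define SKETCH_HDR_SIZE   384u /* assumed sizeof(struct rte_ring) */

/* Return the bytes needed for a ring, or -1 on invalid arguments. */
static long
sketch_ring_memsize(unsigned int count, unsigned int esize)
{
	size_t sz;

	/* Only element sizes 4, 8 and 16 are accepted, as in the patch. */
	if (esize != 4 && esize != 8 && esize != 16)
		return -1;

	/* This sketch requires count to be a non-zero power of two. */
	if (count == 0 || (count & (count - 1)) != 0)
		return -1;

	/* Header followed by count slots of esize bytes each. */
	sz = SKETCH_HDR_SIZE + (size_t)count * esize;

	/* Round up to a multiple of the cache line size. */
	sz = (sz + SKETCH_CACHE_LINE - 1) & ~(size_t)(SKETCH_CACHE_LINE - 1);
	return (long)sz;
}
```

With these assumed constants, a 4096-entry ring of 4-byte elements needs
roughly half the slot memory of the same ring with 8-byte elements, which
is the saving the cover letter refers to.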

The aim is to achieve the same performance as the existing ring
implementation. The patch adds the same performance tests that are run
for the existing APIs. This allows for performance comparison.
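
For reference when comparing numbers, the bulk tests report an average
cost per element: the elapsed TSC cycles divided by (iterations * burst
size), with iterations = 1 << iter_shift. A small sketch of that
arithmetic follows; the helper name is hypothetical and the function is
not part of the patch.

```c
#include <stdint.h>

/*
 * Average cycles per element for a timed loop that ran
 * (1 << iter_shift) iterations, each moving 'burst' elements.
 */
static double
sketch_cycles_per_elem(uint64_t start, uint64_t end,
		unsigned int iter_shift, unsigned int burst)
{
	const uint64_t iterations = UINT64_C(1) << iter_shift;

	return (double)(end - start) / (double)(iterations * burst);
}
```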

I also tested with memcpy. x86 shows significant improvements on bulk
and burst tests. On the Arm platform I used, there is a drop of
4% to 6% in a few tests. Maybe this is something that we can explore
later.
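
The memcpy variant is not included in this series. As a rough idea of
what a memcpy-based copy into the ring could look like, here is a sketch
under my own assumptions (the function and parameter names are
illustrative, not the code that was benchmarked). It copies n elements of
esize bytes and splits the copy in two when it wraps past the end of the
ring storage.

```c
#include <stdint.h>
#include <string.h>

/*
 * Copy n elements of esize bytes from obj_table into the ring storage,
 * starting at slot (head & mask). 'size' is the number of slots (a power
 * of two) and mask == size - 1. The caller must ensure n <= size.
 */
static void
sketch_enqueue_copy(void *ring_start, uint32_t size, uint32_t mask,
		uint32_t head, const void *obj_table, uint32_t esize,
		uint32_t n)
{
	uint8_t *ring = ring_start;
	const uint8_t *obj = obj_table;
	const uint32_t idx = head & mask;

	if (idx + n <= size) {
		/* No wrap-around: one contiguous copy. */
		memcpy(ring + (size_t)idx * esize, obj, (size_t)n * esize);
	} else {
		/* Wrap-around: fill up to the end, then restart at slot 0. */
		const uint32_t first = size - idx;

		memcpy(ring + (size_t)idx * esize, obj,
				(size_t)first * esize);
		memcpy(ring, obj + (size_t)first * esize,
				(size_t)(n - first) * esize);
	}
}
```

The element-wise loops in patch 1/2 handle each supported size (4, 8, 16
bytes) with a dedicated unrolled copy, whereas a memcpy-based version
like this one leaves the copy strategy to the C library.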

Note that this version skips changes to other libraries as I would
like to get an agreement on the implementation from the community.
They will be added once there is agreement on the rte_ring changes.

v4
 - A few fixes after more performance testing

v3
 - Removed macro-fest and used inline functions
   (Stephen, Bruce)

v2
 - Change Event Ring implementation to use ring templates
   (Jerin, Pavan)

Honnappa Nagarahalli (2):
  lib/ring: apis to support configurable element size
  test/ring: add test cases for configurable element size ring

 app/test/Makefile                    |   1 +
 app/test/meson.build                 |   1 +
 app/test/test_ring_perf_elem.c       | 419 ++++++++++++
 lib/librte_ring/Makefile             |   3 +-
 lib/librte_ring/meson.build          |   3 +
 lib/librte_ring/rte_ring.c           |  45 +-
 lib/librte_ring/rte_ring.h           |   1 +
 lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   2 +
 9 files changed, 1412 insertions(+), 9 deletions(-)
 create mode 100644 app/test/test_ring_perf_elem.c
 create mode 100644 lib/librte_ring/rte_ring_elem.h

-- 
2.17.1



* [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-09  2:47   ` [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs to support custom element size Honnappa Nagarahalli
@ 2019-10-09  2:47     ` Honnappa Nagarahalli
  2019-10-11 19:21       ` Honnappa Nagarahalli
  2019-10-09  2:47     ` [dpdk-dev] [PATCH v4 2/2] test/ring: add test cases for configurable element size ring Honnappa Nagarahalli
  1 sibling, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-09  2:47 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Current APIs assume ring elements to be pointers. However, in many
use cases, the size can be different. Add new APIs to support
configurable ring element sizes.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_ring/Makefile             |   3 +-
 lib/librte_ring/meson.build          |   3 +
 lib/librte_ring/rte_ring.c           |  45 +-
 lib/librte_ring/rte_ring.h           |   1 +
 lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   2 +
 6 files changed, 991 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_elem.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 21a36770d..515a967bb 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_ring.a
 
-CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal
 
 EXPORT_MAP := rte_ring_version.map
@@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+					rte_ring_elem.h \
 					rte_ring_generic.h \
 					rte_ring_c11_mem.h
 
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index ab8b0b469..74219840a 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -6,3 +6,6 @@ sources = files('rte_ring.c')
 headers = files('rte_ring.h',
 		'rte_ring_c11_mem.h',
 		'rte_ring_generic.h')
+
+# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
+allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d9b308036..6fed3648b 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -33,6 +33,7 @@
 #include <rte_tailq.h>
 
 #include "rte_ring.h"
+#include "rte_ring_elem.h"
 
 TAILQ_HEAD(rte_ring_list, rte_tailq_entry);
 
@@ -46,23 +47,42 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_elem(unsigned count, unsigned esize)
 {
 	ssize_t sz;
 
+	/* Supported esize values are 4/8/16.
+	 * Others can be added on need basis.
+	 */
+	if ((esize != 4) && (esize != 8) && (esize != 16)) {
+		RTE_LOG(ERR, RING,
+			"Unsupported esize value. Supported values are 4, 8 and 16\n");
+
+		return -EINVAL;
+	}
+
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
 		RTE_LOG(ERR, RING,
-			"Requested size is invalid, must be power of 2, and "
-			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
+			"Requested number of elements is invalid, must be "
+			"power of 2, and do not exceed the limit %u\n",
+			RTE_RING_SZ_MASK);
+
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	sz = sizeof(struct rte_ring) + count * esize;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
 
+/* return the size of memory occupied by a ring */
+ssize_t
+rte_ring_get_memsize(unsigned count)
+{
+	return rte_ring_get_memsize_elem(count, sizeof(void *));
+}
+
 void
 rte_ring_reset(struct rte_ring *r)
 {
@@ -114,10 +134,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	return 0;
 }
 
-/* create the ring */
+/* create the ring for a given element size */
 struct rte_ring *
-rte_ring_create(const char *name, unsigned count, int socket_id,
-		unsigned flags)
+rte_ring_create_elem(const char *name, unsigned count, unsigned esize,
+		int socket_id, unsigned flags)
 {
 	char mz_name[RTE_MEMZONE_NAMESIZE];
 	struct rte_ring *r;
@@ -135,7 +155,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize_elem(count, esize);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -182,6 +202,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	return r;
 }
 
+/* create the ring */
+struct rte_ring *
+rte_ring_create(const char *name, unsigned count, int socket_id,
+		unsigned flags)
+{
+	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
+		flags);
+}
+
 /* free the ring */
 void
 rte_ring_free(struct rte_ring *r)
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 2a9f768a1..18fc5d845 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -216,6 +216,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  */
 struct rte_ring *rte_ring_create(const char *name, unsigned count,
 				 int socket_id, unsigned flags);
+
 /**
  * De-allocate all memory used by the ring.
  *
diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
new file mode 100644
index 000000000..860f059ad
--- /dev/null
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -0,0 +1,946 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 2019 Arm Limited
+ * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * All rights reserved.
+ * Derived from FreeBSD's bufring.h
+ * Used as BSD-3 Licensed with permission from Kip Macy.
+ */
+
+#ifndef _RTE_RING_ELEM_H_
+#define _RTE_RING_ELEM_H_
+
+/**
+ * @file
+ * RTE Ring with flexible element size
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#include "rte_ring.h"
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Calculate the memory size needed for a ring with given element size
+ *
+ * This function returns the number of bytes needed for a ring, given
+ * the number of elements in it and the size of the element. This value
+ * is the sum of the size of the structure rte_ring and the size of the
+ * memory needed for storing the elements. The value is aligned to a cache
+ * line size.
+ *
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported.
+ * @return
+ *   - The memory size needed for the ring on success.
+ *   - -EINVAL if count is not a power of 2 or esize is not supported.
+ */
+__rte_experimental
+ssize_t rte_ring_get_memsize_elem(unsigned count, unsigned esize);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Create a new ring named *name* that stores elements with given size.
+ *
+ * This function uses ``memzone_reserve()`` to allocate memory. Then it
+ * calls rte_ring_init() to initialize an empty ring.
+ *
+ * The new ring size is set to *count*, which must be a power of
+ * two. Water marking is disabled by default. The real usable ring size
+ * is *count-1* instead of *count* to differentiate a full ring from an
+ * empty ring.
+ *
+ * The ring is added in RTE_TAILQ_RING list.
+ *
+ * @param name
+ *   The name of the ring.
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported.
+ * @param socket_id
+ *   The *socket_id* argument is the socket identifier in case of
+ *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
+ *   constraint for the reserved zone.
+ * @param flags
+ *   An OR of the following:
+ *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
+ *      is "single-producer". Otherwise, it is "multi-producers".
+ *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
+ *      is "single-consumer". Otherwise, it is "multi-consumers".
+ * @return
+ *   On success, the pointer to the new allocated ring. NULL on error with
+ *    rte_errno set appropriately. Possible errno values include:
+ *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
+ *    - E_RTE_SECONDARY - function was called from a secondary process instance
+ *    - EINVAL - count provided is not a power of 2
+ *    - ENOSPC - the maximum number of memzones has already been allocated
+ *    - EEXIST - a memzone with the same name already exists
+ *    - ENOMEM - no appropriate memory area found in which to create memzone
+ */
+__rte_experimental
+struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
+				unsigned esize, int socket_id, unsigned flags);
+
+/* the actual enqueue of elements on the ring.
+ * Placed here since identical code needed in both
+ * single and multi producer enqueue functions.
+ */
+#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
+	if (esize == 4) \
+		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
+	else if (esize == 8) \
+		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
+	else if (esize == 16) \
+		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
+} while (0)
+
+#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	uint32_t *ring = (uint32_t *)ring_start; \
+	uint32_t *obj = (uint32_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+			ring[idx + 2] = obj[i + 2]; \
+			ring[idx + 3] = obj[i + 3]; \
+			ring[idx + 4] = obj[i + 4]; \
+			ring[idx + 5] = obj[i + 5]; \
+			ring[idx + 6] = obj[i + 6]; \
+			ring[idx + 7] = obj[i + 7]; \
+		} \
+		switch (n & 0x7) { \
+		case 7: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 6: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 5: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 4: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 3: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	uint64_t *ring = (uint64_t *)ring_start; \
+	uint64_t *obj = (uint64_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+			ring[idx + 2] = obj[i + 2]; \
+			ring[idx + 3] = obj[i + 3]; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	__uint128_t *ring = (__uint128_t *)ring_start; \
+	__uint128_t *obj = (__uint128_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+		} \
+		switch (n & 0x1) { \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+/* the actual copy of elements on the ring to obj_table.
+ * Placed here since identical code needed in both
+ * single and multi consumer dequeue functions.
+ */
+#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
+	if (esize == 4) \
+		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
+	else if (esize == 8) \
+		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
+	else if (esize == 16) \
+		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
+} while (0)
+
+#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	uint32_t *ring = (uint32_t *)ring_start; \
+	uint32_t *obj = (uint32_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8) {\
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+			obj[i + 2] = ring[idx + 2]; \
+			obj[i + 3] = ring[idx + 3]; \
+			obj[i + 4] = ring[idx + 4]; \
+			obj[i + 5] = ring[idx + 5]; \
+			obj[i + 6] = ring[idx + 6]; \
+			obj[i + 7] = ring[idx + 7]; \
+		} \
+		switch (n & 0x7) { \
+		case 7: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 6: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 5: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 4: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 3: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 2: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 1: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	uint64_t *ring = (uint64_t *)ring_start; \
+	uint64_t *obj = (uint64_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+			obj[i + 2] = ring[idx + 2]; \
+			obj[i + 3] = ring[idx + 3]; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 2: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 1: \
+			obj[i++] = ring[idx++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	__uint128_t *ring = (__uint128_t *)ring_start; \
+	__uint128_t *obj = (__uint128_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+		} \
+		switch (n & 0x1) { \
+		case 1: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+/* Between two loads, the CPU may reorder accesses on weakly ordered
+ * memory models (powerpc/arm).
+ * There are 2 choices for the users:
+ * 1. use the rmb() memory barrier
+ * 2. use the one-direction load_acquire/store_release barrier, enabled by
+ * CONFIG_RTE_USE_C11_MEM_MODEL=y
+ * It depends on performance test results.
+ * By default, move common functions to rte_ring_generic.h
+ */
+#ifdef RTE_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
+#include "rte_ring_generic.h"
+#endif
+
+/**
+ * @internal Enqueue several objects on the ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes in size.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items on the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible on the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
+		unsigned int *free_space)
+{
+	uint32_t prod_head, prod_next;
+	uint32_t free_entries;
+
+	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
+			&prod_head, &prod_next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
+
+	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal Dequeue several objects from the ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param is_sc
+ *   Indicates whether to use single consumer or multi-consumer head update
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sc,
+		unsigned int *available)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
+			&cons_head, &cons_next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS_ELEM(r, &r[1], cons_head, obj_table, esize, n);
+
+	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes in size.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_mp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes in size.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_sp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version depending on the default behavior that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes in size.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
+}
+
+/**
+ * Enqueue one object on a ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_mp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_mp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_sp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_sp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_mc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+				RTE_RING_QUEUE_FIXED, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table,
+ *   must be strictly positive.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_sc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_SC, available);
+}
+
+/**
+ * Dequeue several objects from a ring.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, r->cons.single, available);
+}
+
+/**
+ * Dequeue one object from a ring (multi-consumers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to an object of esize bytes that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue; no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_mc_dequeue_elem(struct rte_ring *r, void *obj_p,
+				unsigned int esize)
+{
+	return rte_ring_mc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL)  ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring (NOT multi-consumers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to an object of esize bytes that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_sc_dequeue_elem(struct rte_ring *r, void *obj_p,
+				unsigned int esize)
+{
+	return rte_ring_sc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to an object of esize bytes that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success, objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_dequeue_elem(struct rte_ring *r, void *obj_p, unsigned int esize)
+{
+	return rte_ring_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes in size.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_mp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes in size.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_sp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version depending on the default behavior that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes in size.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe). When more
+ * objects are requested than are available, only the available objects
+ * are dequeued.
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - n: Actual number of objects dequeued, 0 if ring is empty
+ */
+static __rte_always_inline unsigned
+rte_ring_mc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe). When
+ * more objects are requested than are available, only the available
+ * objects are dequeued.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - n: Actual number of objects dequeued, 0 if ring is empty
+ */
+static __rte_always_inline unsigned
+rte_ring_sc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+}
+
+/**
+ * Dequeue multiple objects from a ring up to a maximum number.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - Number of objects dequeued
+ */
+static __rte_always_inline unsigned
+rte_ring_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+				RTE_RING_QUEUE_VARIABLE,
+				r->cons.single, available);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_RING_ELEM_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index 510c1386e..e410a7503 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -21,6 +21,8 @@ DPDK_2.2 {
 EXPERIMENTAL {
 	global:
 
+	rte_ring_create_elem;
+	rte_ring_get_memsize_elem;
 	rte_ring_reset;
 
 };
-- 
2.17.1
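
Reader's note on the copy macros in this patch: ENQUEUE_PTRS_32/64/128 copy n
elements into the ring starting at slot (prod_head & mask) and continue from
slot 0 when the burst crosses the end of the array. The block below is only a
minimal standalone sketch of that wrap-around copy, assuming a plain array in
place of struct rte_ring. toy_ring_copy_in() and its parameters are
hypothetical names for illustration, not part of the patch, and it uses
memcpy() instead of the unrolled loops.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not part of DPDK: copy n elements of esize bytes
 * from obj_table into a ring of `size` slots (size must be a power of 2),
 * starting at slot (head & (size - 1)) and wrapping to slot 0 if needed.
 */
static void
toy_ring_copy_in(void *ring_start, uint32_t size, uint32_t head,
		const void *obj_table, uint32_t esize, uint32_t n)
{
	uint8_t *ring = ring_start;
	const uint8_t *obj = obj_table;
	uint32_t idx = head & (size - 1);
	uint32_t first = size - idx;	/* slots left before the wrap */

	assert((size & (size - 1)) == 0);	/* power of 2, so masking works */
	assert(n <= size);

	if (n <= first) {
		/* the whole burst fits before the end of the array */
		memcpy(ring + (size_t)idx * esize, obj, (size_t)n * esize);
	} else {
		/* fill up to the end, then continue from slot 0 */
		memcpy(ring + (size_t)idx * esize, obj,
				(size_t)first * esize);
		memcpy(ring, obj + (size_t)first * esize,
				(size_t)(n - first) * esize);
	}
}
```

With an 8-slot ring of uint64_t and head == 6, a burst of 4 elements lands in
slots 6, 7, 0 and 1, which is the case the else branch of the macros handles.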



* [dpdk-dev] [PATCH v4 2/2] test/ring: add test cases for configurable element size ring
  2019-10-09  2:47   ` [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs to support custom element size Honnappa Nagarahalli
  2019-10-09  2:47     ` [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable " Honnappa Nagarahalli
@ 2019-10-09  2:47     ` Honnappa Nagarahalli
  1 sibling, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-09  2:47 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Add test cases to test APIs for configurable element size ring.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
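Reader's note: get_two_hyperthreads() and get_two_cores() in
test_ring_perf_elem.c pick the first pair of distinct lcores on the same
socket whose CPU ids are equal or different, respectively. The block below is
a rough standalone model of that selection, assuming the topology is given as
a plain array rather than read through rte_lcore_to_cpu_id() and
rte_lcore_to_socket_id(). struct toy_lcore, struct toy_pair and
toy_find_pair() are hypothetical names used only for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-lcore topology the test queries. */
struct toy_lcore {
	unsigned int cpu_id;
	unsigned int socket_id;
};

struct toy_pair {
	unsigned int c1, c2;
};

/* Return 0 and fill *p with the first pair of distinct lcores on the same
 * socket whose CPU ids are equal (same_cpu != 0) or differ (same_cpu == 0).
 * Return 1 if no such pair exists, mirroring the helpers in the test.
 */
static int
toy_find_pair(const struct toy_lcore *lc, size_t n, int same_cpu,
		struct toy_pair *p)
{
	size_t i, j;

	assert(lc != NULL && p != NULL);

	for (i = 0; i < n; i++) {
		for (j = 0; j < n; j++) {
			if (i == j)
				continue;
			if (lc[i].socket_id != lc[j].socket_id)
				continue;
			if ((lc[i].cpu_id == lc[j].cpu_id) == (same_cpu != 0)) {
				p->c1 = (unsigned int)i;
				p->c2 = (unsigned int)j;
				return 0;
			}
		}
	}
	return 1;
}
```

As in the test, the inner loop rescans all ids rather than starting at i + 1;
with a small number of lcores the extra iterations are unlikely to matter.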
 app/test/Makefile              |   1 +
 app/test/meson.build           |   1 +
 app/test/test_ring_perf_elem.c | 419 +++++++++++++++++++++++++++++++++
 3 files changed, 421 insertions(+)
 create mode 100644 app/test/test_ring_perf_elem.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 26ba6fe2b..e5cb27b75 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -78,6 +78,7 @@ SRCS-y += test_rand_perf.c
 
 SRCS-y += test_ring.c
 SRCS-y += test_ring_perf.c
+SRCS-y += test_ring_perf_elem.c
 SRCS-y += test_pmd_perf.c
 
 ifeq ($(CONFIG_RTE_LIBRTE_TABLE),y)
diff --git a/app/test/meson.build b/app/test/meson.build
index ec40943bd..995ee9bc7 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ test_sources = files('commands.c',
 	'test_reorder.c',
 	'test_ring.c',
 	'test_ring_perf.c',
+	'test_ring_perf_elem.c',
 	'test_rwlock.c',
 	'test_sched.c',
 	'test_service_cores.c',
diff --git a/app/test/test_ring_perf_elem.c b/app/test/test_ring_perf_elem.c
new file mode 100644
index 000000000..fc5b82d71
--- /dev/null
+++ b/app/test/test_ring_perf_elem.c
@@ -0,0 +1,419 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+
+#include <stdio.h>
+#include <inttypes.h>
+#include <rte_ring.h>
+#include <rte_ring_elem.h>
+#include <rte_cycles.h>
+#include <rte_launch.h>
+#include <rte_pause.h>
+
+#include "test.h"
+
+/*
+ * Ring
+ * ====
+ *
+ * Measures performance of various operations using rdtsc
+ *  * Empty ring dequeue
+ *  * Enqueue/dequeue of bursts in 1 thread
+ *  * Enqueue/dequeue of bursts in 2 threads
+ */
+
+#define RING_NAME "RING_PERF"
+#define RING_SIZE 4096
+#define MAX_BURST 64
+
+/*
+ * the sizes to enqueue and dequeue in testing
+ * (marked volatile so they won't be seen as compile-time constants)
+ */
+static const volatile unsigned bulk_sizes[] = { 8, 32 };
+
+struct lcore_pair {
+	unsigned c1, c2;
+};
+
+static volatile unsigned lcore_count;
+
+/**** Functions to analyse our core mask to get cores for different tests ***/
+
+static int
+get_two_hyperthreads(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned c1, c2, s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		/* inner loop just re-reads all id's. We could skip the
+		 * first few elements, but since number of cores is small
+		 * there is little point
+		 */
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+
+			c1 = rte_lcore_to_cpu_id(id1);
+			c2 = rte_lcore_to_cpu_id(id2);
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if ((c1 == c2) && (s1 == s2)) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+static int
+get_two_cores(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned c1, c2, s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+
+			c1 = rte_lcore_to_cpu_id(id1);
+			c2 = rte_lcore_to_cpu_id(id2);
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if ((c1 != c2) && (s1 == s2)) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+static int
+get_two_sockets(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if (s1 != s2) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+/* Get cycle counts for dequeuing from an empty ring. Should be 2 or 3 cycles */
+static void
+test_empty_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 26;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned i = 0;
+	uint32_t burst[MAX_BURST];
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		rte_ring_sc_dequeue_bulk_elem(r, burst, 8, bulk_sizes[0], NULL);
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		rte_ring_mc_dequeue_bulk_elem(r, burst, 8, bulk_sizes[0], NULL);
+	const uint64_t mc_end = rte_rdtsc();
+
+	printf("SC empty dequeue: %.2F\n",
+			(double)(sc_end-sc_start) / iterations);
+	printf("MC empty dequeue: %.2F\n",
+			(double)(mc_end-mc_start) / iterations);
+}
+
+/*
+ * for the separate enqueue and dequeue threads they take in one param
+ * and return two. Input = burst size, output = cycle average for sp/sc & mp/mc
+ */
+struct thread_params {
+	struct rte_ring *r;
+	unsigned size;        /* input value, the burst size */
+	double spsc, mpmc;    /* output value, the single or multi timings */
+};
+
+/*
+ * Function that uses rdtsc to measure timing for ring enqueue. Needs pair
+ * thread running dequeue_bulk function
+ */
+static int
+enqueue_bulk(void *p)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	struct thread_params *params = p;
+	struct rte_ring *r = params->r;
+	const unsigned size = params->size;
+	unsigned i;
+	uint32_t burst[MAX_BURST] = {0};
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
+		while (lcore_count != 2)
+			rte_pause();
+
+	const uint64_t sp_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_sp_enqueue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t sp_end = rte_rdtsc();
+
+	const uint64_t mp_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_mp_enqueue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t mp_end = rte_rdtsc();
+
+	params->spsc = ((double)(sp_end - sp_start))/(iterations*size);
+	params->mpmc = ((double)(mp_end - mp_start))/(iterations*size);
+	return 0;
+}
+
+/*
+ * Function that uses rdtsc to measure timing for ring dequeue. Needs pair
+ * thread running enqueue_bulk function
+ */
+static int
+dequeue_bulk(void *p)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	struct thread_params *params = p;
+	struct rte_ring *r = params->r;
+	const unsigned size = params->size;
+	unsigned i;
+	uint32_t burst[MAX_BURST] = {0};
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
+		while (lcore_count != 2)
+			rte_pause();
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_sc_dequeue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_mc_dequeue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t mc_end = rte_rdtsc();
+
+	params->spsc = ((double)(sc_end - sc_start))/(iterations*size);
+	params->mpmc = ((double)(mc_end - mc_start))/(iterations*size);
+	return 0;
+}
+
+/*
+ * Function that calls the enqueue and dequeue bulk functions on pairs of cores.
+ * used to measure ring perf between hyperthreads, cores and sockets.
+ */
+static void
+run_on_core_pair(struct lcore_pair *cores, struct rte_ring *r,
+		lcore_function_t f1, lcore_function_t f2)
+{
+	struct thread_params param1 = {0}, param2 = {0};
+	unsigned i;
+	for (i = 0; i < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); i++) {
+		lcore_count = 0;
+		param1.size = param2.size = bulk_sizes[i];
+		param1.r = param2.r = r;
+		if (cores->c1 == rte_get_master_lcore()) {
+			rte_eal_remote_launch(f2, &param2, cores->c2);
+			f1(&param1);
+			rte_eal_wait_lcore(cores->c2);
+		} else {
+			rte_eal_remote_launch(f1, &param1, cores->c1);
+			rte_eal_remote_launch(f2, &param2, cores->c2);
+			rte_eal_wait_lcore(cores->c1);
+			rte_eal_wait_lcore(cores->c2);
+		}
+		printf("SP/SC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[i], param1.spsc + param2.spsc);
+		printf("MP/MC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[i], param1.mpmc + param2.mpmc);
+	}
+}
+
+/*
+ * Test function that determines how long an enqueue + dequeue of a single item
+ * takes on a single lcore. Result is for comparison with the bulk enq+deq.
+ */
+static void
+test_single_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 24;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned i = 0;
+	uint32_t burst[2];
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++) {
+		rte_ring_sp_enqueue_elem(r, burst, 8);
+		rte_ring_sc_dequeue_elem(r, burst, 8);
+	}
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++) {
+		rte_ring_mp_enqueue_elem(r, burst, 8);
+		rte_ring_mc_dequeue_elem(r, burst, 8);
+	}
+	const uint64_t mc_end = rte_rdtsc();
+
+	printf("SP/SC single enq/dequeue: %"PRIu64"\n",
+			(sc_end-sc_start) >> iter_shift);
+	printf("MP/MC single enq/dequeue: %"PRIu64"\n",
+			(mc_end-mc_start) >> iter_shift);
+}
+
+/*
+ * Test that does both enqueue and dequeue on a core using the burst() API calls
+ * instead of the bulk() calls used in other tests. Results should be the same
+ * as for the bulk function called on a single lcore.
+ */
+static void
+test_burst_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned sz, i = 0;
+	uint32_t burst[MAX_BURST] = {0};
+
+	for (sz = 0; sz < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); sz++) {
+		const uint64_t sc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_sp_enqueue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_sc_dequeue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t sc_end = rte_rdtsc();
+
+		const uint64_t mc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_mp_enqueue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_mc_dequeue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t mc_end = rte_rdtsc();
+
+		uint64_t mc_avg = ((mc_end-mc_start) >> iter_shift) /
+					bulk_sizes[sz];
+		uint64_t sc_avg = ((sc_end-sc_start) >> iter_shift) /
+					bulk_sizes[sz];
+
+		printf("SP/SC burst enq/dequeue (size: %u): %"PRIu64"\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC burst enq/dequeue (size: %u): %"PRIu64"\n",
+				bulk_sizes[sz], mc_avg);
+	}
+}
+
+/* Times enqueue and dequeue on a single lcore */
+static void
+test_bulk_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned sz, i = 0;
+	uint32_t burst[MAX_BURST] = {0};
+
+	for (sz = 0; sz < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); sz++) {
+		const uint64_t sc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_sp_enqueue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_sc_dequeue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t sc_end = rte_rdtsc();
+
+		const uint64_t mc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_mp_enqueue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_mc_dequeue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t mc_end = rte_rdtsc();
+
+		double sc_avg = ((double)(sc_end-sc_start) /
+				(iterations * bulk_sizes[sz]));
+		double mc_avg = ((double)(mc_end-mc_start) /
+				(iterations * bulk_sizes[sz]));
+
+		printf("SP/SC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], mc_avg);
+	}
+}
+
+static int
+test_ring_perf_elem(void)
+{
+	struct lcore_pair cores;
+	struct rte_ring *r = NULL;
+
+	r = rte_ring_create_elem(RING_NAME, RING_SIZE, 8, rte_socket_id(), 0);
+	if (r == NULL)
+		return -1;
+
+	printf("### Testing single element and burst enq/deq ###\n");
+	test_single_enqueue_dequeue(r);
+	test_burst_enqueue_dequeue(r);
+
+	printf("\n### Testing empty dequeue ###\n");
+	test_empty_dequeue(r);
+
+	printf("\n### Testing using a single lcore ###\n");
+	test_bulk_enqueue_dequeue(r);
+
+	if (get_two_hyperthreads(&cores) == 0) {
+		printf("\n### Testing using two hyperthreads ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	if (get_two_cores(&cores) == 0) {
+		printf("\n### Testing using two physical cores ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	if (get_two_sockets(&cores) == 0) {
+		printf("\n### Testing using two NUMA nodes ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	rte_ring_free(r);
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(ring_perf_elem_autotest, test_ring_perf_elem);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-09  2:47     ` [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable " Honnappa Nagarahalli
@ 2019-10-11 19:21       ` Honnappa Nagarahalli
  2019-10-14 19:41         ` Ananyev, Konstantin
  0 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-11 19:21 UTC (permalink / raw)
  To: Honnappa Nagarahalli, olivier.matz, sthemmin, jerinj,
	bruce.richardson, david.marchand, pbhagavatula,
	konstantin.ananyev
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, Honnappa Nagarahalli, nd, nd

Hi Bruce, Konstantin, Stephen,
	Appreciate if you could provide feedback on this.

Thanks,
Honnappa

> -----Original Message-----
> From: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Sent: Tuesday, October 8, 2019 9:47 PM
> To: olivier.matz@6wind.com; sthemmin@microsoft.com; jerinj@marvell.com;
> bruce.richardson@intel.com; david.marchand@redhat.com;
> pbhagavatula@marvell.com; konstantin.ananyev@intel.com; Honnappa
> Nagarahalli <Honnappa.Nagarahalli@arm.com>
> Cc: dev@dpdk.org; Dharmik Thakkar <Dharmik.Thakkar@arm.com>; Ruifeng
> Wang (Arm Technology China) <Ruifeng.Wang@arm.com>; Gavin Hu (Arm
> Technology China) <Gavin.Hu@arm.com>
> Subject: [PATCH v4 1/2] lib/ring: apis to support configurable element size
> 
> Current APIs assume ring elements to be pointers. However, in many use cases,
> the size can be different. Add new APIs to support configurable ring element
> sizes.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  lib/librte_ring/Makefile             |   3 +-
>  lib/librte_ring/meson.build          |   3 +
>  lib/librte_ring/rte_ring.c           |  45 +-
>  lib/librte_ring/rte_ring.h           |   1 +
>  lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
>  lib/librte_ring/rte_ring_version.map |   2 +
>  6 files changed, 991 insertions(+), 9 deletions(-)
>  create mode 100644 lib/librte_ring/rte_ring_elem.h
> 
> diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> index 21a36770d..515a967bb 100644
> --- a/lib/librte_ring/Makefile
> +++ b/lib/librte_ring/Makefile
> @@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
>  # library name
>  LIB = librte_ring.a
> 
> -CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
> +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
>  LDLIBS += -lrte_eal
> 
>  EXPORT_MAP := rte_ring_version.map
> @@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> 
>  # install includes
>  SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> +					rte_ring_elem.h \
>  					rte_ring_generic.h \
>  					rte_ring_c11_mem.h
> 
> diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
> index ab8b0b469..74219840a 100644
> --- a/lib/librte_ring/meson.build
> +++ b/lib/librte_ring/meson.build
> @@ -6,3 +6,6 @@ sources = files('rte_ring.c')
>  headers = files('rte_ring.h',
>  		'rte_ring_c11_mem.h',
>  		'rte_ring_generic.h')
> +
> +# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
> +allow_experimental_apis = true
> diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> index d9b308036..6fed3648b 100644
> --- a/lib/librte_ring/rte_ring.c
> +++ b/lib/librte_ring/rte_ring.c
> @@ -33,6 +33,7 @@
>  #include <rte_tailq.h>
> 
>  #include "rte_ring.h"
> +#include "rte_ring_elem.h"
> 
>  TAILQ_HEAD(rte_ring_list, rte_tailq_entry);
> 
> @@ -46,23 +47,42 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
> 
>  /* return the size of memory occupied by a ring */
>  ssize_t
> -rte_ring_get_memsize(unsigned count)
> +rte_ring_get_memsize_elem(unsigned count, unsigned esize)
>  {
>  	ssize_t sz;
> 
> +	/* Supported esize values are 4/8/16.
> +	 * Others can be added on need basis.
> +	 */
> +	if ((esize != 4) && (esize != 8) && (esize != 16)) {
> +		RTE_LOG(ERR, RING,
> +			"Unsupported esize value. Supported values are 4, 8 and 16\n");
> +
> +		return -EINVAL;
> +	}
> +
>  	/* count must be a power of 2 */
>  	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
>  		RTE_LOG(ERR, RING,
> -			"Requested size is invalid, must be power of 2, and "
> -			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
> +			"Requested number of elements is invalid, must be "
> +			"power of 2, and do not exceed the limit %u\n",
> +			RTE_RING_SZ_MASK);
> +
>  		return -EINVAL;
>  	}
> 
> -	sz = sizeof(struct rte_ring) + count * sizeof(void *);
> +	sz = sizeof(struct rte_ring) + count * esize;
>  	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
>  	return sz;
>  }
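
For reviewers, the size calculation above can be sketched as a standalone
function. This is only an illustration under stated assumptions:
SKETCH_HDR_SIZE is a hypothetical stand-in for sizeof(struct rte_ring),
SKETCH_CACHE_LINE assumes a 64-byte cache line, and the sketch rejects a
count of 0 and omits the RTE_RING_SZ_MASK upper limit that the patch checks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_CACHE_LINE 64u  /* assumed cache line size */
#define SKETCH_HDR_SIZE   128u /* hypothetical sizeof(struct rte_ring) */

/* Returns the bytes needed for the ring, or -EINVAL on invalid input. */
static long long
sketch_ring_memsize(unsigned int count, unsigned int esize)
{
	size_t sz;

	/* only element sizes 4, 8 and 16 are accepted, as in the patch */
	if (esize != 4 && esize != 8 && esize != 16)
		return -EINVAL;
	/* count must be a non-zero power of 2 (sketch choice for 0) */
	if (count == 0 || (count & (count - 1)) != 0)
		return -EINVAL;

	sz = SKETCH_HDR_SIZE + (size_t)count * esize;
	/* round up to a multiple of the cache line size */
	sz = (sz + SKETCH_CACHE_LINE - 1) & ~(size_t)(SKETCH_CACHE_LINE - 1);
	assert(sz % SKETCH_CACHE_LINE == 0);
	return (long long)sz;
}
```

Under these assumptions a 4096-entry ring of 8-byte elements needs
128 + 32768 = 32896 bytes, which is already a multiple of 64, while a 4-entry
ring of 4-byte elements needs 144 bytes, rounded up to 192.
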
> 
> +/* return the size of memory occupied by a ring */
> +ssize_t
> +rte_ring_get_memsize(unsigned count)
> +{
> +	return rte_ring_get_memsize_elem(count, sizeof(void *));
> +}
> +
>  void
>  rte_ring_reset(struct rte_ring *r)
>  {
> @@ -114,10 +134,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
>  	return 0;
>  }
> 
> -/* create the ring */
> +/* create the ring for a given element size */
>  struct rte_ring *
> -rte_ring_create(const char *name, unsigned count, int socket_id,
> -		unsigned flags)
> +rte_ring_create_elem(const char *name, unsigned count, unsigned esize,
> +		int socket_id, unsigned flags)
>  {
>  	char mz_name[RTE_MEMZONE_NAMESIZE];
>  	struct rte_ring *r;
> @@ -135,7 +155,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
>  	if (flags & RING_F_EXACT_SZ)
>  		count = rte_align32pow2(count + 1);
> 
> -	ring_size = rte_ring_get_memsize(count);
> +	ring_size = rte_ring_get_memsize_elem(count, esize);
>  	if (ring_size < 0) {
>  		rte_errno = ring_size;
>  		return NULL;
> @@ -182,6 +202,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
>  	return r;
>  }
> 
> +/* create the ring */
> +struct rte_ring *
> +rte_ring_create(const char *name, unsigned count, int socket_id,
> +		unsigned flags)
> +{
> +	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
> +		flags);
> +}
> +
>  /* free the ring */
>  void
>  rte_ring_free(struct rte_ring *r)
> diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> index 2a9f768a1..18fc5d845 100644
> --- a/lib/librte_ring/rte_ring.h
> +++ b/lib/librte_ring/rte_ring.h
> @@ -216,6 +216,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
>   */
>  struct rte_ring *rte_ring_create(const char *name, unsigned count,
>  				 int socket_id, unsigned flags);
> +
>  /**
>   * De-allocate all memory used by the ring.
>   *
> diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> new file mode 100644
> index 000000000..860f059ad
> --- /dev/null
> +++ b/lib/librte_ring/rte_ring_elem.h
> @@ -0,0 +1,946 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + *
> + * Copyright (c) 2019 Arm Limited
> + * Copyright (c) 2010-2017 Intel Corporation
> + * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
> + * All rights reserved.
> + * Derived from FreeBSD's bufring.h
> + * Used as BSD-3 Licensed with permission from Kip Macy.
> + */
> +
> +#ifndef _RTE_RING_ELEM_H_
> +#define _RTE_RING_ELEM_H_
> +
> +/**
> + * @file
> + * RTE Ring with flexible element size
> + */
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +#include <stdio.h>
> +#include <stdint.h>
> +#include <sys/queue.h>
> +#include <errno.h>
> +#include <rte_common.h>
> +#include <rte_config.h>
> +#include <rte_memory.h>
> +#include <rte_lcore.h>
> +#include <rte_atomic.h>
> +#include <rte_branch_prediction.h>
> +#include <rte_memzone.h>
> +#include <rte_pause.h>
> +
> +#include "rte_ring.h"
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Calculate the memory size needed for a ring with given element size
> + *
> + * This function returns the number of bytes needed for a ring, given
> + * the number of elements in it and the size of the element. This value
> + * is the sum of the size of the structure rte_ring and the size of the
> + * memory needed for storing the elements. The value is aligned to a cache
> + * line size.
> + *
> + * @param count
> + *   The number of elements in the ring (must be a power of 2).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported.
> + * @return
> + *   - The memory size needed for the ring on success.
> + *   - -EINVAL if count is not a power of 2 or esize is not supported.
> + */
> +__rte_experimental
> +ssize_t rte_ring_get_memsize_elem(unsigned count, unsigned esize);
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice
> + *
> + * Create a new ring named *name* that stores elements with given size.
> + *
> + * This function uses ``memzone_reserve()`` to allocate memory. Then it
> + * calls rte_ring_init() to initialize an empty ring.
> + *
> + * The new ring size is set to *count*, which must be a power of
> + * two. Water marking is disabled by default. The real usable ring size
> + * is *count-1* instead of *count* to differentiate a full ring from an
> + * empty ring.
> + *
> + * The ring is added in RTE_TAILQ_RING list.
> + *
> + * @param name
> + *   The name of the ring.
> + * @param count
> + *   The number of elements in the ring (must be a power of 2).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported.
> + * @param socket_id
> + *   The *socket_id* argument is the socket identifier in case of
> + *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
> + *   constraint for the reserved zone.
> + * @param flags
> + *   An OR of the following:
> + *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
> + *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
> + *      is "single-producer". Otherwise, it is "multi-producers".
> + *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
> + *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
> + *      is "single-consumer". Otherwise, it is "multi-consumers".
> + * @return
> + *   On success, the pointer to the newly allocated ring. NULL on error with
> + *    rte_errno set appropriately. Possible errno values include:
> + *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
> + *    - E_RTE_SECONDARY - function was called from a secondary process instance
> + *    - EINVAL - count provided is not a power of 2, or esize is not supported
> + *    - ENOSPC - the maximum number of memzones has already been allocated
> + *    - EEXIST - a memzone with the same name already exists
> + *    - ENOMEM - no appropriate memory area found in which to create memzone
> + */
> +__rte_experimental
> +struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
> +				unsigned esize, int socket_id, unsigned flags);
> +
> +/* the actual enqueue of pointers on the ring.
> + * Placed here since identical code needed in both
> + * single and multi producer enqueue functions.
> + */
> +#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
> +	if (esize == 4) \
> +		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
> +	else if (esize == 8) \
> +		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
> +	else if (esize == 16) \
> +		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
> +} while (0)
> +
> +#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
> +	unsigned int i; \
> +	const uint32_t size = (r)->size; \
> +	uint32_t idx = prod_head & (r)->mask; \
> +	uint32_t *ring = (uint32_t *)ring_start; \
> +	uint32_t *obj = (uint32_t *)obj_table; \
> +	if (likely(idx + n < size)) { \
> +		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> +			ring[idx] = obj[i]; \
> +			ring[idx + 1] = obj[i + 1]; \
> +			ring[idx + 2] = obj[i + 2]; \
> +			ring[idx + 3] = obj[i + 3]; \
> +			ring[idx + 4] = obj[i + 4]; \
> +			ring[idx + 5] = obj[i + 5]; \
> +			ring[idx + 6] = obj[i + 6]; \
> +			ring[idx + 7] = obj[i + 7]; \
> +		} \
> +		switch (n & 0x7) { \
> +		case 7: \
> +			ring[idx++] = obj[i++]; /* fallthrough */ \
> +		case 6: \
> +			ring[idx++] = obj[i++]; /* fallthrough */ \
> +		case 5: \
> +			ring[idx++] = obj[i++]; /* fallthrough */ \
> +		case 4: \
> +			ring[idx++] = obj[i++]; /* fallthrough */ \
> +		case 3: \
> +			ring[idx++] = obj[i++]; /* fallthrough */ \
> +		case 2: \
> +			ring[idx++] = obj[i++]; /* fallthrough */ \
> +		case 1: \
> +			ring[idx++] = obj[i++]; /* fallthrough */ \
> +		} \
> +	} else { \
> +		for (i = 0; idx < size; i++, idx++)\
> +			ring[idx] = obj[i]; \
> +		for (idx = 0; i < n; i++, idx++) \
> +			ring[idx] = obj[i]; \
> +	} \
> +} while (0)
> +
> +#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
> +	unsigned int i; \
> +	const uint32_t size = (r)->size; \
> +	uint32_t idx = prod_head & (r)->mask; \
> +	uint64_t *ring = (uint64_t *)ring_start; \
> +	uint64_t *obj = (uint64_t *)obj_table; \
> +	if (likely(idx + n < size)) { \
> +		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
> +			ring[idx] = obj[i]; \
> +			ring[idx + 1] = obj[i + 1]; \
> +			ring[idx + 2] = obj[i + 2]; \
> +			ring[idx + 3] = obj[i + 3]; \
> +		} \
> +		switch (n & 0x3) { \
> +		case 3: \
> +			ring[idx++] = obj[i++]; /* fallthrough */ \
> +		case 2: \
> +			ring[idx++] = obj[i++]; /* fallthrough */ \
> +		case 1: \
> +			ring[idx++] = obj[i++]; \
> +		} \
> +	} else { \
> +		for (i = 0; idx < size; i++, idx++)\
> +			ring[idx] = obj[i]; \
> +		for (idx = 0; i < n; i++, idx++) \
> +			ring[idx] = obj[i]; \
> +	} \
> +} while (0)
> +
> +#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
> +	unsigned int i; \
> +	const uint32_t size = (r)->size; \
> +	uint32_t idx = prod_head & (r)->mask; \
> +	__uint128_t *ring = (__uint128_t *)ring_start; \
> +	__uint128_t *obj = (__uint128_t *)obj_table; \
> +	if (likely(idx + n < size)) { \
> +		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
> +			ring[idx] = obj[i]; \
> +			ring[idx + 1] = obj[i + 1]; \
> +		} \
> +		switch (n & 0x1) { \
> +		case 1: \
> +			ring[idx++] = obj[i++]; \
> +		} \
> +	} else { \
> +		for (i = 0; idx < size; i++, idx++)\
> +			ring[idx] = obj[i]; \
> +		for (idx = 0; i < n; i++, idx++) \
> +			ring[idx] = obj[i]; \
> +	} \
> +} while (0)
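
The three enqueue macros above share one shape: mask the head into a slot
index, copy contiguously when the run fits before the end of the ring, and
otherwise copy in two pieces with a wrap to slot 0. A minimal size-agnostic
sketch of that shape follows. It is not the patch's code: it uses memcpy
instead of the unrolled typed stores, and sketch_enqueue_copy is a
hypothetical name.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy n elements of esize bytes from obj_table into a ring of `size` slots,
 * starting at slot (prod_head & (size - 1)) and wrapping to slot 0.
 */
static void
sketch_enqueue_copy(void *ring_start, uint32_t size, uint32_t prod_head,
		const void *obj_table, uint32_t esize, uint32_t n)
{
	uint8_t *ring = ring_start;
	const uint8_t *obj = obj_table;
	uint32_t idx;

	assert(size != 0 && (size & (size - 1)) == 0); /* power of 2 */
	assert(n <= size);
	idx = prod_head & (size - 1);

	if (idx + n <= size) {
		/* no wrap: one contiguous copy */
		memcpy(ring + (size_t)idx * esize, obj, (size_t)n * esize);
	} else {
		/* wrap: fill to the end of the ring, then restart at slot 0 */
		uint32_t first = size - idx;

		memcpy(ring + (size_t)idx * esize, obj, (size_t)first * esize);
		memcpy(ring, obj + (size_t)first * esize,
				(size_t)(n - first) * esize);
	}
}
```

The macros test idx + n < size, which sends the exact-fit case through the
two-loop path; the sketch uses <= instead, and both give the same result.
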
> +
> +/* the actual copy of pointers on the ring to obj_table.
> + * Placed here since identical code needed in both
> + * single and multi consumer dequeue functions.
> + */
> +#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
> +	if (esize == 4) \
> +		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
> +	else if (esize == 8) \
> +		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
> +	else if (esize == 16) \
> +		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
> +} while (0)
> +
> +#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
> +	unsigned int i; \
> +	uint32_t idx = cons_head & (r)->mask; \
> +	const uint32_t size = (r)->size; \
> +	uint32_t *ring = (uint32_t *)ring_start; \
> +	uint32_t *obj = (uint32_t *)obj_table; \
> +	if (likely(idx + n < size)) { \
> +		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8) {\
> +			obj[i] = ring[idx]; \
> +			obj[i + 1] = ring[idx + 1]; \
> +			obj[i + 2] = ring[idx + 2]; \
> +			obj[i + 3] = ring[idx + 3]; \
> +			obj[i + 4] = ring[idx + 4]; \
> +			obj[i + 5] = ring[idx + 5]; \
> +			obj[i + 6] = ring[idx + 6]; \
> +			obj[i + 7] = ring[idx + 7]; \
> +		} \
> +		switch (n & 0x7) { \
> +		case 7: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		case 6: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		case 5: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		case 4: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		case 3: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		case 2: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		case 1: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		} \
> +	} else { \
> +		for (i = 0; idx < size; i++, idx++) \
> +			obj[i] = ring[idx]; \
> +		for (idx = 0; i < n; i++, idx++) \
> +			obj[i] = ring[idx]; \
> +	} \
> +} while (0)
> +
> +#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
> +	unsigned int i; \
> +	uint32_t idx = cons_head & (r)->mask; \
> +	const uint32_t size = (r)->size; \
> +	uint64_t *ring = (uint64_t *)ring_start; \
> +	uint64_t *obj = (uint64_t *)obj_table; \
> +	if (likely(idx + n < size)) { \
> +		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
> +			obj[i] = ring[idx]; \
> +			obj[i + 1] = ring[idx + 1]; \
> +			obj[i + 2] = ring[idx + 2]; \
> +			obj[i + 3] = ring[idx + 3]; \
> +		} \
> +		switch (n & 0x3) { \
> +		case 3: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		case 2: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		case 1: \
> +			obj[i++] = ring[idx++]; \
> +		} \
> +	} else { \
> +		for (i = 0; idx < size; i++, idx++) \
> +			obj[i] = ring[idx]; \
> +		for (idx = 0; i < n; i++, idx++) \
> +			obj[i] = ring[idx]; \
> +	} \
> +} while (0)
> +
> +#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
> +	unsigned int i; \
> +	uint32_t idx = cons_head & (r)->mask; \
> +	const uint32_t size = (r)->size; \
> +	__uint128_t *ring = (__uint128_t *)ring_start; \
> +	__uint128_t *obj = (__uint128_t *)obj_table; \
> +	if (likely(idx + n < size)) { \
> +		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
> +			obj[i] = ring[idx]; \
> +			obj[i + 1] = ring[idx + 1]; \
> +		} \
> +		switch (n & 0x1) { \
> +		case 1: \
> +			obj[i++] = ring[idx++]; /* fallthrough */ \
> +		} \
> +	} else { \
> +		for (i = 0; idx < size; i++, idx++) \
> +			obj[i] = ring[idx]; \
> +		for (idx = 0; i < n; i++, idx++) \
> +			obj[i] = ring[idx]; \
> +	} \
> +} while (0)
> +
> +/* Between load and load, there might be CPU reordering in weak memory
> + * models (powerpc/arm).
> + * There are 2 choices for the users
> + * 1.use rmb() memory barrier
> + * 2.use one-direction load_acquire/store_release barrier,defined by
> + * CONFIG_RTE_USE_C11_MEM_MODEL=y
> + * It depends on performance test results.
> + * By default, move common functions to rte_ring_generic.h
> + */
> +#ifdef RTE_USE_C11_MEM_MODEL
> +#include "rte_ring_c11_mem.h"
> +#else
> +#include "rte_ring_generic.h"
> +#endif
> +
> +/**
> + * @internal Enqueue several objects on the ring
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of objects, each of esize bytes.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to a ring
> + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to ring
> + * @param is_sp
> + *   Indicates whether to use single producer or multi-producer head update
> + * @param free_space
> + *   returns the amount of space after the enqueue operation has finished
> + * @return
> + *   Actual number of objects enqueued.
> + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
> +		unsigned int esize, unsigned int n,
> +		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> +		unsigned int *free_space)
> +{
> +	uint32_t prod_head, prod_next;
> +	uint32_t free_entries;
> +
> +	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
> +			&prod_head, &prod_next, &free_entries);
> +	if (n == 0)
> +		goto end;
> +
> +	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
> +
> +	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> +end:
> +	if (free_space != NULL)
> +		*free_space = free_entries - n;
> +	return n;
> +}
> +
> +/**
> + * @internal Dequeue several objects from the ring
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to pull from the ring.
> + * @param behavior
> + *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
> + *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
> + * @param is_sc
> + *   Indicates whether to use single consumer or multi-consumer head update
> + * @param available
> + *   returns the number of remaining ring entries after the dequeue has
> + *   finished
> + * @return
> + *   - Actual number of objects dequeued.
> + *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> + */
> +static __rte_always_inline unsigned int
> +__rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
> +		unsigned int esize, unsigned int n,
> +		enum rte_ring_queue_behavior behavior, unsigned int is_sc,
> +		unsigned int *available)
> +{
> +	uint32_t cons_head, cons_next;
> +	uint32_t entries;
> +
> +	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
> +			&cons_head, &cons_next, &entries);
> +	if (n == 0)
> +		goto end;
> +
> +	DEQUEUE_PTRS_ELEM(r, &r[1], cons_head, obj_table, esize, n);
> +
> +	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
> +
> +end:
> +	if (available != NULL)
> +		*available = entries - n;
> +	return n;
> +}
> +
> +/**
> + * Enqueue several objects on the ring (multi-producers safe).
> + *
> + * This function uses a "compare and set" instruction to move the
> + * producer index atomically.
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param free_space
> + *   if non-NULL, returns the amount of space in the ring after the
> + *   enqueue operation has finished.
> + * @return
> + *   The number of objects enqueued, either 0 or n
> + */
> +static __rte_always_inline unsigned int
> +rte_ring_mp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *free_space)
> +{
> +	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
> +}
> +
> +/**
> + * Enqueue several objects on a ring (NOT multi-producers safe).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param free_space
> + *   if non-NULL, returns the amount of space in the ring after the
> + *   enqueue operation has finished.
> + * @return
> + *   The number of objects enqueued, either 0 or n
> + */
> +static __rte_always_inline unsigned int
> +rte_ring_sp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *free_space)
> +{
> +	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
> +}
> +
> +/**
> + * Enqueue several objects on a ring.
> + *
> + * This function calls the multi-producer or the single-producer
> + * version depending on the default behavior that was specified at
> + * ring creation time (see flags).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param free_space
> + *   if non-NULL, returns the amount of space in the ring after the
> + *   enqueue operation has finished.
> + * @return
> + *   The number of objects enqueued, either 0 or n
> + */
> +static __rte_always_inline unsigned int
> +rte_ring_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *free_space)
> +{
> +	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
> +}
> +
> +/**
> + * Enqueue one object on a ring (multi-producers safe).
> + *
> + * This function uses a "compare and set" instruction to move the
> + * producer index atomically.
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj
> + *   A pointer to the object to be added.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @return
> + *   - 0: Success; objects enqueued.
> + *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is
> + *     enqueued.
> + */
> +static __rte_always_inline int
> +rte_ring_mp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
> +{
> +	return rte_ring_mp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
> +								-ENOBUFS;
> +}
> +
> +/**
> + * Enqueue one object on a ring (NOT multi-producers safe).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj
> + *   A pointer to the object to be added.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @return
> + *   - 0: Success; objects enqueued.
> + *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is
> + *     enqueued.
> + */
> +static __rte_always_inline int
> +rte_ring_sp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
> +{
> +	return rte_ring_sp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
> +								-ENOBUFS;
> +}
> +
> +/**
> + * Enqueue one object on a ring.
> + *
> + * This function calls the multi-producer or the single-producer
> + * version, depending on the default behaviour that was specified at
> + * ring creation time (see flags).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj
> + *   A pointer to the object to be added.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @return
> + *   - 0: Success; objects enqueued.
> + *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is
> + *     enqueued.
> + */
> +static __rte_always_inline int
> +rte_ring_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
> +{
> +	return rte_ring_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
> +								-ENOBUFS;
> +}
> +
> +/**
> + * Dequeue several objects from a ring (multi-consumers safe).
> + *
> + * This function uses a "compare and set" instruction to move the
> + * consumer index atomically.
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects) that will be filled.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to dequeue from the ring to the obj_table.
> + * @param available
> + *   If non-NULL, returns the number of remaining ring entries after the
> + *   dequeue has finished.
> + * @return
> + *   The number of objects dequeued, either 0 or n
> + */
> +static __rte_always_inline unsigned int
> +rte_ring_mc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *available)
> +{
> +	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
> +				RTE_RING_QUEUE_FIXED, __IS_MC, available);
> +}
> +
> +/**
> + * Dequeue several objects from a ring (NOT multi-consumers safe).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects) that will be filled.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to dequeue from the ring to the obj_table,
> + *   must be strictly positive.
> + * @param available
> + *   If non-NULL, returns the number of remaining ring entries after the
> + *   dequeue has finished.
> + * @return
> + *   The number of objects dequeued, either 0 or n
> + */
> +static __rte_always_inline unsigned int
> +rte_ring_sc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *available)
> +{
> +	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_FIXED, __IS_SC, available);
> +}
> +
> +/**
> + * Dequeue several objects from a ring.
> + *
> + * This function calls the multi-consumers or the single-consumer
> + * version, depending on the default behaviour that was specified at
> + * ring creation time (see flags).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects) that will be filled.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to dequeue from the ring to the obj_table.
> + * @param available
> + *   If non-NULL, returns the number of remaining ring entries after the
> + *   dequeue has finished.
> + * @return
> + *   The number of objects dequeued, either 0 or n
> + */
> +static __rte_always_inline unsigned int
> +rte_ring_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *available)
> +{
> +	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_FIXED, r->cons.single, available);
> +}
> +
> +/**
> + * Dequeue one object from a ring (multi-consumers safe).
> + *
> + * This function uses a "compare and set" instruction to move the
> + * consumer index atomically.
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_p
> + *   A pointer to a void * pointer (object) that will be filled.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @return
> + *   - 0: Success; objects dequeued.
> + *   - -ENOENT: Not enough entries in the ring to dequeue; no object is
> + *     dequeued.
> + */
> +static __rte_always_inline int
> +rte_ring_mc_dequeue_elem(struct rte_ring *r, void *obj_p,
> +				unsigned int esize)
> +{
> +	return rte_ring_mc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL)  ? 0 :
> +								-ENOENT;
> +}
> +
> +/**
> + * Dequeue one object from a ring (NOT multi-consumers safe).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_p
> + *   A pointer to a void * pointer (object) that will be filled.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @return
> + *   - 0: Success; objects dequeued.
> + *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
> + *     dequeued.
> + */
> +static __rte_always_inline int
> +rte_ring_sc_dequeue_elem(struct rte_ring *r, void *obj_p,
> +				unsigned int esize)
> +{
> +	return rte_ring_sc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
> +								-ENOENT;
> +}
> +
> +/**
> + * Dequeue one object from a ring.
> + *
> + * This function calls the multi-consumers or the single-consumer
> + * version depending on the default behaviour that was specified at
> + * ring creation time (see flags).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_p
> + *   A pointer to a void * pointer (object) that will be filled.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @return
> + *   - 0: Success, objects dequeued.
> + *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
> + *     dequeued.
> + */
> +static __rte_always_inline int
> +rte_ring_dequeue_elem(struct rte_ring *r, void *obj_p, unsigned int esize)
> +{
> +	return rte_ring_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
> +								-ENOENT;
> +}
> +
> +/**
> + * Enqueue several objects on the ring (multi-producers safe).
> + *
> + * This function uses a "compare and set" instruction to move the
> + * producer index atomically.
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param free_space
> + *   if non-NULL, returns the amount of space in the ring after the
> + *   enqueue operation has finished.
> + * @return
> + *   - n: Actual number of objects enqueued.
> + */
> +static __rte_always_inline unsigned
> +rte_ring_mp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *free_space)
> +{
> +	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
> +}
> +
> +/**
> + * Enqueue several objects on a ring (NOT multi-producers safe).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param free_space
> + *   if non-NULL, returns the amount of space in the ring after the
> + *   enqueue operation has finished.
> + * @return
> + *   - n: Actual number of objects enqueued.
> + */
> +static __rte_always_inline unsigned
> +rte_ring_sp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *free_space)
> +{
> +	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
> +}
> +
> +/**
> + * Enqueue several objects on a ring.
> + *
> + * This function calls the multi-producer or the single-producer
> + * version depending on the default behavior that was specified at
> + * ring creation time (see flags).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to add in the ring from the obj_table.
> + * @param free_space
> + *   if non-NULL, returns the amount of space in the ring after the
> + *   enqueue operation has finished.
> + * @return
> + *   - n: Actual number of objects enqueued.
> + */
> +static __rte_always_inline unsigned
> +rte_ring_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *free_space)
> +{
> +	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
> +}
> +
> +/**
> + * Dequeue several objects from a ring (multi-consumers safe). When more
> + * objects are requested than are available, only the available objects
> + * are dequeued.
> + *
> + * This function uses a "compare and set" instruction to move the
> + * consumer index atomically.
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects) that will be filled.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to dequeue from the ring to the obj_table.
> + * @param available
> + *   If non-NULL, returns the number of remaining ring entries after the
> + *   dequeue has finished.
> + * @return
> + *   - n: Actual number of objects dequeued, 0 if ring is empty
> + */
> +static __rte_always_inline unsigned
> +rte_ring_mc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *available)
> +{
> +	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
> +}
> +
> +/**
> + * Dequeue several objects from a ring (NOT multi-consumers safe). When
> + * more objects are requested than are available, only the available
> + * objects are dequeued.
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects) that will be filled.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to dequeue from the ring to the obj_table.
> + * @param available
> + *   If non-NULL, returns the number of remaining ring entries after the
> + *   dequeue has finished.
> + * @return
> + *   - n: Actual number of objects dequeued, 0 if ring is empty
> + */
> +static __rte_always_inline unsigned
> +rte_ring_sc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *available)
> +{
> +	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
> +			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
> +}
> +
> +/**
> + * Dequeue multiple objects from a ring up to a maximum number.
> + *
> + * This function calls the multi-consumers or the single-consumer
> + * version, depending on the default behaviour that was specified at
> + * ring creation time (see flags).
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects) that will be filled.
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.
> + * @param n
> + *   The number of objects to dequeue from the ring to the obj_table.
> + * @param available
> + *   If non-NULL, returns the number of remaining ring entries after the
> + *   dequeue has finished.
> + * @return
> + *   - Number of objects dequeued
> + */
> +static __rte_always_inline unsigned
> +rte_ring_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
> +		unsigned int esize, unsigned int n, unsigned int *available)
> +{
> +	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
> +				RTE_RING_QUEUE_VARIABLE,
> +				r->cons.single, available);
> +}
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_RING_ELEM_H_ */
> diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
> index 510c1386e..e410a7503 100644
> --- a/lib/librte_ring/rte_ring_version.map
> +++ b/lib/librte_ring/rte_ring_version.map
> @@ -21,6 +21,8 @@ DPDK_2.2 {
>  EXPERIMENTAL {
>  	global:
> 
> +	rte_ring_create_elem;
> +	rte_ring_get_memsize_elem;
>  	rte_ring_reset;
> 
>  };
> --
> 2.17.1
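
The difference between the *_bulk_elem (RTE_RING_QUEUE_FIXED) and
*_burst_elem (RTE_RING_QUEUE_VARIABLE) variants quoted above, and the
meaning of the free_space output, can be sketched with a small
single-threaded model. This is only an illustration under stated
assumptions, not DPDK code: every sketch_* name is hypothetical, and the
model covers the index arithmetic only, not the element copy or the
atomic head update.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; they mirror the two queue behaviors in the patch. */
enum sketch_behavior { SKETCH_FIXED, SKETCH_VARIABLE };

/* Index-only model of a ring: capacity is the usable slot count. */
struct sketch_ring {
	uint32_t capacity;
	uint32_t prod_head;
	uint32_t cons_tail;
};

/* Reserve up to n slots; returns how many were actually reserved. */
static unsigned int
sketch_reserve(struct sketch_ring *r, unsigned int n,
		enum sketch_behavior behavior, unsigned int *free_space)
{
	/* Unsigned subtraction stays correct when the indices wrap. */
	uint32_t free_entries = r->capacity + r->cons_tail - r->prod_head;

	assert(free_entries <= r->capacity);
	/* FIXED is all-or-nothing, VARIABLE takes whatever fits. */
	if (n > free_entries)
		n = (behavior == SKETCH_FIXED) ? 0 : free_entries;
	r->prod_head += n;
	if (free_space != NULL)
		*free_space = free_entries - n;
	return n;
}

/* Single-element wrapper: 0 on success, -ENOBUFS when the ring is full. */
static int
sketch_enqueue_one(struct sketch_ring *r)
{
	return sketch_reserve(r, 1, SKETCH_FIXED, NULL) ? 0 : -ENOBUFS;
}
```

With 2 free slots left, a FIXED request for 4 reserves nothing and
reports 2 free, while a VARIABLE request for 4 reserves 2 and reports 0.
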


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-11 19:21       ` Honnappa Nagarahalli
@ 2019-10-14 19:41         ` Ananyev, Konstantin
  2019-10-14 23:56           ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-14 19:41 UTC (permalink / raw)
  To: Honnappa Nagarahalli, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd, nd


> >
> > Current APIs assume ring elements to be pointers. However, in many use cases,
> > the size can be different. Add new APIs to support configurable ring element
> > sizes.
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  lib/librte_ring/Makefile             |   3 +-
> >  lib/librte_ring/meson.build          |   3 +
> >  lib/librte_ring/rte_ring.c           |  45 +-
> >  lib/librte_ring/rte_ring.h           |   1 +
> >  lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
> >  lib/librte_ring/rte_ring_version.map |   2 +
> >  6 files changed, 991 insertions(+), 9 deletions(-)
> >  create mode 100644 lib/librte_ring/rte_ring_elem.h
> >
> > diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> > index 21a36770d..515a967bb 100644
> > --- a/lib/librte_ring/Makefile
> > +++ b/lib/librte_ring/Makefile
> > @@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
> >  # library name
> >  LIB = librte_ring.a
> >
> > -CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
> > +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
> >  LDLIBS += -lrte_eal
> >
> >  EXPORT_MAP := rte_ring_version.map
> > @@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> >
> >  # install includes
> >  SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> > +					rte_ring_elem.h \
> >  					rte_ring_generic.h \
> >  					rte_ring_c11_mem.h
> >
> > diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
> > index ab8b0b469..74219840a 100644
> > --- a/lib/librte_ring/meson.build
> > +++ b/lib/librte_ring/meson.build
> > @@ -6,3 +6,6 @@ sources = files('rte_ring.c')
> >  headers = files('rte_ring.h',
> >  		'rte_ring_c11_mem.h',
> >  		'rte_ring_generic.h')
> > +
> > +# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
> > +allow_experimental_apis = true
> > diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> > index d9b308036..6fed3648b 100644
> > --- a/lib/librte_ring/rte_ring.c
> > +++ b/lib/librte_ring/rte_ring.c
> > @@ -33,6 +33,7 @@
> >  #include <rte_tailq.h>
> >
> >  #include "rte_ring.h"
> > +#include "rte_ring_elem.h"
> >
> >  TAILQ_HEAD(rte_ring_list, rte_tailq_entry);
> >
> > @@ -46,23 +47,42 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
> >
> >  /* return the size of memory occupied by a ring */
> >  ssize_t
> > -rte_ring_get_memsize(unsigned count)
> > +rte_ring_get_memsize_elem(unsigned count, unsigned esize)
> >  {
> >  	ssize_t sz;
> >
> > +	/* Supported esize values are 4/8/16.
> > +	 * Others can be added on need basis.
> > +	 */
> > +	if ((esize != 4) && (esize != 8) && (esize != 16)) {
> > +		RTE_LOG(ERR, RING,
> > +			"Unsupported esize value. Supported values are 4, 8 and 16\n");
> > +
> > +		return -EINVAL;
> > +	}
> > +
> >  	/* count must be a power of 2 */
> >  	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
> >  		RTE_LOG(ERR, RING,
> > -			"Requested size is invalid, must be power of 2, and "
> > -			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
> > +			"Requested number of elements is invalid, must be "
> > +			"power of 2, and do not exceed the limit %u\n",
> > +			RTE_RING_SZ_MASK);
> > +
> >  		return -EINVAL;
> >  	}
> >
> > -	sz = sizeof(struct rte_ring) + count * sizeof(void *);
> > +	sz = sizeof(struct rte_ring) + count * esize;
> >  	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
> >  	return sz;
> >  }
> >
> > +/* return the size of memory occupied by a ring */
> > +ssize_t
> > +rte_ring_get_memsize(unsigned count)
> > +{
> > +	return rte_ring_get_memsize_elem(count, sizeof(void *));
> > +}
> > +
> >  void
> >  rte_ring_reset(struct rte_ring *r)
> >  {
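
The size computation in the rte_ring_get_memsize_elem() hunk above
(validate esize, validate count, add the header size to count * esize,
round up to a cache line) can be tried in isolation with the sketch
below. Everything named sketch_* or SKETCH_* is hypothetical: the header
size is a placeholder for sizeof(struct rte_ring), the cache line and
count limit are assumed values, and rejecting a zero count is a choice
made in this sketch rather than something the patch states.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed stand-ins; the real values come from the DPDK headers. */
#define SKETCH_CACHE_LINE 64u
#define SKETCH_HDR_SIZE   384u        /* placeholder for sizeof(struct rte_ring) */
#define SKETCH_SZ_MASK    0x7fffffffu /* assumed upper bound on count */

/* Bytes needed for `count` slots of `esize` bytes each, or -EINVAL. */
static int64_t
sketch_ring_memsize(unsigned int count, unsigned int esize)
{
	uint64_t sz;

	/* Only 4-, 8- and 16-byte elements are accepted. */
	if (esize != 4 && esize != 8 && esize != 16)
		return -EINVAL;
	/* count must be a non-zero power of two within the limit. */
	if (count == 0 || (count & (count - 1)) != 0 || count > SKETCH_SZ_MASK)
		return -EINVAL;

	sz = SKETCH_HDR_SIZE + (uint64_t)count * esize;
	/* Round up to a whole number of cache lines. */
	sz = (sz + SKETCH_CACHE_LINE - 1) & ~(uint64_t)(SKETCH_CACHE_LINE - 1);
	assert(sz % SKETCH_CACHE_LINE == 0);
	return (int64_t)sz;
}
```

The point of the esize parameter shows up directly: with the same count,
a ring of 4-byte elements needs a quarter of the slot storage of a ring
of 16-byte elements.
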
> > @@ -114,10 +134,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> >  	return 0;
> >  }
> >
> > -/* create the ring */
> > +/* create the ring for a given element size */
> >  struct rte_ring *
> > -rte_ring_create(const char *name, unsigned count, int socket_id,
> > -		unsigned flags)
> > +rte_ring_create_elem(const char *name, unsigned count, unsigned esize,
> > +		int socket_id, unsigned flags)
> >  {
> >  	char mz_name[RTE_MEMZONE_NAMESIZE];
> >  	struct rte_ring *r;
> > @@ -135,7 +155,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
> >  	if (flags & RING_F_EXACT_SZ)
> >  		count = rte_align32pow2(count + 1);
> >
> > -	ring_size = rte_ring_get_memsize(count);
> > +	ring_size = rte_ring_get_memsize_elem(count, esize);
> >  	if (ring_size < 0) {
> >  		rte_errno = ring_size;
> >  		return NULL;
> > @@ -182,6 +202,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
> >  	return r;
> >  }
> >
> > +/* create the ring */
> > +struct rte_ring *
> > +rte_ring_create(const char *name, unsigned count, int socket_id,
> > +		unsigned flags)
> > +{
> > +	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
> > +		flags);
> > +}
> > +
> >  /* free the ring */
> >  void
> >  rte_ring_free(struct rte_ring *r)
> > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > index 2a9f768a1..18fc5d845 100644
> > --- a/lib/librte_ring/rte_ring.h
> > +++ b/lib/librte_ring/rte_ring.h
> > @@ -216,6 +216,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> >   */
> >  struct rte_ring *rte_ring_create(const char *name, unsigned count,
> >  				 int socket_id, unsigned flags);
> > +
> >  /**
> >   * De-allocate all memory used by the ring.
> >   *
> > diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> > new file mode 100644
> > index 000000000..860f059ad
> > --- /dev/null
> > +++ b/lib/librte_ring/rte_ring_elem.h
> > @@ -0,0 +1,946 @@
> > +/* SPDX-License-Identifier: BSD-3-Clause
> > + *
> > + * Copyright (c) 2019 Arm Limited
> > + * Copyright (c) 2010-2017 Intel Corporation
> > + * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
> > + * All rights reserved.
> > + * Derived from FreeBSD's bufring.h
> > + * Used as BSD-3 Licensed with permission from Kip Macy.
> > + */
> > +
> > +#ifndef _RTE_RING_ELEM_H_
> > +#define _RTE_RING_ELEM_H_
> > +
> > +/**
> > + * @file
> > + * RTE Ring with flexible element size
> > + */
> > +
> > +#ifdef __cplusplus
> > +extern "C" {
> > +#endif
> > +
> > +#include <stdio.h>
> > +#include <stdint.h>
> > +#include <sys/queue.h>
> > +#include <errno.h>
> > +#include <rte_common.h>
> > +#include <rte_config.h>
> > +#include <rte_memory.h>
> > +#include <rte_lcore.h>
> > +#include <rte_atomic.h>
> > +#include <rte_branch_prediction.h>
> > +#include <rte_memzone.h>
> > +#include <rte_pause.h>
> > +
> > +#include "rte_ring.h"
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Calculate the memory size needed for a ring with given element size
> > + *
> > + * This function returns the number of bytes needed for a ring, given
> > + * the number of elements in it and the size of the element. This value
> > + * is the sum of the size of the structure rte_ring and the size of the
> > + * memory needed for storing the elements. The value is aligned to a cache
> > + * line size.
> > + *
> > + * @param count
> > + *   The number of elements in the ring (must be a power of 2).
> > + * @param esize
> > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > + *   Currently, sizes 4, 8 and 16 are supported.
> > + * @return
> > + *   - The memory size needed for the ring on success.
> > + *   - -EINVAL if count is not a power of 2 or esize is not supported.
> > + */
> > +__rte_experimental
> > +ssize_t rte_ring_get_memsize_elem(unsigned count, unsigned esize);
> > +
> > +/**
> > + * @warning
> > + * @b EXPERIMENTAL: this API may change without prior notice
> > + *
> > + * Create a new ring named *name* that stores elements with given size.
> > + *
> > + * This function uses ``memzone_reserve()`` to allocate memory. Then it
> > + * calls rte_ring_init() to initialize an empty ring.
> > + *
> > + * The new ring size is set to *count*, which must be a power of
> > + * two. Water marking is disabled by default. The real usable ring size
> > + * is *count-1* instead of *count* to differentiate a free ring from an
> > + * empty ring.
> > + *
> > + * The ring is added in RTE_TAILQ_RING list.
> > + *
> > + * @param name
> > + *   The name of the ring.
> > + * @param count
> > + *   The number of elements in the ring (must be a power of 2).
> > + * @param esize
> > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > + *   Currently, sizes 4, 8 and 16 are supported.
> > + * @param socket_id
> > + *   The *socket_id* argument is the socket identifier in case of
> > + *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
> > + *   constraint for the reserved zone.
> > + * @param flags
> > + *   An OR of the following:
> > + *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
> > + *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
> > + *      is "single-producer". Otherwise, it is "multi-producers".
> > + *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
> > + *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
> > + *      is "single-consumer". Otherwise, it is "multi-consumers".
> > + * @return
> > + *   On success, the pointer to the new allocated ring. NULL on error with
> > + *    rte_errno set appropriately. Possible errno values include:
> > + *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
> > + *    - E_RTE_SECONDARY - function was called from a secondary process instance
> > + *    - EINVAL - count provided is not a power of 2
> > + *    - ENOSPC - the maximum number of memzones has already been allocated
> > + *    - EEXIST - a memzone with the same name already exists
> > + *    - ENOMEM - no appropriate memory area found in which to create memzone
> > + */
> > +__rte_experimental
> > +struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
> > +				unsigned esize, int socket_id, unsigned flags);
> > +
> > +/* the actual enqueue of pointers on the ring.
> > + * Placed here since identical code needed in both
> > + * single and multi producer enqueue functions.
> > + */
> > +#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
> > +	if (esize == 4) \
> > +		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
> > +	else if (esize == 8) \
> > +		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
> > +	else if (esize == 16) \
> > +		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
> > +} while (0)
> > +
> > +#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
> > +	unsigned int i; \
> > +	const uint32_t size = (r)->size; \
> > +	uint32_t idx = prod_head & (r)->mask; \
> > +	uint32_t *ring = (uint32_t *)ring_start; \
> > +	uint32_t *obj = (uint32_t *)obj_table; \
> > +	if (likely(idx + n < size)) { \
> > +		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> > +			ring[idx] = obj[i]; \
> > +			ring[idx + 1] = obj[i + 1]; \
> > +			ring[idx + 2] = obj[i + 2]; \
> > +			ring[idx + 3] = obj[i + 3]; \
> > +			ring[idx + 4] = obj[i + 4]; \
> > +			ring[idx + 5] = obj[i + 5]; \
> > +			ring[idx + 6] = obj[i + 6]; \
> > +			ring[idx + 7] = obj[i + 7]; \
> > +		} \
> > +		switch (n & 0x7) { \
> > +		case 7: \
> > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > +		case 6: \
> > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > +		case 5: \
> > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > +		case 4: \
> > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > +		case 3: \
> > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > +		case 2: \
> > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > +		case 1: \
> > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > +		} \
> > +	} else { \
> > +		for (i = 0; idx < size; i++, idx++)\
> > +			ring[idx] = obj[i]; \
> > +		for (idx = 0; i < n; i++, idx++) \
> > +			ring[idx] = obj[i]; \
> > +	} \
> > +} while (0)
> > +
> > +#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
> > +	unsigned int i; \
> > +	const uint32_t size = (r)->size; \
> > +	uint32_t idx = prod_head & (r)->mask; \
> > +	uint64_t *ring = (uint64_t *)ring_start; \
> > +	uint64_t *obj = (uint64_t *)obj_table; \
> > +	if (likely(idx + n < size)) { \
> > +		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
> > +			ring[idx] = obj[i]; \
> > +			ring[idx + 1] = obj[i + 1]; \
> > +			ring[idx + 2] = obj[i + 2]; \
> > +			ring[idx + 3] = obj[i + 3]; \
> > +		} \
> > +		switch (n & 0x3) { \
> > +		case 3: \
> > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > +		case 2: \
> > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > +		case 1: \
> > +			ring[idx++] = obj[i++]; \
> > +		} \
> > +	} else { \
> > +		for (i = 0; idx < size; i++, idx++)\
> > +			ring[idx] = obj[i]; \
> > +		for (idx = 0; i < n; i++, idx++) \
> > +			ring[idx] = obj[i]; \
> > +	} \
> > +} while (0)
> > +
> > +#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
> > +	unsigned int i; \
> > +	const uint32_t size = (r)->size; \
> > +	uint32_t idx = prod_head & (r)->mask; \
> > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > +	if (likely(idx + n < size)) { \
> > +		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
> > +			ring[idx] = obj[i]; \
> > +			ring[idx + 1] = obj[i + 1]; \
> > +		} \
> > +		switch (n & 0x1) { \
> > +		case 1: \
> > +			ring[idx++] = obj[i++]; \
> > +		} \
> > +	} else { \
> > +		for (i = 0; idx < size; i++, idx++)\
> > +			ring[idx] = obj[i]; \
> > +		for (idx = 0; i < n; i++, idx++) \
> > +			ring[idx] = obj[i]; \
> > +	} \
> > +} while (0)
> > +
> > +/* the actual copy of pointers on the ring to obj_table.
> > + * Placed here since identical code needed in both
> > + * single and multi consumer dequeue functions.
> > + */
> > +#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
> > +	if (esize == 4) \
> > +		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
> > +	else if (esize == 8) \
> > +		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
> > +	else if (esize == 16) \
> > +		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
> > +} while (0)
> > +
> > +#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
> > +	unsigned int i; \
> > +	uint32_t idx = cons_head & (r)->mask; \
> > +	const uint32_t size = (r)->size; \
> > +	uint32_t *ring = (uint32_t *)ring_start; \
> > +	uint32_t *obj = (uint32_t *)obj_table; \
> > +	if (likely(idx + n < size)) { \
> > +		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8) {\
> > +			obj[i] = ring[idx]; \
> > +			obj[i + 1] = ring[idx + 1]; \
> > +			obj[i + 2] = ring[idx + 2]; \
> > +			obj[i + 3] = ring[idx + 3]; \
> > +			obj[i + 4] = ring[idx + 4]; \
> > +			obj[i + 5] = ring[idx + 5]; \
> > +			obj[i + 6] = ring[idx + 6]; \
> > +			obj[i + 7] = ring[idx + 7]; \
> > +		} \
> > +		switch (n & 0x7) { \
> > +		case 7: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		case 6: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		case 5: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		case 4: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		case 3: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		case 2: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		case 1: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		} \
> > +	} else { \
> > +		for (i = 0; idx < size; i++, idx++) \
> > +			obj[i] = ring[idx]; \
> > +		for (idx = 0; i < n; i++, idx++) \
> > +			obj[i] = ring[idx]; \
> > +	} \
> > +} while (0)
> > +
> > +#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
> > +	unsigned int i; \
> > +	uint32_t idx = cons_head & (r)->mask; \
> > +	const uint32_t size = (r)->size; \
> > +	uint64_t *ring = (uint64_t *)ring_start; \
> > +	uint64_t *obj = (uint64_t *)obj_table; \
> > +	if (likely(idx + n < size)) { \
> > +		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
> > +			obj[i] = ring[idx]; \
> > +			obj[i + 1] = ring[idx + 1]; \
> > +			obj[i + 2] = ring[idx + 2]; \
> > +			obj[i + 3] = ring[idx + 3]; \
> > +		} \
> > +		switch (n & 0x3) { \
> > +		case 3: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		case 2: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		case 1: \
> > +			obj[i++] = ring[idx++]; \
> > +		} \
> > +	} else { \
> > +		for (i = 0; idx < size; i++, idx++) \
> > +			obj[i] = ring[idx]; \
> > +		for (idx = 0; i < n; i++, idx++) \
> > +			obj[i] = ring[idx]; \
> > +	} \
> > +} while (0)
> > +
> > +#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
> > +	unsigned int i; \
> > +	uint32_t idx = cons_head & (r)->mask; \
> > +	const uint32_t size = (r)->size; \
> > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > +	if (likely(idx + n < size)) { \
> > +		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
> > +			obj[i] = ring[idx]; \
> > +			obj[i + 1] = ring[idx + 1]; \
> > +		} \
> > +		switch (n & 0x1) { \
> > +		case 1: \
> > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > +		} \
> > +	} else { \
> > +		for (i = 0; idx < size; i++, idx++) \
> > +			obj[i] = ring[idx]; \
> > +		for (idx = 0; i < n; i++, idx++) \
> > +			obj[i] = ring[idx]; \
> > +	} \
> > +} while (0)
> > +
> > +/* Between load and load. there might be cpu reorder in weak model
> > + * (powerpc/arm).
> > + * There are 2 choices for the users
> > + * 1.use rmb() memory barrier
> > + * 2.use one-direction load_acquire/store_release barrier,defined by
> > + * CONFIG_RTE_USE_C11_MEM_MODEL=y
> > + * It depends on performance test results.
> > + * By default, move common functions to rte_ring_generic.h
> > + */
> > +#ifdef RTE_USE_C11_MEM_MODEL
> > +#include "rte_ring_c11_mem.h"
> > +#else
> > +#include "rte_ring_generic.h"
> > +#endif
> > +
> > +/**
> > + * @internal Enqueue several objects on the ring
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param esize
> > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> > + *   as passed while creating the ring, otherwise the results are undefined.
> > + * @param n
> > + *   The number of objects to add in the ring from the obj_table.
> > + * @param behavior
> > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items from a ring
> > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
> > + * @param is_sp
> > + *   Indicates whether to use single producer or multi-producer head update
> > + * @param free_space
> > + *   returns the amount of space after the enqueue operation has finished
> > + * @return
> > + *   Actual number of objects enqueued.
> > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > + */
> > +static __rte_always_inline unsigned int
> > +__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
> > +		unsigned int esize, unsigned int n,
> > +		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > +		unsigned int *free_space)


I like the idea to add esize as an argument to the public API,
so the compiler can do its job optimizing calls with constant esize.
Though I am not very happy with the rest of the implementation:
1. It doesn't really provide configurable elem size - only 4/8/16B elems are supported.
2. A lot of code duplication with these 3 copies of ENQUEUE/DEQUEUE macros.
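
On the constant-esize point above, here is a minimal sketch of the idea.
The names toy_copy_words(), toy_copy_elem12() and struct toy_elem are made
up for illustration and are not part of the patch or of rte_ring; the only
assumption is that esize is a multiple of 4:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 12-byte element type, used only for this illustration. */
struct toy_elem {
	uint32_t a, b, c;
};

/* Generic copy: num elements of esize bytes each, as 32-bit words. */
static inline void
toy_copy_words(uint32_t *dst, const uint32_t *src, uint32_t num,
		uint32_t esize)
{
	uint32_t i, nw;

	assert(esize % sizeof(uint32_t) == 0);
	nw = num * (esize / (uint32_t)sizeof(uint32_t));
	for (i = 0; i < nw; i++)
		dst[i] = src[i];
}

/* Typed wrapper: esize is a compile-time constant at this call site. */
static inline void
toy_copy_elem12(struct toy_elem *dst, const struct toy_elem *src,
		uint32_t num)
{
	toy_copy_words((uint32_t *)dst, (const uint32_t *)src, num,
			sizeof(struct toy_elem));
}
```

Once the wrapper is inlined, esize is the constant sizeof(struct toy_elem),
so the compiler is free to fold the word count and unroll the loop.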

Looking at the ENQUEUE/DEQUEUE macros, I can see that the main loop always
does a 32B copy per iteration.
So I wonder whether we can make a generic function that does a 32B copy per
iteration in the main loop and copies the tail in 4B chunks?
That would avoid code duplication and would allow the user to have any elem
size (multiple of 4B) they want.
Something like this (note: I didn't test it, it is just a rough idea):

static inline void
copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num, uint32_t esize)
{
        uint32_t i, sz;

        /* total number of 32-bit words to copy */
        sz = (num * esize) / sizeof(uint32_t);

        /* main loop: 32B (8 words) per iteration; memcpy needs <string.h> */
        for (i = 0; i < (sz & ~7); i += 8)
                memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));

        /* tail: the remaining 0..7 words */
        switch (sz & 7) {
        case 7: du32[sz - 7] = su32[sz - 7]; /* fallthrough */
        case 6: du32[sz - 6] = su32[sz - 6]; /* fallthrough */
        case 5: du32[sz - 5] = su32[sz - 5]; /* fallthrough */
        case 4: du32[sz - 4] = su32[sz - 4]; /* fallthrough */
        case 3: du32[sz - 3] = su32[sz - 3]; /* fallthrough */
        case 2: du32[sz - 2] = su32[sz - 2]; /* fallthrough */
        case 1: du32[sz - 1] = su32[sz - 1];
        }
}

static inline void
enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
                void *obj_table, uint32_t num, uint32_t esize)
{
        uint32_t idx, n;
        uint32_t *ring = ring_start;
        const uint32_t *su32 = obj_table;

        const uint32_t size = r->size;
        /* number of 32-bit words per element */
        const uint32_t nw = esize / sizeof(uint32_t);

        idx = prod_head & (r)->mask;

        /* idx and n count elements, so scale them by the element size */
        if (idx + num < size)
                copy_elems(ring + idx * nw, su32, num, esize);
        else {
                n = size - idx;
                copy_elems(ring + idx * nw, su32, n, esize);
                copy_elems(ring, su32 + n * nw, num - n, esize);
        }
}

And then, in __rte_ring_do_enqueue_elem(), instead of ENQUEUE_PTRS_ELEM(), just:

enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);
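
To make the wrap-around handling concrete, here is a small self-contained
sketch of the same idea outside DPDK. struct toy_ring, toy_copy_elems() and
toy_enqueue_elems() are hypothetical names for illustration, not rte_ring
APIs. It assumes esize is a multiple of 4 and uses a plain word loop instead
of the unrolled 32B copy:

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WORDS 64 /* capacity of the backing store, in 32-bit words */

/* Hypothetical stand-in for the ring; not struct rte_ring. */
struct toy_ring {
	uint32_t size;             /* number of element slots, power of 2 */
	uint32_t mask;             /* size - 1 */
	uint32_t words[TOY_WORDS]; /* element storage viewed as 32-bit words */
};

/* Copy num elements of esize bytes each, one 32-bit word at a time. */
static void
toy_copy_elems(uint32_t *dst, const uint32_t *src, uint32_t num,
		uint32_t esize)
{
	uint32_t i;
	const uint32_t nw = num * (esize / (uint32_t)sizeof(uint32_t));

	for (i = 0; i < nw; i++)
		dst[i] = src[i];
}

/* Copy num elements into the ring starting at slot (prod_head & mask). */
static void
toy_enqueue_elems(struct toy_ring *r, uint32_t prod_head,
		const void *obj_table, uint32_t num, uint32_t esize)
{
	/* words per element: slot indexes are scaled by this value */
	const uint32_t wpe = esize / (uint32_t)sizeof(uint32_t);
	const uint32_t idx = prod_head & r->mask;
	const uint32_t *src = obj_table;
	uint32_t n;

	assert(esize % sizeof(uint32_t) == 0);
	assert(r->size * wpe <= TOY_WORDS);
	assert(num <= r->size);

	if (idx + num < r->size) {
		/* no wrap: one contiguous copy */
		toy_copy_elems(r->words + idx * wpe, src, num, esize);
	} else {
		/* wrap: fill up to the end, then continue from slot 0 */
		n = r->size - idx;
		toy_copy_elems(r->words + idx * wpe, src, n, esize);
		toy_copy_elems(r->words, src + n * wpe, num - n, esize);
	}
}
```

For example, with size 8, esize 12 and prod_head 6, enqueueing 3 elements
fills slots 6 and 7 and then wraps the third element around to slot 0.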

 
> > +{
> > +	uint32_t prod_head, prod_next;
> > +	uint32_t free_entries;
> > +
> > +	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
> > +			&prod_head, &prod_next, &free_entries);
> > +	if (n == 0)
> > +		goto end;
> > +
> > +	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
> > +
> > +	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> > +end:
> > +	if (free_space != NULL)
> > +		*free_space = free_entries - n;
> > +	return n;
> > +}
> > +

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-14 19:41         ` Ananyev, Konstantin
@ 2019-10-14 23:56           ` Honnappa Nagarahalli
  2019-10-15  9:34             ` Ananyev, Konstantin
  0 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-14 23:56 UTC (permalink / raw)
  To: Ananyev, Konstantin, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, Honnappa Nagarahalli, nd, nd

Hi Konstantin,
	Thank you for the feedback.

<snip>

> 
> > >
> > > Current APIs assume ring elements to be pointers. However, in many
> > > use cases, the size can be different. Add new APIs to support
> > > configurable ring element sizes.
> > >
> > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > ---
> > >  lib/librte_ring/Makefile             |   3 +-
> > >  lib/librte_ring/meson.build          |   3 +
> > >  lib/librte_ring/rte_ring.c           |  45 +-
> > >  lib/librte_ring/rte_ring.h           |   1 +
> > >  lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
> > >  lib/librte_ring/rte_ring_version.map |   2 +
> > >  6 files changed, 991 insertions(+), 9 deletions(-)
> > >  create mode 100644 lib/librte_ring/rte_ring_elem.h
> > >
> > > diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> > > index 21a36770d..515a967bb 100644
> > > --- a/lib/librte_ring/Makefile
> > > +++ b/lib/librte_ring/Makefile
> > > @@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
> > >  # library name
> > >  LIB = librte_ring.a
> > >
> > > -CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
> > > +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
> > >  LDLIBS += -lrte_eal
> > >
> > >  EXPORT_MAP := rte_ring_version.map
> > > @@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> > >
> > >  # install includes
> > >  SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> > > +					rte_ring_elem.h \
> > >  					rte_ring_generic.h \
> > >  					rte_ring_c11_mem.h
> > >
> > > diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
> > > index ab8b0b469..74219840a 100644
> > > --- a/lib/librte_ring/meson.build
> > > +++ b/lib/librte_ring/meson.build
> > > @@ -6,3 +6,6 @@ sources = files('rte_ring.c')
> > >  headers = files('rte_ring.h',
> > >  		'rte_ring_c11_mem.h',
> > >  		'rte_ring_generic.h')
> > > +
> > > +# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
> > > +allow_experimental_apis = true
> > > diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> > > index d9b308036..6fed3648b 100644
> > > --- a/lib/librte_ring/rte_ring.c
> > > +++ b/lib/librte_ring/rte_ring.c
> > > @@ -33,6 +33,7 @@
> > >  #include <rte_tailq.h>
> > >
> > >  #include "rte_ring.h"
> > > +#include "rte_ring_elem.h"
> > >
> > >  TAILQ_HEAD(rte_ring_list, rte_tailq_entry);
> > >
> > > @@ -46,23 +47,42 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
> > >
> > >  /* return the size of memory occupied by a ring */
> > >  ssize_t
> > > -rte_ring_get_memsize(unsigned count)
> > > +rte_ring_get_memsize_elem(unsigned count, unsigned esize)
> > >  {
> > >  	ssize_t sz;
> > >
> > > +	/* Supported esize values are 4/8/16.
> > > +	 * Others can be added on need basis.
> > > +	 */
> > > +	if ((esize != 4) && (esize != 8) && (esize != 16)) {
> > > +		RTE_LOG(ERR, RING,
> > > +			"Unsupported esize value. Supported values are 4, 8 and 16\n");
> > > +
> > > +		return -EINVAL;
> > > +	}
> > > +
> > >  	/* count must be a power of 2 */
> > >  	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
> > >  		RTE_LOG(ERR, RING,
> > > -			"Requested size is invalid, must be power of 2, and "
> > > -			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
> > > +			"Requested number of elements is invalid, must be "
> > > +			"power of 2, and do not exceed the limit %u\n",
> > > +			RTE_RING_SZ_MASK);
> > > +
> > >  		return -EINVAL;
> > >  	}
> > >
> > > -	sz = sizeof(struct rte_ring) + count * sizeof(void *);
> > > +	sz = sizeof(struct rte_ring) + count * esize;
> > >  	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
> > >  	return sz;
> > >  }
> > >
> > > +/* return the size of memory occupied by a ring */
> > > +ssize_t
> > > +rte_ring_get_memsize(unsigned count)
> > > +{
> > > +	return rte_ring_get_memsize_elem(count, sizeof(void *));
> > > +}
> > > +
> > >  void
> > >  rte_ring_reset(struct rte_ring *r)
> > >  {
> > > @@ -114,10 +134,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> > >  	return 0;
> > >  }
> > >
> > > -/* create the ring */
> > > +/* create the ring for a given element size */
> > >  struct rte_ring *
> > > -rte_ring_create(const char *name, unsigned count, int socket_id,
> > > -		unsigned flags)
> > > +rte_ring_create_elem(const char *name, unsigned count, unsigned esize,
> > > +		int socket_id, unsigned flags)
> > >  {
> > >  	char mz_name[RTE_MEMZONE_NAMESIZE];
> > >  	struct rte_ring *r;
> > > @@ -135,7 +155,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
> > >  	if (flags & RING_F_EXACT_SZ)
> > >  		count = rte_align32pow2(count + 1);
> > >
> > > -	ring_size = rte_ring_get_memsize(count);
> > > +	ring_size = rte_ring_get_memsize_elem(count, esize);
> > >  	if (ring_size < 0) {
> > >  		rte_errno = ring_size;
> > >  		return NULL;
> > > @@ -182,6 +202,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
> > >  	return r;
> > >  }
> > >
> > > +/* create the ring */
> > > +struct rte_ring *
> > > +rte_ring_create(const char *name, unsigned count, int socket_id,
> > > +		unsigned flags)
> > > +{
> > > +	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
> > > +		flags);
> > > +}
> > > +
> > >  /* free the ring */
> > >  void
> > >  rte_ring_free(struct rte_ring *r)
> > > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > > index 2a9f768a1..18fc5d845 100644
> > > --- a/lib/librte_ring/rte_ring.h
> > > +++ b/lib/librte_ring/rte_ring.h
> > > @@ -216,6 +216,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> > >   */
> > >  struct rte_ring *rte_ring_create(const char *name, unsigned count,
> > >  				 int socket_id, unsigned flags);
> > > +
> > >  /**
> > >   * De-allocate all memory used by the ring.
> > >   *
> > > diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> > > new file mode 100644
> > > index 000000000..860f059ad
> > > --- /dev/null
> > > +++ b/lib/librte_ring/rte_ring_elem.h
> > > @@ -0,0 +1,946 @@
> > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > + *
> > > + * Copyright (c) 2019 Arm Limited
> > > + * Copyright (c) 2010-2017 Intel Corporation
> > > + * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
> > > + * All rights reserved.
> > > + * Derived from FreeBSD's bufring.h
> > > + * Used as BSD-3 Licensed with permission from Kip Macy.
> > > + */
> > > +
> > > +#ifndef _RTE_RING_ELEM_H_
> > > +#define _RTE_RING_ELEM_H_
> > > +
> > > +/**
> > > + * @file
> > > + * RTE Ring with flexible element size
> > > + */
> > > +
> > > +#ifdef __cplusplus
> > > +extern "C" {
> > > +#endif
> > > +
> > > +#include <stdio.h>
> > > +#include <stdint.h>
> > > +#include <sys/queue.h>
> > > +#include <errno.h>
> > > +#include <rte_common.h>
> > > +#include <rte_config.h>
> > > +#include <rte_memory.h>
> > > +#include <rte_lcore.h>
> > > +#include <rte_atomic.h>
> > > +#include <rte_branch_prediction.h>
> > > +#include <rte_memzone.h>
> > > +#include <rte_pause.h>
> > > +
> > > +#include "rte_ring.h"
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Calculate the memory size needed for a ring with given element size
> > > + *
> > > + * This function returns the number of bytes needed for a ring, given
> > > + * the number of elements in it and the size of the element. This value
> > > + * is the sum of the size of the structure rte_ring and the size of the
> > > + * memory needed for storing the elements. The value is aligned to a cache
> > > + * line size.
> > > + *
> > > + * @param count
> > > + *   The number of elements in the ring (must be a power of 2).
> > > + * @param esize
> > > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > > + *   Currently, sizes 4, 8 and 16 are supported.
> > > + * @return
> > > + *   - The memory size needed for the ring on success.
> > > + *   - -EINVAL if count is not a power of 2.
> > > + */
> > > +__rte_experimental
> > > +ssize_t rte_ring_get_memsize_elem(unsigned count, unsigned esize);
> > > +
> > > +/**
> > > + * @warning
> > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > + *
> > > + * Create a new ring named *name* that stores elements with given size.
> > > + *
> > > + * This function uses ``memzone_reserve()`` to allocate memory. Then it
> > > + * calls rte_ring_init() to initialize an empty ring.
> > > + *
> > > + * The new ring size is set to *count*, which must be a power of
> > > + * two. Water marking is disabled by default. The real usable ring size
> > > + * is *count-1* instead of *count* to differentiate a free ring from an
> > > + * empty ring.
> > > + *
> > > + * The ring is added in RTE_TAILQ_RING list.
> > > + *
> > > + * @param name
> > > + *   The name of the ring.
> > > + * @param count
> > > + *   The number of elements in the ring (must be a power of 2).
> > > + * @param esize
> > > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > > + *   Currently, sizes 4, 8 and 16 are supported.
> > > + * @param socket_id
> > > + *   The *socket_id* argument is the socket identifier in case of
> > > + *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
> > > + *   constraint for the reserved zone.
> > > + * @param flags
> > > + *   An OR of the following:
> > > + *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
> > > + *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
> > > + *      is "single-producer". Otherwise, it is "multi-producers".
> > > + *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
> > > + *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
> > > + *      is "single-consumer". Otherwise, it is "multi-consumers".
> > > + * @return
> > > + *   On success, the pointer to the new allocated ring. NULL on error with
> > > + *    rte_errno set appropriately. Possible errno values include:
> > > + *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
> > > + *    - E_RTE_SECONDARY - function was called from a secondary process instance
> > > + *    - EINVAL - count provided is not a power of 2
> > > + *    - ENOSPC - the maximum number of memzones has already been allocated
> > > + *    - EEXIST - a memzone with the same name already exists
> > > + *    - ENOMEM - no appropriate memory area found in which to create memzone
> > > + */
> > > +__rte_experimental
> > > +struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
> > > +				unsigned esize, int socket_id, unsigned flags);
> > > +
> > > +/* the actual enqueue of pointers on the ring.
> > > + * Placed here since identical code needed in both
> > > + * single and multi producer enqueue functions.
> > > + */
> > > +#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
> > > +	if (esize == 4) \
> > > +		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
> > > +	else if (esize == 8) \
> > > +		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
> > > +	else if (esize == 16) \
> > > +		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
> > > +} while (0)
> > > +
> > > +#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
> > > +	unsigned int i; \
> > > +	const uint32_t size = (r)->size; \
> > > +	uint32_t idx = prod_head & (r)->mask; \
> > > +	uint32_t *ring = (uint32_t *)ring_start; \
> > > +	uint32_t *obj = (uint32_t *)obj_table; \
> > > +	if (likely(idx + n < size)) { \
> > > +		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> > > +			ring[idx] = obj[i]; \
> > > +			ring[idx + 1] = obj[i + 1]; \
> > > +			ring[idx + 2] = obj[i + 2]; \
> > > +			ring[idx + 3] = obj[i + 3]; \
> > > +			ring[idx + 4] = obj[i + 4]; \
> > > +			ring[idx + 5] = obj[i + 5]; \
> > > +			ring[idx + 6] = obj[i + 6]; \
> > > +			ring[idx + 7] = obj[i + 7]; \
> > > +		} \
> > > +		switch (n & 0x7) { \
> > > +		case 7: \
> > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > +		case 6: \
> > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > +		case 5: \
> > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > +		case 4: \
> > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > +		case 3: \
> > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > +		case 2: \
> > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > +		case 1: \
> > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > +		} \
> > > +	} else { \
> > > +		for (i = 0; idx < size; i++, idx++)\
> > > +			ring[idx] = obj[i]; \
> > > +		for (idx = 0; i < n; i++, idx++) \
> > > +			ring[idx] = obj[i]; \
> > > +	} \
> > > +} while (0)
> > > +
> > > +#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
> > > +	unsigned int i; \
> > > +	const uint32_t size = (r)->size; \
> > > +	uint32_t idx = prod_head & (r)->mask; \
> > > +	uint64_t *ring = (uint64_t *)ring_start; \
> > > +	uint64_t *obj = (uint64_t *)obj_table; \
> > > +	if (likely(idx + n < size)) { \
> > > +		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
> > > +			ring[idx] = obj[i]; \
> > > +			ring[idx + 1] = obj[i + 1]; \
> > > +			ring[idx + 2] = obj[i + 2]; \
> > > +			ring[idx + 3] = obj[i + 3]; \
> > > +		} \
> > > +		switch (n & 0x3) { \
> > > +		case 3: \
> > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > +		case 2: \
> > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > +		case 1: \
> > > +			ring[idx++] = obj[i++]; \
> > > +		} \
> > > +	} else { \
> > > +		for (i = 0; idx < size; i++, idx++)\
> > > +			ring[idx] = obj[i]; \
> > > +		for (idx = 0; i < n; i++, idx++) \
> > > +			ring[idx] = obj[i]; \
> > > +	} \
> > > +} while (0)
> > > +
> > > +#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
> > > +	unsigned int i; \
> > > +	const uint32_t size = (r)->size; \
> > > +	uint32_t idx = prod_head & (r)->mask; \
> > > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > > +	if (likely(idx + n < size)) { \
> > > +		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
> > > +			ring[idx] = obj[i]; \
> > > +			ring[idx + 1] = obj[i + 1]; \
> > > +		} \
> > > +		switch (n & 0x1) { \
> > > +		case 1: \
> > > +			ring[idx++] = obj[i++]; \
> > > +		} \
> > > +	} else { \
> > > +		for (i = 0; idx < size; i++, idx++)\
> > > +			ring[idx] = obj[i]; \
> > > +		for (idx = 0; i < n; i++, idx++) \
> > > +			ring[idx] = obj[i]; \
> > > +	} \
> > > +} while (0)
> > > +
> > > +/* the actual copy of pointers on the ring to obj_table.
> > > + * Placed here since identical code needed in both
> > > + * single and multi consumer dequeue functions.
> > > + */
> > > +#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
> > > +	if (esize == 4) \
> > > +		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
> > > +	else if (esize == 8) \
> > > +		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
> > > +	else if (esize == 16) \
> > > +		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
> > > +} while (0)
> > > +
> > > +#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
> > > +	unsigned int i; \
> > > +	uint32_t idx = cons_head & (r)->mask; \
> > > +	const uint32_t size = (r)->size; \
> > > +	uint32_t *ring = (uint32_t *)ring_start; \
> > > +	uint32_t *obj = (uint32_t *)obj_table; \
> > > +	if (likely(idx + n < size)) { \
> > > +		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8) {\
> > > +			obj[i] = ring[idx]; \
> > > +			obj[i + 1] = ring[idx + 1]; \
> > > +			obj[i + 2] = ring[idx + 2]; \
> > > +			obj[i + 3] = ring[idx + 3]; \
> > > +			obj[i + 4] = ring[idx + 4]; \
> > > +			obj[i + 5] = ring[idx + 5]; \
> > > +			obj[i + 6] = ring[idx + 6]; \
> > > +			obj[i + 7] = ring[idx + 7]; \
> > > +		} \
> > > +		switch (n & 0x7) { \
> > > +		case 7: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		case 6: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		case 5: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		case 4: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		case 3: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		case 2: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		case 1: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		} \
> > > +	} else { \
> > > +		for (i = 0; idx < size; i++, idx++) \
> > > +			obj[i] = ring[idx]; \
> > > +		for (idx = 0; i < n; i++, idx++) \
> > > +			obj[i] = ring[idx]; \
> > > +	} \
> > > +} while (0)
> > > +
> > > +#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
> > > +	unsigned int i; \
> > > +	uint32_t idx = cons_head & (r)->mask; \
> > > +	const uint32_t size = (r)->size; \
> > > +	uint64_t *ring = (uint64_t *)ring_start; \
> > > +	uint64_t *obj = (uint64_t *)obj_table; \
> > > +	if (likely(idx + n < size)) { \
> > > +		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
> > > +			obj[i] = ring[idx]; \
> > > +			obj[i + 1] = ring[idx + 1]; \
> > > +			obj[i + 2] = ring[idx + 2]; \
> > > +			obj[i + 3] = ring[idx + 3]; \
> > > +		} \
> > > +		switch (n & 0x3) { \
> > > +		case 3: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		case 2: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		case 1: \
> > > +			obj[i++] = ring[idx++]; \
> > > +		} \
> > > +	} else { \
> > > +		for (i = 0; idx < size; i++, idx++) \
> > > +			obj[i] = ring[idx]; \
> > > +		for (idx = 0; i < n; i++, idx++) \
> > > +			obj[i] = ring[idx]; \
> > > +	} \
> > > +} while (0)
> > > +
> > > +#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
> > > +	unsigned int i; \
> > > +	uint32_t idx = cons_head & (r)->mask; \
> > > +	const uint32_t size = (r)->size; \
> > > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > > +	if (likely(idx + n < size)) { \
> > > +		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
> > > +			obj[i] = ring[idx]; \
> > > +			obj[i + 1] = ring[idx + 1]; \
> > > +		} \
> > > +		switch (n & 0x1) { \
> > > +		case 1: \
> > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > +		} \
> > > +	} else { \
> > > +		for (i = 0; idx < size; i++, idx++) \
> > > +			obj[i] = ring[idx]; \
> > > +		for (idx = 0; i < n; i++, idx++) \
> > > +			obj[i] = ring[idx]; \
> > > +	} \
> > > +} while (0)
> > > +
> > > +/* Between load and load. there might be cpu reorder in weak model
> > > + * (powerpc/arm).
> > > + * There are 2 choices for the users
> > > + * 1.use rmb() memory barrier
> > > + * 2.use one-direction load_acquire/store_release barrier,defined by
> > > + * CONFIG_RTE_USE_C11_MEM_MODEL=y
> > > + * It depends on performance test results.
> > > + * By default, move common functions to rte_ring_generic.h
> > > + */
> > > +#ifdef RTE_USE_C11_MEM_MODEL
> > > +#include "rte_ring_c11_mem.h"
> > > +#else
> > > +#include "rte_ring_generic.h"
> > > +#endif
> > > +
> > > +/**
> > > + * @internal Enqueue several objects on the ring
> > > + *
> > > + * @param r
> > > + *   A pointer to the ring structure.
> > > + * @param obj_table
> > > + *   A pointer to a table of void * pointers (objects).
> > > + * @param esize
> > > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > > + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> > > + *   as passed while creating the ring, otherwise the results are undefined.
> > > + * @param n
> > > + *   The number of objects to add in the ring from the obj_table.
> > > + * @param behavior
> > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items from a ring
> > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
> > > + * @param is_sp
> > > + *   Indicates whether to use single producer or multi-producer head update
> > > + * @param free_space
> > > + *   returns the amount of space after the enqueue operation has finished
> > > + * @return
> > > + *   Actual number of objects enqueued.
> > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > + */
> > > +static __rte_always_inline unsigned int
> > > +__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
> > > +		unsigned int esize, unsigned int n,
> > > +		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > > +		unsigned int *free_space)
> 
> 
> I like the idea to add esize as an argument to the public API, so the compiler
> can do its job optimizing calls with constant esize.
> Though I am not very happy with the rest of implementation:
> 1. It doesn't really provide configurable elem size - only 4/8/16B elems are
> supported.
Agreed. I was thinking other sizes could be added as needed.
However, I am wondering if we should just provide 4B and let the users use bulk operations to construct whatever they need. It would mean extra work for the users.

> 2. A lot of code duplication with these 3 copies of ENQUEUE/DEQUEUE
> macros.
> 
> Looking at ENQUEUE/DEQUEUE macros, I can see that main loop always does
> 32B copy per iteration.
Yes, I tried to keep it the same as the existing one (originally, I guess the intention was to allow 256b vector instructions to be generated).

> So wonder can we make a generic function that would do 32B copy per
> iteration in a main loop, and copy tail  by 4B chunks?
> That would avoid copy duplication and will allow user to have any elem size
> (multiple of 4B) he wants.
> Something like that (note didn't test it, just a rough idea):
> 
>  static inline void
> copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num, uint32_t
> esize) {
>         uint32_t i, sz;
> 
>         sz = (num * esize) / sizeof(uint32_t);
If 'num' is a compile time constant, 'sz' will be a compile time constant. Otherwise, this will result in a multiplication operation. I have tried to avoid the multiplication operation and to use shift and mask operations instead (just like the rest of the ring code does).

> 
>         for (i = 0; i < (sz & ~7); i += 8)
>                 memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
I had used memcpy to start with (for the entire copy operation), but the performance for 64b elements is not the same as with the existing ring APIs (better in some cases and worse in others).

IMO, we have to keep the performance of the 64b and 128b the same as what we get with the existing ring and event-ring APIs. That would allow us to replace them with these new APIs. I suggest that we keep the macros in this patch for 64b and 128b.

For the rest of the sizes, we could put a for loop around the 32b macro (this would allow for all sizes as well).

> 
>         switch (sz & 7) {
>         case 7: du32[sz - 7] = su32[sz - 7]; /* fallthrough */
>         case 6: du32[sz - 6] = su32[sz - 6]; /* fallthrough */
>         case 5: du32[sz - 5] = su32[sz - 5]; /* fallthrough */
>         case 4: du32[sz - 4] = su32[sz - 4]; /* fallthrough */
>         case 3: du32[sz - 3] = su32[sz - 3]; /* fallthrough */
>         case 2: du32[sz - 2] = su32[sz - 2]; /* fallthrough */
>         case 1: du32[sz - 1] = su32[sz - 1]; /* fallthrough */
>         }
> }
> 
> static inline void
> enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
>                 void *obj_table, uint32_t num, uint32_t esize) {
>         uint32_t idx, n;
>         uint32_t *du32;
> 
>         const uint32_t size = r->size;
> 
>         idx = prod_head & (r)->mask;
> 
>         du32 = ring_start + idx * sizeof(uint32_t);
> 
>         if (idx + num < size)
>                 copy_elems(du32, obj_table, num, esize);
>         else {
>                 n = size - idx;
>                 copy_elems(du32, obj_table, n, esize);
>                 copy_elems(ring_start, obj_table + n * sizeof(uint32_t),
>                         num - n, esize);
>         }
> }
> 
> And then, in that function, instead of ENQUEUE_PTRS_ELEM(), just:
> 
> enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);
> 
> 
> > > +{
> > > +	uint32_t prod_head, prod_next;
> > > +	uint32_t free_entries;
> > > +
> > > +	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
> > > +			&prod_head, &prod_next, &free_entries);
> > > +	if (n == 0)
> > > +		goto end;
> > > +
> > > +	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
> > > +
> > > +	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> > > +end:
> > > +	if (free_space != NULL)
> > > +		*free_space = free_entries - n;
> > > +	return n;
> > > +}
> > > +

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-14 23:56           ` Honnappa Nagarahalli
@ 2019-10-15  9:34             ` Ananyev, Konstantin
  2019-10-17  4:46               ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-15  9:34 UTC (permalink / raw)
  To: Honnappa Nagarahalli, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd, nd


Hi Honnappa,
 
> > > >
> > > > Current APIs assume ring elements to be pointers. However, in many
> > > > use cases, the size can be different. Add new APIs to support
> > > > configurable ring element sizes.
> > > >
> > > > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > > > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > ---
> > > >  lib/librte_ring/Makefile             |   3 +-
> > > >  lib/librte_ring/meson.build          |   3 +
> > > >  lib/librte_ring/rte_ring.c           |  45 +-
> > > >  lib/librte_ring/rte_ring.h           |   1 +
> > > >  lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
> > > >  lib/librte_ring/rte_ring_version.map |   2 +
> > > >  6 files changed, 991 insertions(+), 9 deletions(-)
> > > >  create mode 100644 lib/librte_ring/rte_ring_elem.h
> > > >
> > > > diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> > > > index 21a36770d..515a967bb 100644
> > > > --- a/lib/librte_ring/Makefile
> > > > +++ b/lib/librte_ring/Makefile
> > > > @@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
> > > >  # library name
> > > >  LIB = librte_ring.a
> > > >
> > > > -CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
> > > > +CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
> > > >  LDLIBS += -lrte_eal
> > > >
> > > >  EXPORT_MAP := rte_ring_version.map
> > > > @@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
> > > >
> > > >  # install includes
> > > >  SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
> > > > +					rte_ring_elem.h \
> > > >  					rte_ring_generic.h \
> > > >  					rte_ring_c11_mem.h
> > > >
> > > > diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
> > > > index ab8b0b469..74219840a 100644
> > > > --- a/lib/librte_ring/meson.build
> > > > +++ b/lib/librte_ring/meson.build
> > > > @@ -6,3 +6,6 @@ sources = files('rte_ring.c')
> > > >  headers = files('rte_ring.h',
> > > >  		'rte_ring_c11_mem.h',
> > > >  		'rte_ring_generic.h')
> > > > +
> > > > +# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
> > > > +allow_experimental_apis = true
> > > > diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
> > > > index d9b308036..6fed3648b 100644
> > > > --- a/lib/librte_ring/rte_ring.c
> > > > +++ b/lib/librte_ring/rte_ring.c
> > > > @@ -33,6 +33,7 @@
> > > >  #include <rte_tailq.h>
> > > >
> > > >  #include "rte_ring.h"
> > > > +#include "rte_ring_elem.h"
> > > >
> > > >  TAILQ_HEAD(rte_ring_list, rte_tailq_entry);
> > > >
> > > > @@ -46,23 +47,42 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
> > > >
> > > >  /* return the size of memory occupied by a ring */
> > > >  ssize_t
> > > > -rte_ring_get_memsize(unsigned count)
> > > > +rte_ring_get_memsize_elem(unsigned count, unsigned esize)
> > > >  {
> > > >  	ssize_t sz;
> > > >
> > > > +	/* Supported esize values are 4/8/16.
> > > > +	 * Others can be added on need basis.
> > > > +	 */
> > > > +	if ((esize != 4) && (esize != 8) && (esize != 16)) {
> > > > +		RTE_LOG(ERR, RING,
> > > > +			"Unsupported esize value. Supported values are 4, 8 and 16\n");
> > > > +
> > > > +		return -EINVAL;
> > > > +	}
> > > > +
> > > >  	/* count must be a power of 2 */
> > > >  	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
> > > >  		RTE_LOG(ERR, RING,
> > > > -			"Requested size is invalid, must be power of 2, and "
> > > > -			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
> > > > +			"Requested number of elements is invalid, must be "
> > > > +			"power of 2, and do not exceed the limit %u\n",
> > > > +			RTE_RING_SZ_MASK);
> > > > +
> > > >  		return -EINVAL;
> > > >  	}
> > > >
> > > > -	sz = sizeof(struct rte_ring) + count * sizeof(void *);
> > > > +	sz = sizeof(struct rte_ring) + count * esize;
> > > >  	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
> > > >  	return sz;
> > > >  }
> > > >
> > > > +/* return the size of memory occupied by a ring */
> > > > +ssize_t
> > > > +rte_ring_get_memsize(unsigned count)
> > > > +{
> > > > +	return rte_ring_get_memsize_elem(count, sizeof(void *));
> > > > +}
> > > > +
> > > >  void
> > > >  rte_ring_reset(struct rte_ring *r)
> > > >  {
> > > > @@ -114,10 +134,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> > > >  	return 0;
> > > >  }
> > > >
> > > > -/* create the ring */
> > > > +/* create the ring for a given element size */
> > > >  struct rte_ring *
> > > > -rte_ring_create(const char *name, unsigned count, int socket_id,
> > > > -		unsigned flags)
> > > > +rte_ring_create_elem(const char *name, unsigned count, unsigned esize,
> > > > +		int socket_id, unsigned flags)
> > > >  {
> > > >  	char mz_name[RTE_MEMZONE_NAMESIZE];
> > > >  	struct rte_ring *r;
> > > > @@ -135,7 +155,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
> > > >  	if (flags & RING_F_EXACT_SZ)
> > > >  		count = rte_align32pow2(count + 1);
> > > >
> > > > -	ring_size = rte_ring_get_memsize(count);
> > > > +	ring_size = rte_ring_get_memsize_elem(count, esize);
> > > >  	if (ring_size < 0) {
> > > >  		rte_errno = ring_size;
> > > >  		return NULL;
> > > > @@ -182,6 +202,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
> > > >  	return r;
> > > >  }
> > > >
> > > > +/* create the ring */
> > > > +struct rte_ring *
> > > > +rte_ring_create(const char *name, unsigned count, int socket_id,
> > > > +		unsigned flags)
> > > > +{
> > > > +	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
> > > > +		flags);
> > > > +}
> > > > +
> > > >  /* free the ring */
> > > >  void
> > > >  rte_ring_free(struct rte_ring *r)
> > > > diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
> > > > index 2a9f768a1..18fc5d845 100644
> > > > --- a/lib/librte_ring/rte_ring.h
> > > > +++ b/lib/librte_ring/rte_ring.h
> > > > @@ -216,6 +216,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
> > > >   */
> > > >  struct rte_ring *rte_ring_create(const char *name, unsigned count,
> > > >  				 int socket_id, unsigned flags);
> > > > +
> > > >  /**
> > > >   * De-allocate all memory used by the ring.
> > > >   *
> > > > diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> > > > new file mode 100644
> > > > index 000000000..860f059ad
> > > > --- /dev/null
> > > > +++ b/lib/librte_ring/rte_ring_elem.h
> > > > @@ -0,0 +1,946 @@
> > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > + *
> > > > + * Copyright (c) 2019 Arm Limited
> > > > + * Copyright (c) 2010-2017 Intel Corporation
> > > > + * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
> > > > + * All rights reserved.
> > > > + * Derived from FreeBSD's bufring.h
> > > > + * Used as BSD-3 Licensed with permission from Kip Macy.
> > > > + */
> > > > +
> > > > +#ifndef _RTE_RING_ELEM_H_
> > > > +#define _RTE_RING_ELEM_H_
> > > > +
> > > > +/**
> > > > + * @file
> > > > + * RTE Ring with flexible element size
> > > > + */
> > > > +
> > > > +#ifdef __cplusplus
> > > > +extern "C" {
> > > > +#endif
> > > > +
> > > > +#include <stdio.h>
> > > > +#include <stdint.h>
> > > > +#include <sys/queue.h>
> > > > +#include <errno.h>
> > > > +#include <rte_common.h>
> > > > +#include <rte_config.h>
> > > > +#include <rte_memory.h>
> > > > +#include <rte_lcore.h>
> > > > +#include <rte_atomic.h>
> > > > +#include <rte_branch_prediction.h>
> > > > +#include <rte_memzone.h>
> > > > +#include <rte_pause.h>
> > > > +
> > > > +#include "rte_ring.h"
> > > > +
> > > > +/**
> > > > + * @warning
> > > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > > + *
> > > > + * Calculate the memory size needed for a ring with given element size
> > > > + *
> > > > + * This function returns the number of bytes needed for a ring, given
> > > > + * the number of elements in it and the size of the element. This value
> > > > + * is the sum of the size of the structure rte_ring and the size of the
> > > > + * memory needed for storing the elements. The value is aligned to a cache
> > > > + * line size.
> > > > + *
> > > > + * @param count
> > > > + *   The number of elements in the ring (must be a power of 2).
> > > > + * @param esize
> > > > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > > > + *   Currently, sizes 4, 8 and 16 are supported.
> > > > + * @return
> > > > + *   - The memory size needed for the ring on success.
> > > > + *   - -EINVAL if count is not a power of 2.
> > > > + */
> > > > +__rte_experimental
> > > > +ssize_t rte_ring_get_memsize_elem(unsigned count, unsigned esize);
> > > > +
> > > > +/**
> > > > + * @warning
> > > > + * @b EXPERIMENTAL: this API may change without prior notice
> > > > + *
> > > > + * Create a new ring named *name* that stores elements with given size.
> > > > + *
> > > > + * This function uses ``memzone_reserve()`` to allocate memory. Then it
> > > > + * calls rte_ring_init() to initialize an empty ring.
> > > > + *
> > > > + * The new ring size is set to *count*, which must be a power of
> > > > + * two. Water marking is disabled by default. The real usable ring size
> > > > + * is *count-1* instead of *count* to differentiate a free ring from an
> > > > + * empty ring.
> > > > + *
> > > > + * The ring is added in RTE_TAILQ_RING list.
> > > > + *
> > > > + * @param name
> > > > + *   The name of the ring.
> > > > + * @param count
> > > > + *   The number of elements in the ring (must be a power of 2).
> > > > + * @param esize
> > > > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > > > + *   Currently, sizes 4, 8 and 16 are supported.
> > > > + * @param socket_id
> > > > + *   The *socket_id* argument is the socket identifier in case of
> > > > + *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
> > > > + *   constraint for the reserved zone.
> > > > + * @param flags
> > > > + *   An OR of the following:
> > > > + *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
> > > > + *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
> > > > + *      is "single-producer". Otherwise, it is "multi-producers".
> > > > + *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
> > > > + *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
> > > > + *      is "single-consumer". Otherwise, it is "multi-consumers".
> > > > + * @return
> > > > + *   On success, the pointer to the new allocated ring. NULL on error with
> > > > + *    rte_errno set appropriately. Possible errno values include:
> > > > + *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
> > > > + *    - E_RTE_SECONDARY - function was called from a secondary process instance
> > > > + *    - EINVAL - count provided is not a power of 2
> > > > + *    - ENOSPC - the maximum number of memzones has already been allocated
> > > > + *    - EEXIST - a memzone with the same name already exists
> > > > + *    - ENOMEM - no appropriate memory area found in which to create memzone
> > > > + */
> > > > +__rte_experimental
> > > > +struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
> > > > +				unsigned esize, int socket_id, unsigned flags);
> > > > +
> > > > +/* the actual enqueue of pointers on the ring.
> > > > + * Placed here since identical code needed in both
> > > > + * single and multi producer enqueue functions.
> > > > + */
> > > > +#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
> > > > +	if (esize == 4) \
> > > > +		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
> > > > +	else if (esize == 8) \
> > > > +		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
> > > > +	else if (esize == 16) \
> > > > +		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
> > > > +} while (0)
> > > > +
> > > > +#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
> > > > +	unsigned int i; \
> > > > +	const uint32_t size = (r)->size; \
> > > > +	uint32_t idx = prod_head & (r)->mask; \
> > > > +	uint32_t *ring = (uint32_t *)ring_start; \
> > > > +	uint32_t *obj = (uint32_t *)obj_table; \
> > > > +	if (likely(idx + n < size)) { \
> > > > +		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> > > > +			ring[idx] = obj[i]; \
> > > > +			ring[idx + 1] = obj[i + 1]; \
> > > > +			ring[idx + 2] = obj[i + 2]; \
> > > > +			ring[idx + 3] = obj[i + 3]; \
> > > > +			ring[idx + 4] = obj[i + 4]; \
> > > > +			ring[idx + 5] = obj[i + 5]; \
> > > > +			ring[idx + 6] = obj[i + 6]; \
> > > > +			ring[idx + 7] = obj[i + 7]; \
> > > > +		} \
> > > > +		switch (n & 0x7) { \
> > > > +		case 7: \
> > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > +		case 6: \
> > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > +		case 5: \
> > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > +		case 4: \
> > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > +		case 3: \
> > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > +		case 2: \
> > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > +		case 1: \
> > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > +		} \
> > > > +	} else { \
> > > > +		for (i = 0; idx < size; i++, idx++)\
> > > > +			ring[idx] = obj[i]; \
> > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > +			ring[idx] = obj[i]; \
> > > > +	} \
> > > > +} while (0)
> > > > +
> > > > +#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
> > > > +	unsigned int i; \
> > > > +	const uint32_t size = (r)->size; \
> > > > +	uint32_t idx = prod_head & (r)->mask; \
> > > > +	uint64_t *ring = (uint64_t *)ring_start; \
> > > > +	uint64_t *obj = (uint64_t *)obj_table; \
> > > > +	if (likely(idx + n < size)) { \
> > > > +		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
> > > > +			ring[idx] = obj[i]; \
> > > > +			ring[idx + 1] = obj[i + 1]; \
> > > > +			ring[idx + 2] = obj[i + 2]; \
> > > > +			ring[idx + 3] = obj[i + 3]; \
> > > > +		} \
> > > > +		switch (n & 0x3) { \
> > > > +		case 3: \
> > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > +		case 2: \
> > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > +		case 1: \
> > > > +			ring[idx++] = obj[i++]; \
> > > > +		} \
> > > > +	} else { \
> > > > +		for (i = 0; idx < size; i++, idx++)\
> > > > +			ring[idx] = obj[i]; \
> > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > +			ring[idx] = obj[i]; \
> > > > +	} \
> > > > +} while (0)
> > > > +
> > > > +#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
> > > > +	unsigned int i; \
> > > > +	const uint32_t size = (r)->size; \
> > > > +	uint32_t idx = prod_head & (r)->mask; \
> > > > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > > > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > > > +	if (likely(idx + n < size)) { \
> > > > +		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
> > > > +			ring[idx] = obj[i]; \
> > > > +			ring[idx + 1] = obj[i + 1]; \
> > > > +		} \
> > > > +		switch (n & 0x1) { \
> > > > +		case 1: \
> > > > +			ring[idx++] = obj[i++]; \
> > > > +		} \
> > > > +	} else { \
> > > > +		for (i = 0; idx < size; i++, idx++)\
> > > > +			ring[idx] = obj[i]; \
> > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > +			ring[idx] = obj[i]; \
> > > > +	} \
> > > > +} while (0)
> > > > +
> > > > +/* the actual copy of pointers on the ring to obj_table.
> > > > + * Placed here since identical code needed in both
> > > > + * single and multi consumer dequeue functions.
> > > > + */
> > > > +#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
> > > > +	if (esize == 4) \
> > > > +		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
> > > > +	else if (esize == 8) \
> > > > +		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
> > > > +	else if (esize == 16) \
> > > > +		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
> > > > +} while (0)
> > > > +
> > > > +#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
> > > > +	unsigned int i; \
> > > > +	uint32_t idx = cons_head & (r)->mask; \
> > > > +	const uint32_t size = (r)->size; \
> > > > +	uint32_t *ring = (uint32_t *)ring_start; \
> > > > +	uint32_t *obj = (uint32_t *)obj_table; \
> > > > +	if (likely(idx + n < size)) { \
> > > > +		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8) {\
> > > > +			obj[i] = ring[idx]; \
> > > > +			obj[i + 1] = ring[idx + 1]; \
> > > > +			obj[i + 2] = ring[idx + 2]; \
> > > > +			obj[i + 3] = ring[idx + 3]; \
> > > > +			obj[i + 4] = ring[idx + 4]; \
> > > > +			obj[i + 5] = ring[idx + 5]; \
> > > > +			obj[i + 6] = ring[idx + 6]; \
> > > > +			obj[i + 7] = ring[idx + 7]; \
> > > > +		} \
> > > > +		switch (n & 0x7) { \
> > > > +		case 7: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		case 6: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		case 5: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		case 4: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		case 3: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		case 2: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		case 1: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		} \
> > > > +	} else { \
> > > > +		for (i = 0; idx < size; i++, idx++) \
> > > > +			obj[i] = ring[idx]; \
> > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > +			obj[i] = ring[idx]; \
> > > > +	} \
> > > > +} while (0)
> > > > +
> > > > +#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
> > > > +	unsigned int i; \
> > > > +	uint32_t idx = cons_head & (r)->mask; \
> > > > +	const uint32_t size = (r)->size; \
> > > > +	uint64_t *ring = (uint64_t *)ring_start; \
> > > > +	uint64_t *obj = (uint64_t *)obj_table; \
> > > > +	if (likely(idx + n < size)) { \
> > > > +		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
> > > > +			obj[i] = ring[idx]; \
> > > > +			obj[i + 1] = ring[idx + 1]; \
> > > > +			obj[i + 2] = ring[idx + 2]; \
> > > > +			obj[i + 3] = ring[idx + 3]; \
> > > > +		} \
> > > > +		switch (n & 0x3) { \
> > > > +		case 3: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		case 2: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		case 1: \
> > > > +			obj[i++] = ring[idx++]; \
> > > > +		} \
> > > > +	} else { \
> > > > +		for (i = 0; idx < size; i++, idx++) \
> > > > +			obj[i] = ring[idx]; \
> > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > +			obj[i] = ring[idx]; \
> > > > +	} \
> > > > +} while (0)
> > > > +
> > > > +#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
> > > > +	unsigned int i; \
> > > > +	uint32_t idx = cons_head & (r)->mask; \
> > > > +	const uint32_t size = (r)->size; \
> > > > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > > > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > > > +	if (likely(idx + n < size)) { \
> > > > +		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
> > > > +			obj[i] = ring[idx]; \
> > > > +			obj[i + 1] = ring[idx + 1]; \
> > > > +		} \
> > > > +		switch (n & 0x1) { \
> > > > +		case 1: \
> > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > +		} \
> > > > +	} else { \
> > > > +		for (i = 0; idx < size; i++, idx++) \
> > > > +			obj[i] = ring[idx]; \
> > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > +			obj[i] = ring[idx]; \
> > > > +	} \
> > > > +} while (0)
> > > > +
> > > > +/* Between load and load. there might be cpu reorder in weak model
> > > > + * (powerpc/arm).
> > > > + * There are 2 choices for the users
> > > > + * 1.use rmb() memory barrier
> > > > + * 2.use one-direction load_acquire/store_release barrier,defined by
> > > > + * CONFIG_RTE_USE_C11_MEM_MODEL=y
> > > > + * It depends on performance test results.
> > > > + * By default, move common functions to rte_ring_generic.h
> > > > + */
> > > > +#ifdef RTE_USE_C11_MEM_MODEL
> > > > +#include "rte_ring_c11_mem.h"
> > > > +#else
> > > > +#include "rte_ring_generic.h"
> > > > +#endif
> > > > +
> > > > +/**
> > > > + * @internal Enqueue several objects on the ring
> > > > + *
> > > > + * @param r
> > > > + *   A pointer to the ring structure.
> > > > + * @param obj_table
> > > > + *   A pointer to a table of void * pointers (objects).
> > > > + * @param esize
> > > > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > > > + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> > > > + *   as passed while creating the ring, otherwise the results are undefined.
> > > > + * @param n
> > > > + *   The number of objects to add in the ring from the obj_table.
> > > > + * @param behavior
> > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items from a ring
> > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
> > > > + * @param is_sp
> > > > + *   Indicates whether to use single producer or multi-producer head update
> > > > + * @param free_space
> > > > + *   returns the amount of space after the enqueue operation has finished
> > > > + * @return
> > > > + *   Actual number of objects enqueued.
> > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > + */
> > > > +static __rte_always_inline unsigned int
> > > > +__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
> > > > +		unsigned int esize, unsigned int n,
> > > > +		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > > > +		unsigned int *free_space)
> >
> >
> > I like the idea to add esize as an argument to the public API, so the compiler
> > can do its job optimizing calls with constant esize.
> > Though I am not very happy with the rest of implementation:
> > 1. It doesn't really provide configurable elem size - only 4/8/16B elems are
> > supported.
> Agree. I was thinking other sizes can be added on need basis.
> However, I am wondering if we should just provide for 4B and then the users can use bulk operations to construct whatever they need?

I suppose it could be plan B... if there is no agreement on the generic case.
And for 4B elems, I guess you do have a particular use-case?

> It
> would mean extra work for the users.
> 
> > 2. A lot of code duplication with these 3 copies of ENQUEUE/DEQUEUE
> > macros.
> >
> > Looking at ENQUEUE/DEQUEUE macros, I can see that main loop always does
> > 32B copy per iteration.
> Yes, I tried to keep it the same as the existing one (originally, I guess the intention was to allow for 256b vector instructions to be
> generated)
> 
> > So wonder can we make a generic function that would do 32B copy per
> > iteration in a main loop, and copy tail  by 4B chunks?
> > That would avoid copy duplication and will allow user to have any elem size
> > (multiple of 4B) he wants.
> > Something like that (note didn't test it, just a rough idea):
> >
> >  static inline void
> > copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num, uint32_t
> > esize) {
> >         uint32_t i, sz;
> >
> >         sz = (num * esize) / sizeof(uint32_t);
> If 'num' is a compile time constant, 'sz' will be a compile time constant. Otherwise, this will result in a multiplication operation. 

Not always.
If esize is a compile time constant, then for esize that is a power of 2 (4, 8, 16, ...) it would be just one shift.
For other constant values it could be a 'mul', or in many cases just 2 shifts plus an 'add' (if the compiler is smart enough).
I.e., let's say for a 24B elem it would be either num * 6 or (num << 2) + (num << 1).
I suppose for non-power-of-2 elems it might be OK to take such a small perf hit.
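
To make the arithmetic concrete, here is a small sketch of the word-count computation for a few constant element sizes. The function names (words_generic, words_8B, words_24B) are made up for illustration and are not part of the patch; whether a given compiler actually emits the shift/add form is an assumption that would need checking in the generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 32-bit words occupied by num elements of esize bytes each.
 * esize is assumed to be a multiple of 4. */
static inline uint32_t
words_generic(uint32_t num, uint32_t esize)
{
	assert(esize % sizeof(uint32_t) == 0);
	return (num * esize) / sizeof(uint32_t);
}

/* esize == 8: two words per element, i.e. a single shift. */
static inline uint32_t
words_8B(uint32_t num)
{
	return num << 1;
}

/* esize == 24: six words per element, i.e. two shifts plus an add. */
static inline uint32_t
words_24B(uint32_t num)
{
	return (num << 2) + (num << 1);
}
```

With esize known at compile time, words_generic() can reduce to the same expression as the hand-written variants.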

>I have tried 
> to avoid the multiplication operation and try to use shift and mask operations (just like how the rest of the ring code does).
> 
> >
> >         for (i = 0; i < (sz & ~7); i += 8)
> >                 memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> I had used memcpy to start with (for the entire copy operation), performance is not the same for 64b elements when compared with the
> existing ring APIs (some cases more and some cases less).

I remember that from one of your previous mails; that's why I suggest here using memcpy() with a fixed size in a loop.
That way, for each iteration the compiler will replace memcpy() with instructions that copy 32B in whatever way it thinks is optimal
(same as for the original macro, I think).
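
A minimal standalone sketch of that idea, not tested against DPDK: the name copy_elems_sketch is hypothetical, and the tail is written as a plain loop rather than the switch from the draft. The point is only that the memcpy() length in the main loop is a constant 32B.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy num elements of esize bytes each from src to dst.
 * esize is assumed to be a multiple of 4; buffers must not overlap. */
static inline void
copy_elems_sketch(uint32_t *dst, const uint32_t *src, uint32_t num,
		uint32_t esize)
{
	uint32_t i;
	uint32_t sz;

	assert(esize % sizeof(uint32_t) == 0);

	/* total number of 32-bit words to copy */
	sz = (num * esize) / sizeof(uint32_t);

	/* main loop: constant-size memcpy of 8 words (32B) per iteration */
	for (i = 0; i < (sz & ~(uint32_t)7); i += 8)
		memcpy(dst + i, src + i, 8 * sizeof(uint32_t));

	/* tail: the remaining 0..7 words, one 4B word at a time */
	for (; i < sz; i++)
		dst[i] = src[i];
}
```

Because the length passed to memcpy() never varies, the compiler is free to expand it inline instead of calling the library routine.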

> 
> IMO, we have to keep the performance of the 64b and 128b the same as what we get with the existing ring and event-ring APIs. That would
> allow us to replace them with these new APIs. I suggest that we keep the macros in this patch for 64b and 128b.

I still think we probably can achieve that without duplicating macros, while still supporting arbitrary elem size.
See above.

> For the rest of the sizes, we could put a for loop around 32b macro (this would allow for all sizes as well).
> 
> >
> >         switch (sz & 7) {
> >         case 7: du32[sz - 7] = su32[sz - 7]; /* fallthrough */
> >         case 6: du32[sz - 6] = su32[sz - 6]; /* fallthrough */
> >         case 5: du32[sz - 5] = su32[sz - 5]; /* fallthrough */
> >         case 4: du32[sz - 4] = su32[sz - 4]; /* fallthrough */
> >         case 3: du32[sz - 3] = su32[sz - 3]; /* fallthrough */
> >         case 2: du32[sz - 2] = su32[sz - 2]; /* fallthrough */
> >         case 1: du32[sz - 1] = su32[sz - 1]; /* fallthrough */
> >         }
> > }
> >
> > static inline void
> > enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
> >                 void *obj_table, uint32_t num, uint32_t esize) {
> >         uint32_t idx, n;
> >         uint32_t *du32;
> >
> >         const uint32_t size = r->size;
> >
> >         idx = prod_head & (r)->mask;
> >
> >         du32 = ring_start + idx * sizeof(uint32_t);
> >
> >         if (idx + num < size)
> >                 copy_elems(du32, obj_table, num, esize);
> >         else {
> >                 n = size - idx;
> >                 copy_elems(du32, obj_table, n, esize);
> >                 copy_elems(ring_start, obj_table + n * sizeof(uint32_t),
> >                         num - n, esize);
> >         }
> > }
> >
> > And then, in that function, instead of ENQUEUE_PTRS_ELEM(), just:
> >
> > enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);
> >
> >
> > > > +{
> > > > +	uint32_t prod_head, prod_next;
> > > > +	uint32_t free_entries;
> > > > +
> > > > +	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
> > > > +			&prod_head, &prod_next, &free_entries);
> > > > +	if (n == 0)
> > > > +		goto end;
> > > > +
> > > > +	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
> > > > +
> > > > +	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> > > > +end:
> > > > +	if (free_space != NULL)
> > > > +		*free_space = free_entries - n;
> > > > +	return n;
> > > > +}
> > > > +

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-15  9:34             ` Ananyev, Konstantin
@ 2019-10-17  4:46               ` Honnappa Nagarahalli
  2019-10-17 11:51                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-17  4:46 UTC (permalink / raw)
  To: Ananyev, Konstantin, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, Honnappa Nagarahalli, nd, nd

<snip>

> Hi Honnappa,
> 
> > > > >
> > > > > Current APIs assume ring elements to be pointers. However, in
> > > > > many use cases, the size can be different. Add new APIs to
> > > > > support configurable ring element sizes.
> > > > >
> > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > <honnappa.nagarahalli@arm.com>
> > > > > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > ---
> > > > >  lib/librte_ring/Makefile             |   3 +-
> > > > >  lib/librte_ring/meson.build          |   3 +
> > > > >  lib/librte_ring/rte_ring.c           |  45 +-
> > > > >  lib/librte_ring/rte_ring.h           |   1 +
> > > > >  lib/librte_ring/rte_ring_elem.h      | 946
> +++++++++++++++++++++++++++
> > > > >  lib/librte_ring/rte_ring_version.map |   2 +
> > > > >  6 files changed, 991 insertions(+), 9 deletions(-)  create mode
> > > > > 100644 lib/librte_ring/rte_ring_elem.h
> > > > >
> > > > > diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> > > > > index 21a36770d..515a967bb 100644
> > > > > --- a/lib/librte_ring/Makefile
> > > > > +++ b/lib/librte_ring/Makefile

<snip>

> > > > > +
> > > > > +# rte_ring_create_elem and rte_ring_get_memsize_elem are
> > > > > +experimental allow_experimental_apis = true
> > > > > diff --git a/lib/librte_ring/rte_ring.c
> > > > > b/lib/librte_ring/rte_ring.c index d9b308036..6fed3648b 100644
> > > > > --- a/lib/librte_ring/rte_ring.c
> > > > > +++ b/lib/librte_ring/rte_ring.c
> > > > > @@ -33,6 +33,7 @@
> > > > >  #include <rte_tailq.h>
> > > > >
> > > > >  #include "rte_ring.h"
> > > > > +#include "rte_ring_elem.h"
> > > > >

<snip>

> > > > > diff --git a/lib/librte_ring/rte_ring_elem.h
> > > > > b/lib/librte_ring/rte_ring_elem.h new file mode 100644 index
> > > > > 000000000..860f059ad
> > > > > --- /dev/null
> > > > > +++ b/lib/librte_ring/rte_ring_elem.h
> > > > > @@ -0,0 +1,946 @@
> > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > + *
> > > > > + * Copyright (c) 2019 Arm Limited
> > > > > + * Copyright (c) 2010-2017 Intel Corporation
> > > > > + * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
> > > > > + * All rights reserved.
> > > > > + * Derived from FreeBSD's bufring.h
> > > > > + * Used as BSD-3 Licensed with permission from Kip Macy.
> > > > > + */
> > > > > +
> > > > > +#ifndef _RTE_RING_ELEM_H_
> > > > > +#define _RTE_RING_ELEM_H_
> > > > > +

<snip>

> > > > > +
> > > > > +/* the actual enqueue of pointers on the ring.
> > > > > + * Placed here since identical code needed in both
> > > > > + * single and multi producer enqueue functions.
> > > > > + */
> > > > > +#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table,
> > > > > +esize, n)
> > > > > do { \
> > > > > +	if (esize == 4) \
> > > > > +		ENQUEUE_PTRS_32(r, ring_start, prod_head,
> obj_table, n); \
> > > > > +	else if (esize == 8) \
> > > > > +		ENQUEUE_PTRS_64(r, ring_start, prod_head,
> obj_table, n); \
> > > > > +	else if (esize == 16) \
> > > > > +		ENQUEUE_PTRS_128(r, ring_start, prod_head,
> obj_table, n);
> > > \ }
> > > > > while
> > > > > +(0)
> > > > > +
> > > > > +#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n)
> do { \
> > > > > +	unsigned int i; \
> > > > > +	const uint32_t size = (r)->size; \
> > > > > +	uint32_t idx = prod_head & (r)->mask; \
> > > > > +	uint32_t *ring = (uint32_t *)ring_start; \
> > > > > +	uint32_t *obj = (uint32_t *)obj_table; \
> > > > > +	if (likely(idx + n < size)) { \
> > > > > +		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8)
> { \
> > > > > +			ring[idx] = obj[i]; \
> > > > > +			ring[idx + 1] = obj[i + 1]; \
> > > > > +			ring[idx + 2] = obj[i + 2]; \
> > > > > +			ring[idx + 3] = obj[i + 3]; \
> > > > > +			ring[idx + 4] = obj[i + 4]; \
> > > > > +			ring[idx + 5] = obj[i + 5]; \
> > > > > +			ring[idx + 6] = obj[i + 6]; \
> > > > > +			ring[idx + 7] = obj[i + 7]; \
> > > > > +		} \
> > > > > +		switch (n & 0x7) { \
> > > > > +		case 7: \
> > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > +		case 6: \
> > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > +		case 5: \
> > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > +		case 4: \
> > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > +		case 3: \
> > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > +		case 2: \
> > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > +		case 1: \
> > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > +		} \
> > > > > +	} else { \
> > > > > +		for (i = 0; idx < size; i++, idx++)\
> > > > > +			ring[idx] = obj[i]; \
> > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > +			ring[idx] = obj[i]; \
> > > > > +	} \
> > > > > +} while (0)
> > > > > +
> > > > > +#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n)
> do { \
> > > > > +	unsigned int i; \
> > > > > +	const uint32_t size = (r)->size; \
> > > > > +	uint32_t idx = prod_head & (r)->mask; \
> > > > > +	uint64_t *ring = (uint64_t *)ring_start; \
> > > > > +	uint64_t *obj = (uint64_t *)obj_table; \
> > > > > +	if (likely(idx + n < size)) { \
> > > > > +		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4)
> { \
> > > > > +			ring[idx] = obj[i]; \
> > > > > +			ring[idx + 1] = obj[i + 1]; \
> > > > > +			ring[idx + 2] = obj[i + 2]; \
> > > > > +			ring[idx + 3] = obj[i + 3]; \
> > > > > +		} \
> > > > > +		switch (n & 0x3) { \
> > > > > +		case 3: \
> > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > +		case 2: \
> > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > +		case 1: \
> > > > > +			ring[idx++] = obj[i++]; \
> > > > > +		} \
> > > > > +	} else { \
> > > > > +		for (i = 0; idx < size; i++, idx++)\
> > > > > +			ring[idx] = obj[i]; \
> > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > +			ring[idx] = obj[i]; \
> > > > > +	} \
> > > > > +} while (0)
> > > > > +
> > > > > +#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table,
> > > > > +n) do
> > > { \
> > > > > +	unsigned int i; \
> > > > > +	const uint32_t size = (r)->size; \
> > > > > +	uint32_t idx = prod_head & (r)->mask; \
> > > > > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > > > > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > > > > +	if (likely(idx + n < size)) { \
> > > > > +		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
> > > > > +			ring[idx] = obj[i]; \
> > > > > +			ring[idx + 1] = obj[i + 1]; \
> > > > > +		} \
> > > > > +		switch (n & 0x1) { \
> > > > > +		case 1: \
> > > > > +			ring[idx++] = obj[i++]; \
> > > > > +		} \
> > > > > +	} else { \
> > > > > +		for (i = 0; idx < size; i++, idx++)\
> > > > > +			ring[idx] = obj[i]; \
> > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > +			ring[idx] = obj[i]; \
> > > > > +	} \
> > > > > +} while (0)
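
A side note on the 128b loop bound quoted above: with the condition i < (n >> 1) and a step of 2, the main loop appears to copy only about half of the elements once n >= 4 (for n = 4 it runs once and copies 2). The sketch below shows what looks like the intended bound, n & ~1. It is illustrative only: struct elem128, copy_128 and copied_with_quoted_bound are hypothetical names, and a 16B struct stands in for __uint128_t for portability.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16B element standing in for __uint128_t. */
struct elem128 {
	uint64_t lo, hi;
};

/* Copy n 16B elements: two per iteration, then the odd one if any. */
static inline void
copy_128(struct elem128 *ring, const struct elem128 *obj, uint32_t n)
{
	uint32_t i;

	for (i = 0; i < (n & ~(uint32_t)0x1); i += 2) {
		ring[i] = obj[i];
		ring[i + 1] = obj[i + 1];
	}
	if (n & 0x1)
		ring[i] = obj[i];
}

/* Counts how many elements the quoted bound, i < (n >> 1) with
 * i += 2, would copy, including the odd-element tail.
 */
static inline uint32_t
copied_with_quoted_bound(uint32_t n)
{
	uint32_t i, copied = 0;

	for (i = 0; i < (n >> 1); i += 2)
		copied += 2;
	return copied + (n & 0x1);
}
```

The same bound is used in the 128b dequeue macro, so the observation would apply there as well.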
> > > > > +
> > > > > +/* the actual copy of pointers on the ring to obj_table.
> > > > > + * Placed here since identical code needed in both
> > > > > + * single and multi consumer dequeue functions.
> > > > > + */
> > > > > +#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table,
> > > > > +esize, n)
> > > > > do { \
> > > > > +	if (esize == 4) \
> > > > > +		DEQUEUE_PTRS_32(r, ring_start, cons_head,
> obj_table, n); \
> > > > > +	else if (esize == 8) \
> > > > > +		DEQUEUE_PTRS_64(r, ring_start, cons_head,
> obj_table, n); \
> > > > > +	else if (esize == 16) \
> > > > > +		DEQUEUE_PTRS_128(r, ring_start, cons_head,
> obj_table, n);
> > > \ }
> > > > > while
> > > > > +(0)
> > > > > +
> > > > > +#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do
> { \
> > > > > +	unsigned int i; \
> > > > > +	uint32_t idx = cons_head & (r)->mask; \
> > > > > +	const uint32_t size = (r)->size; \
> > > > > +	uint32_t *ring = (uint32_t *)ring_start; \
> > > > > +	uint32_t *obj = (uint32_t *)obj_table; \
> > > > > +	if (likely(idx + n < size)) { \
> > > > > +		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8)
> {\
> > > > > +			obj[i] = ring[idx]; \
> > > > > +			obj[i + 1] = ring[idx + 1]; \
> > > > > +			obj[i + 2] = ring[idx + 2]; \
> > > > > +			obj[i + 3] = ring[idx + 3]; \
> > > > > +			obj[i + 4] = ring[idx + 4]; \
> > > > > +			obj[i + 5] = ring[idx + 5]; \
> > > > > +			obj[i + 6] = ring[idx + 6]; \
> > > > > +			obj[i + 7] = ring[idx + 7]; \
> > > > > +		} \
> > > > > +		switch (n & 0x7) { \
> > > > > +		case 7: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		case 6: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		case 5: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		case 4: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		case 3: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		case 2: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		case 1: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		} \
> > > > > +	} else { \
> > > > > +		for (i = 0; idx < size; i++, idx++) \
> > > > > +			obj[i] = ring[idx]; \
> > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > +			obj[i] = ring[idx]; \
> > > > > +	} \
> > > > > +} while (0)
> > > > > +
> > > > > +#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do
> { \
> > > > > +	unsigned int i; \
> > > > > +	uint32_t idx = cons_head & (r)->mask; \
> > > > > +	const uint32_t size = (r)->size; \
> > > > > +	uint64_t *ring = (uint64_t *)ring_start; \
> > > > > +	uint64_t *obj = (uint64_t *)obj_table; \
> > > > > +	if (likely(idx + n < size)) { \
> > > > > +		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4)
> {\
> > > > > +			obj[i] = ring[idx]; \
> > > > > +			obj[i + 1] = ring[idx + 1]; \
> > > > > +			obj[i + 2] = ring[idx + 2]; \
> > > > > +			obj[i + 3] = ring[idx + 3]; \
> > > > > +		} \
> > > > > +		switch (n & 0x3) { \
> > > > > +		case 3: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		case 2: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		case 1: \
> > > > > +			obj[i++] = ring[idx++]; \
> > > > > +		} \
> > > > > +	} else { \
> > > > > +		for (i = 0; idx < size; i++, idx++) \
> > > > > +			obj[i] = ring[idx]; \
> > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > +			obj[i] = ring[idx]; \
> > > > > +	} \
> > > > > +} while (0)
> > > > > +
> > > > > +#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table,
> > > > > +n) do
> > > { \
> > > > > +	unsigned int i; \
> > > > > +	uint32_t idx = cons_head & (r)->mask; \
> > > > > +	const uint32_t size = (r)->size; \
> > > > > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > > > > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > > > > +	if (likely(idx + n < size)) { \
> > > > > +		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
> > > > > +			obj[i] = ring[idx]; \
> > > > > +			obj[i + 1] = ring[idx + 1]; \
> > > > > +		} \
> > > > > +		switch (n & 0x1) { \
> > > > > +		case 1: \
> > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > +		} \
> > > > > +	} else { \
> > > > > +		for (i = 0; idx < size; i++, idx++) \
> > > > > +			obj[i] = ring[idx]; \
> > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > +			obj[i] = ring[idx]; \
> > > > > +	} \
> > > > > +} while (0)
> > > > > +
> > > > > +/* Between load and load. there might be cpu reorder in weak
> > > > > +model
> > > > > + * (powerpc/arm).
> > > > > + * There are 2 choices for the users
> > > > > + * 1.use rmb() memory barrier
> > > > > + * 2.use one-direction load_acquire/store_release
> > > > > +barrier,defined by
> > > > > + * CONFIG_RTE_USE_C11_MEM_MODEL=y
> > > > > + * It depends on performance test results.
> > > > > + * By default, move common functions to rte_ring_generic.h  */
> > > > > +#ifdef RTE_USE_C11_MEM_MODEL #include "rte_ring_c11_mem.h"
> > > > > +#else
> > > > > +#include "rte_ring_generic.h"
> > > > > +#endif
> > > > > +
> > > > > +/**
> > > > > + * @internal Enqueue several objects on the ring
> > > > > + *
> > > > > + * @param r
> > > > > + *   A pointer to the ring structure.
> > > > > + * @param obj_table
> > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > + * @param esize
> > > > > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > > > > + *   Currently, sizes 4, 8 and 16 are supported. This should be the
> same
> > > > > + *   as passed while creating the ring, otherwise the results are
> undefined.
> > > > > + * @param n
> > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > + * @param behavior
> > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items
> from a
> > > ring
> > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible
> > > from
> > > > > ring
> > > > > + * @param is_sp
> > > > > + *   Indicates whether to use single producer or multi-producer head
> > > update
> > > > > + * @param free_space
> > > > > + *   returns the amount of space after the enqueue operation has
> > > finished
> > > > > + * @return
> > > > > + *   Actual number of objects enqueued.
> > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > + */
> > > > > +static __rte_always_inline unsigned int
> > > > > +__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const
> obj_table,
> > > > > +		unsigned int esize, unsigned int n,
> > > > > +		enum rte_ring_queue_behavior behavior, unsigned
> int is_sp,
> > > > > +		unsigned int *free_space)
> > >
> > >
> > > I like the idea to add esize as an argument to the public API, so
> > > the compiler can do its job optimizing calls with constant esize.
> > > Though I am not very happy with the rest of implementation:
> > > 1. It doesn't really provide configurable elem size - only 4/8/16B
> > > elems are supported.
> > Agree. I was thinking other sizes can be added on need basis.
> > However, I am wondering if we should just provide for 4B and then the
> users can use bulk operations to construct whatever they need?
> 
> I suppose it could be plan B... if there would be no agreement on generic case.
> And for 4B elems, I guess you do have a particular use-case?
Yes

> 
> > It
> > would mean extra work for the users.
> >
> > > 2. A lot of code duplication with these 3 copies of ENQUEUE/DEQUEUE
> > > macros.
> > >
> > > Looking at ENQUEUE/DEQUEUE macros, I can see that main loop always
> > > does 32B copy per iteration.
> > Yes, I tried to keep it the same as the existing one (originally, I
> > guess the intention was to allow for 256b vector instructions to be
> > generated)
> >
> > > So wonder can we make a generic function that would do 32B copy per
> > > iteration in a main loop, and copy tail  by 4B chunks?
> > > That would avoid copy duplication and will allow user to have any
> > > elem size (multiple of 4B) he wants.
> > > Something like that (note didn't test it, just a rough idea):
> > >
> > >  static inline void
> > > copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
> > > uint32_t
> > > esize) {
> > >         uint32_t i, sz;
> > >
> > >         sz = (num * esize) / sizeof(uint32_t);
> > If 'num' is a compile time constant, 'sz' will be a compile time constant.
> Otherwise, this will result in a multiplication operation.
> 
> Not always.
> If esize is compile time constant, then for esize as power of 2 (4,8,16,...), it
> would be just one shift.
> For other constant values it could be a 'mul' or in many cases just 2 shifts plus
> 'add' (if compiler is smart enough).
> I.E. let say for 24B elem is would be either num * 6 or (num << 2) + (num <<
> 1).
With num * 15 it has to be (num << 3) + (num << 2) + (num << 1) + num
Not sure if the compiler will do this.
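
For reference, a multiply by 15 can also be written as one shift and one subtract, (num << 4) - num, which gives the same result in unsigned 32-bit arithmetic. Whether a particular compiler prefers that over a 'mul' is not guaranteed. A small sketch with made-up helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Four-term form: 15 = 8 + 4 + 2 + 1. */
static inline uint32_t
mul15_shift_add(uint32_t num)
{
	return (num << 3) + (num << 2) + (num << 1) + num;
}

/* Two-term form: 15 = 16 - 1. Unsigned arithmetic wraps modulo 2^32,
 * so this matches num * 15 for every uint32_t value.
 */
static inline uint32_t
mul15_shift_sub(uint32_t num)
{
	return (num << 4) - num;
}
```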

> I suppose for non-power of 2 elems it might be ok to get such small perf hit.
Agreed, it should be OK not to focus on this right now.

> 
> >I have tried
> > to avoid the multiplication operation and try to use shift and mask
> operations (just like how the rest of the ring code does).
> >
> > >
> > >         for (i = 0; i < (sz & ~7); i += 8)
> > >                 memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> > I had used memcpy to start with (for the entire copy operation),
> > performance is not the same for 64b elements when compared with the
> existing ring APIs (some cases more and some cases less).
> 
> I remember that from one of your previous mails, that's why here I suggest to
> use in a loop memcpy() with fixed size.
> That way for each iteration complier will replace memcpy() with instructions
> to copy 32B in a way he thinks is optimal (same as for original macro, I think).
I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results are as follows. The numbers in brackets are with the code on master.
gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0

RTE>>ring_perf_elem_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 5
MP/MC single enq/dequeue: 40 (35)
SP/SC burst enq/dequeue (size: 8): 2
MP/MC burst enq/dequeue (size: 8): 6
SP/SC burst enq/dequeue (size: 32): 1 (2)
MP/MC burst enq/dequeue (size: 32): 2

### Testing empty dequeue ###
SC empty dequeue: 2.11
MC empty dequeue: 1.41 (2.11)

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)

### Testing using two NUMA nodes ###
SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)

On one of the Arm platforms:
MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)

On another Arm platform, all numbers are the same or slightly better.

I can post the patch with this change if you want to run some benchmarks on your platform.
I have not used the exact code you suggested; instead, I have used the same logic in a single macro with memcpy.
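
For readers without the patch at hand, the "same logic with memcpy" could look roughly like the sketch below. This is not the code that was benchmarked: struct mini_ring, copy_words and enqueue_elems_sketch are hypothetical names, the ring is a plain array, and no head/tail synchronisation is shown. Offsets into the ring and the object table are scaled by the element size.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for struct rte_ring: only what the copy needs. */
struct mini_ring {
	uint32_t size;  /* number of slots, power of 2 */
	uint32_t mask;  /* size - 1 */
	uint32_t esize; /* element size in bytes, multiple of 4 */
	void *slots;    /* size * esize bytes of storage */
};

/* Copy nwords 4B words: 32B per main-loop iteration, then a 4B tail. */
static inline void
copy_words(uint32_t *dst, const uint32_t *src, uint32_t nwords)
{
	uint32_t i;

	for (i = 0; i < (nwords & ~(uint32_t)7); i += 8)
		memcpy(dst + i, src + i, 8 * sizeof(uint32_t));
	for (; i < nwords; i++)
		dst[i] = src[i];
}

/* Copy num elements into the ring starting at prod_head, wrapping
 * around the end of the storage if needed.
 */
static inline void
enqueue_elems_sketch(struct mini_ring *r, uint32_t prod_head,
		const void *obj_table, uint32_t num)
{
	const uint32_t wpe = r->esize / sizeof(uint32_t); /* words per elem */
	const uint32_t idx = prod_head & r->mask;
	uint32_t *ring = (uint32_t *)r->slots;
	const uint32_t *obj = (const uint32_t *)obj_table;
	uint32_t n;

	assert(r->esize % sizeof(uint32_t) == 0 && num <= r->size);

	if (idx + num <= r->size) {
		copy_words(ring + idx * wpe, obj, num * wpe);
	} else {
		n = r->size - idx; /* elements that fit before the wrap */
		copy_words(ring + idx * wpe, obj, n * wpe);
		copy_words(ring, obj + n * wpe, (num - n) * wpe);
	}
}
```

With an 8-slot ring of 12B elements, enqueuing 4 elements at prod_head = 6 fills slots 6 and 7 and then wraps to slots 0 and 1.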

> 
> >
> > IMO, we have to keep the performance of the 64b and 128b the same as
> > what we get with the existing ring and event-ring APIs. That would allow us
> to replace them with these new APIs. I suggest that we keep the macros in
> this patch for 64b and 128b.
> 
> I still think we probably can achieve that without duplicating macros, while
> still supporting arbitrary elem size.
> See above.
> 
> > For the rest of the sizes, we could put a for loop around 32b macro (this
> would allow for all sizes as well).
> >
> > >
> > >         switch (sz & 7) {
> > >         case 7: du32[sz - 7] = su32[sz - 7]; /* fallthrough */
> > >         case 6: du32[sz - 6] = su32[sz - 6]; /* fallthrough */
> > >         case 5: du32[sz - 5] = su32[sz - 5]; /* fallthrough */
> > >         case 4: du32[sz - 4] = su32[sz - 4]; /* fallthrough */
> > >         case 3: du32[sz - 3] = su32[sz - 3]; /* fallthrough */
> > >         case 2: du32[sz - 2] = su32[sz - 2]; /* fallthrough */
> > >         case 1: du32[sz - 1] = su32[sz - 1]; /* fallthrough */
> > >         }
> > > }
> > >
> > > static inline void
> > > enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
> > >                 void *obj_table, uint32_t num, uint32_t esize) {
> > >         uint32_t idx, n;
> > >         uint32_t *du32;
> > >
> > >         const uint32_t size = r->size;
> > >
> > >         idx = prod_head & (r)->mask;
> > >
> > >         du32 = ring_start + idx * sizeof(uint32_t);
> > >
> > >         if (idx + num < size)
> > >                 copy_elems(du32, obj_table, num, esize);
> > >         else {
> > >                 n = size - idx;
> > >                 copy_elems(du32, obj_table, n, esize);
> > >                 copy_elems(ring_start, obj_table + n * sizeof(uint32_t),
> > >                         num - n, esize);
> > >         }
> > > }
> > >
> > > And then, in that function, instead of ENQUEUE_PTRS_ELEM(), just:
> > >
> > > enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);
> > >
> > >
> > > > > +{
> > > > > +	uint32_t prod_head, prod_next;
> > > > > +	uint32_t free_entries;
> > > > > +
> > > > > +	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
> > > > > +			&prod_head, &prod_next, &free_entries);
> > > > > +	if (n == 0)
> > > > > +		goto end;
> > > > > +
> > > > > +	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize,
> n);
> > > > > +
> > > > > +	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> > > > > +end:
> > > > > +	if (free_space != NULL)
> > > > > +		*free_space = free_entries - n;
> > > > > +	return n;
> > > > > +}
> > > > > +

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-17  4:46               ` Honnappa Nagarahalli
@ 2019-10-17 11:51                 ` Ananyev, Konstantin
  2019-10-17 20:16                   ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-17 11:51 UTC (permalink / raw)
  To: Honnappa Nagarahalli, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd, nd


> > > > > > Current APIs assume ring elements to be pointers. However, in
> > > > > > many use cases, the size can be different. Add new APIs to
> > > > > > support configurable ring element sizes.
> > > > > >
> > > > > > Signed-off-by: Honnappa Nagarahalli
> > > > > > <honnappa.nagarahalli@arm.com>
> > > > > > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > > > > > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > > > > > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > > > > > ---
> > > > > >  lib/librte_ring/Makefile             |   3 +-
> > > > > >  lib/librte_ring/meson.build          |   3 +
> > > > > >  lib/librte_ring/rte_ring.c           |  45 +-
> > > > > >  lib/librte_ring/rte_ring.h           |   1 +
> > > > > >  lib/librte_ring/rte_ring_elem.h      | 946
> > +++++++++++++++++++++++++++
> > > > > >  lib/librte_ring/rte_ring_version.map |   2 +
> > > > > >  6 files changed, 991 insertions(+), 9 deletions(-)  create mode
> > > > > > 100644 lib/librte_ring/rte_ring_elem.h
> > > > > >
> > > > > > diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
> > > > > > index 21a36770d..515a967bb 100644
> > > > > > --- a/lib/librte_ring/Makefile
> > > > > > +++ b/lib/librte_ring/Makefile
> 
> <snip>
> 
> > > > > > +
> > > > > > +# rte_ring_create_elem and rte_ring_get_memsize_elem are
> > > > > > +experimental allow_experimental_apis = true
> > > > > > diff --git a/lib/librte_ring/rte_ring.c
> > > > > > b/lib/librte_ring/rte_ring.c index d9b308036..6fed3648b 100644
> > > > > > --- a/lib/librte_ring/rte_ring.c
> > > > > > +++ b/lib/librte_ring/rte_ring.c
> > > > > > @@ -33,6 +33,7 @@
> > > > > >  #include <rte_tailq.h>
> > > > > >
> > > > > >  #include "rte_ring.h"
> > > > > > +#include "rte_ring_elem.h"
> > > > > >
> 
> <snip>
> 
> > > > > > diff --git a/lib/librte_ring/rte_ring_elem.h
> > > > > > b/lib/librte_ring/rte_ring_elem.h new file mode 100644 index
> > > > > > 000000000..860f059ad
> > > > > > --- /dev/null
> > > > > > +++ b/lib/librte_ring/rte_ring_elem.h
> > > > > > @@ -0,0 +1,946 @@
> > > > > > +/* SPDX-License-Identifier: BSD-3-Clause
> > > > > > + *
> > > > > > + * Copyright (c) 2019 Arm Limited
> > > > > > + * Copyright (c) 2010-2017 Intel Corporation
> > > > > > + * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
> > > > > > + * All rights reserved.
> > > > > > + * Derived from FreeBSD's bufring.h
> > > > > > + * Used as BSD-3 Licensed with permission from Kip Macy.
> > > > > > + */
> > > > > > +
> > > > > > +#ifndef _RTE_RING_ELEM_H_
> > > > > > +#define _RTE_RING_ELEM_H_
> > > > > > +
> 
> <snip>
> 
> > > > > > +
> > > > > > +/* the actual enqueue of pointers on the ring.
> > > > > > + * Placed here since identical code needed in both
> > > > > > + * single and multi producer enqueue functions.
> > > > > > + */
> > > > > > +#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table,
> > > > > > +esize, n)
> > > > > > do { \
> > > > > > +	if (esize == 4) \
> > > > > > +		ENQUEUE_PTRS_32(r, ring_start, prod_head,
> > obj_table, n); \
> > > > > > +	else if (esize == 8) \
> > > > > > +		ENQUEUE_PTRS_64(r, ring_start, prod_head,
> > obj_table, n); \
> > > > > > +	else if (esize == 16) \
> > > > > > +		ENQUEUE_PTRS_128(r, ring_start, prod_head,
> > obj_table, n);
> > > > \ }
> > > > > > while
> > > > > > +(0)
> > > > > > +
> > > > > > +#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n)
> > do { \
> > > > > > +	unsigned int i; \
> > > > > > +	const uint32_t size = (r)->size; \
> > > > > > +	uint32_t idx = prod_head & (r)->mask; \
> > > > > > +	uint32_t *ring = (uint32_t *)ring_start; \
> > > > > > +	uint32_t *obj = (uint32_t *)obj_table; \
> > > > > > +	if (likely(idx + n < size)) { \
> > > > > > +		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8)
> > { \
> > > > > > +			ring[idx] = obj[i]; \
> > > > > > +			ring[idx + 1] = obj[i + 1]; \
> > > > > > +			ring[idx + 2] = obj[i + 2]; \
> > > > > > +			ring[idx + 3] = obj[i + 3]; \
> > > > > > +			ring[idx + 4] = obj[i + 4]; \
> > > > > > +			ring[idx + 5] = obj[i + 5]; \
> > > > > > +			ring[idx + 6] = obj[i + 6]; \
> > > > > > +			ring[idx + 7] = obj[i + 7]; \
> > > > > > +		} \
> > > > > > +		switch (n & 0x7) { \
> > > > > > +		case 7: \
> > > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > > +		case 6: \
> > > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > > +		case 5: \
> > > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > > +		case 4: \
> > > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > > +		case 3: \
> > > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > > +		case 2: \
> > > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > > +		case 1: \
> > > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > > +		} \
> > > > > > +	} else { \
> > > > > > +		for (i = 0; idx < size; i++, idx++)\
> > > > > > +			ring[idx] = obj[i]; \
> > > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > > +			ring[idx] = obj[i]; \
> > > > > > +	} \
> > > > > > +} while (0)
> > > > > > +
> > > > > > +#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n)
> > do { \
> > > > > > +	unsigned int i; \
> > > > > > +	const uint32_t size = (r)->size; \
> > > > > > +	uint32_t idx = prod_head & (r)->mask; \
> > > > > > +	uint64_t *ring = (uint64_t *)ring_start; \
> > > > > > +	uint64_t *obj = (uint64_t *)obj_table; \
> > > > > > +	if (likely(idx + n < size)) { \
> > > > > > +		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4)
> > { \
> > > > > > +			ring[idx] = obj[i]; \
> > > > > > +			ring[idx + 1] = obj[i + 1]; \
> > > > > > +			ring[idx + 2] = obj[i + 2]; \
> > > > > > +			ring[idx + 3] = obj[i + 3]; \
> > > > > > +		} \
> > > > > > +		switch (n & 0x3) { \
> > > > > > +		case 3: \
> > > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > > +		case 2: \
> > > > > > +			ring[idx++] = obj[i++]; /* fallthrough */ \
> > > > > > +		case 1: \
> > > > > > +			ring[idx++] = obj[i++]; \
> > > > > > +		} \
> > > > > > +	} else { \
> > > > > > +		for (i = 0; idx < size; i++, idx++)\
> > > > > > +			ring[idx] = obj[i]; \
> > > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > > +			ring[idx] = obj[i]; \
> > > > > > +	} \
> > > > > > +} while (0)
> > > > > > +
> > > > > > +#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table,
> > > > > > +n) do
> > > > { \
> > > > > > +	unsigned int i; \
> > > > > > +	const uint32_t size = (r)->size; \
> > > > > > +	uint32_t idx = prod_head & (r)->mask; \
> > > > > > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > > > > > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > > > > > +	if (likely(idx + n < size)) { \
> > > > > > +		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
> > > > > > +			ring[idx] = obj[i]; \
> > > > > > +			ring[idx + 1] = obj[i + 1]; \
> > > > > > +		} \
> > > > > > +		switch (n & 0x1) { \
> > > > > > +		case 1: \
> > > > > > +			ring[idx++] = obj[i++]; \
> > > > > > +		} \
> > > > > > +	} else { \
> > > > > > +		for (i = 0; idx < size; i++, idx++)\
> > > > > > +			ring[idx] = obj[i]; \
> > > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > > +			ring[idx] = obj[i]; \
> > > > > > +	} \
> > > > > > +} while (0)
> > > > > > +
> > > > > > +/* the actual copy of pointers on the ring to obj_table.
> > > > > > + * Placed here since identical code needed in both
> > > > > > + * single and multi consumer dequeue functions.
> > > > > > + */
> > > > > > > +#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
> > > > > > > +	if (esize == 4) \
> > > > > > > +		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
> > > > > > > +	else if (esize == 8) \
> > > > > > > +		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
> > > > > > > +	else if (esize == 16) \
> > > > > > > +		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
> > > > > > > +} while (0)
> > > > > > +
> > > > > > > +#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
> > > > > > +	unsigned int i; \
> > > > > > +	uint32_t idx = cons_head & (r)->mask; \
> > > > > > +	const uint32_t size = (r)->size; \
> > > > > > +	uint32_t *ring = (uint32_t *)ring_start; \
> > > > > > +	uint32_t *obj = (uint32_t *)obj_table; \
> > > > > > +	if (likely(idx + n < size)) { \
> > > > > > > +		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8) {\
> > > > > > +			obj[i] = ring[idx]; \
> > > > > > +			obj[i + 1] = ring[idx + 1]; \
> > > > > > +			obj[i + 2] = ring[idx + 2]; \
> > > > > > +			obj[i + 3] = ring[idx + 3]; \
> > > > > > +			obj[i + 4] = ring[idx + 4]; \
> > > > > > +			obj[i + 5] = ring[idx + 5]; \
> > > > > > +			obj[i + 6] = ring[idx + 6]; \
> > > > > > +			obj[i + 7] = ring[idx + 7]; \
> > > > > > +		} \
> > > > > > +		switch (n & 0x7) { \
> > > > > > +		case 7: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		case 6: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		case 5: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		case 4: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		case 3: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		case 2: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		case 1: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		} \
> > > > > > +	} else { \
> > > > > > +		for (i = 0; idx < size; i++, idx++) \
> > > > > > +			obj[i] = ring[idx]; \
> > > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > > +			obj[i] = ring[idx]; \
> > > > > > +	} \
> > > > > > +} while (0)
> > > > > > +
> > > > > > > +#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
> > > > > > +	unsigned int i; \
> > > > > > +	uint32_t idx = cons_head & (r)->mask; \
> > > > > > +	const uint32_t size = (r)->size; \
> > > > > > +	uint64_t *ring = (uint64_t *)ring_start; \
> > > > > > +	uint64_t *obj = (uint64_t *)obj_table; \
> > > > > > +	if (likely(idx + n < size)) { \
> > > > > > > +		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
> > > > > > +			obj[i] = ring[idx]; \
> > > > > > +			obj[i + 1] = ring[idx + 1]; \
> > > > > > +			obj[i + 2] = ring[idx + 2]; \
> > > > > > +			obj[i + 3] = ring[idx + 3]; \
> > > > > > +		} \
> > > > > > +		switch (n & 0x3) { \
> > > > > > +		case 3: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		case 2: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		case 1: \
> > > > > > +			obj[i++] = ring[idx++]; \
> > > > > > +		} \
> > > > > > +	} else { \
> > > > > > +		for (i = 0; idx < size; i++, idx++) \
> > > > > > +			obj[i] = ring[idx]; \
> > > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > > +			obj[i] = ring[idx]; \
> > > > > > +	} \
> > > > > > +} while (0)
> > > > > > +
> > > > > > > +#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
> > > > > > +	unsigned int i; \
> > > > > > +	uint32_t idx = cons_head & (r)->mask; \
> > > > > > +	const uint32_t size = (r)->size; \
> > > > > > +	__uint128_t *ring = (__uint128_t *)ring_start; \
> > > > > > +	__uint128_t *obj = (__uint128_t *)obj_table; \
> > > > > > +	if (likely(idx + n < size)) { \
> > > > > > > +		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
> > > > > > +			obj[i] = ring[idx]; \
> > > > > > +			obj[i + 1] = ring[idx + 1]; \
> > > > > > +		} \
> > > > > > +		switch (n & 0x1) { \
> > > > > > +		case 1: \
> > > > > > +			obj[i++] = ring[idx++]; /* fallthrough */ \
> > > > > > +		} \
> > > > > > +	} else { \
> > > > > > +		for (i = 0; idx < size; i++, idx++) \
> > > > > > +			obj[i] = ring[idx]; \
> > > > > > +		for (idx = 0; i < n; i++, idx++) \
> > > > > > +			obj[i] = ring[idx]; \
> > > > > > +	} \
> > > > > > +} while (0)
> > > > > > +
> > > > > > > +/* Between load and load. there might be cpu reorder in weak model
> > > > > > > + * (powerpc/arm).
> > > > > > > + * There are 2 choices for the users
> > > > > > > + * 1.use rmb() memory barrier
> > > > > > > + * 2.use one-direction load_acquire/store_release barrier,defined by
> > > > > > > + * CONFIG_RTE_USE_C11_MEM_MODEL=y
> > > > > > > + * It depends on performance test results.
> > > > > > > + * By default, move common functions to rte_ring_generic.h
> > > > > > > + */
> > > > > > > +#ifdef RTE_USE_C11_MEM_MODEL
> > > > > > > +#include "rte_ring_c11_mem.h"
> > > > > > +#else
> > > > > > +#include "rte_ring_generic.h"
> > > > > > +#endif
> > > > > > +
> > > > > > +/**
> > > > > > + * @internal Enqueue several objects on the ring
> > > > > > + *
> > > > > > + * @param r
> > > > > > + *   A pointer to the ring structure.
> > > > > > + * @param obj_table
> > > > > > + *   A pointer to a table of void * pointers (objects).
> > > > > > + * @param esize
> > > > > > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > > > > > > + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> > > > > > > + *   as passed while creating the ring, otherwise the results are undefined.
> > > > > > > + * @param n
> > > > > > > + *   The number of objects to add in the ring from the obj_table.
> > > > > > > + * @param behavior
> > > > > > > + *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items from a ring
> > > > > > > + *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible from ring
> > > > > > > + * @param is_sp
> > > > > > > + *   Indicates whether to use single producer or multi-producer head update
> > > > > > > + * @param free_space
> > > > > > > + *   returns the amount of space after the enqueue operation has finished
> > > > > > + * @return
> > > > > > + *   Actual number of objects enqueued.
> > > > > > + *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
> > > > > > + */
> > > > > > +static __rte_always_inline unsigned int
> > > > > > > +__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
> > > > > > > +		unsigned int esize, unsigned int n,
> > > > > > > +		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
> > > > > > > +		unsigned int *free_space)
> > > >
> > > >
> > > > I like the idea to add esize as an argument to the public API, so
> > > > the compiler can do its job optimizing calls with constant esize.
> > > > Though I am not very happy with the rest of the implementation:
> > > > 1. It doesn't really provide configurable elem size - only 4/8/16B
> > > > elems are supported.
> > > Agree. I was thinking other sizes can be added on a need basis.
> > > However, I am wondering if we should just provide for 4B and then the
> > > users can use bulk operations to construct whatever they need?
> >
> > I suppose it could be plan B... if there would be no agreement on generic case.
> > And for 4B elems, I guess you do have a particular use-case?
> Yes
> 
> >
> > > It
> > > would mean extra work for the users.
> > >
> > > > 2. A lot of code duplication with these 3 copies of ENQUEUE/DEQUEUE
> > > > macros.
> > > >
> > > > Looking at ENQUEUE/DEQUEUE macros, I can see that main loop always
> > > > does 32B copy per iteration.
> > > Yes, I tried to keep it the same as the existing one (originally, I
> > > guess the intention was to allow for 256b vector instructions to be
> > > generated)
> > >
> > > > So I wonder if we can make a generic function that would do 32B copy per
> > > > iteration in a main loop, and copy the tail by 4B chunks?
> > > > That would avoid copy duplication and would allow the user to have any
> > > > elem size (multiple of 4B) they want.
> > > > Something like that (note I didn't test it, just a rough idea):
> > > >
> > > > static inline void
> > > > copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
> > > >         uint32_t esize)
> > > > {
> > > >         uint32_t i, sz;
> > > >
> > > >         sz = (num * esize) / sizeof(uint32_t);
> > > If 'num' is a compile time constant, 'sz' will be a compile time constant.
> > > Otherwise, this will result in a multiplication operation.
> >
> > Not always.
> > If esize is compile time constant, then for esize as power of 2 (4,8,16,...), it
> > would be just one shift.
> > For other constant values it could be a 'mul' or in many cases just 2 shifts plus
> > 'add' (if compiler is smart enough).
> > I.e. let's say for a 24B elem it would be either num * 6 or
> > (num << 2) + (num << 1).
> With num * 15 it has to be (num << 3) + (num << 2) + (num << 1) + num
> Not sure if the compiler will do this.

For 15, it can be just (num << 4) - num
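
To make the arithmetic above concrete, here is a small sketch of the
shift/add forms being discussed. The helper names are made up for
illustration only and are not part of the patch; whether a given compiler
actually emits these forms for a constant esize is not something this
sketch shows.

```c
#include <assert.h>
#include <stdint.h>

/* Number of 32-bit words occupied by 'num' elements of 'esize' bytes.
 * esize is assumed to be a multiple of 4 (hypothetical helper).
 */
static inline uint32_t
words_generic(uint32_t num, uint32_t esize)
{
	return (num * esize) / sizeof(uint32_t);
}

/* esize == 24: num * 24 / 4 == num * 6 == (num << 2) + (num << 1) */
static inline uint32_t
words_24b(uint32_t num)
{
	return (num << 2) + (num << 1);
}

/* Multiplication by 15 written as three shifts plus adds ... */
static inline uint32_t
mul15_adds(uint32_t num)
{
	return (num << 3) + (num << 2) + (num << 1) + num;
}

/* ... or as one shift and one subtraction. */
static inline uint32_t
mul15_sub(uint32_t num)
{
	return (num << 4) - num;
}

/* Sanity check that the rewritten forms agree with plain multiplication. */
static inline void
check_shift_forms(uint32_t num)
{
	assert(words_24b(num) == words_generic(num, 24));
	assert(mul15_adds(num) == num * 15);
	assert(mul15_sub(num) == num * 15);
}
```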

> 
> > I suppose for non-power of 2 elems it might be ok to get such small perf hit.
> Agree, should be ok not to focus on right now.
> 
> >
> > > I have tried
> > > to avoid the multiplication operation and try to use shift and mask
> > > operations (just like how the rest of the ring code does).
> > >
> > > >
> > > >         for (i = 0; i < (sz & ~7); i += 8)
> > > >                 memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> > > I had used memcpy to start with (for the entire copy operation),
> > > performance is not the same for 64b elements when compared with the
> > > existing ring APIs (some cases more and some cases less).
> >
> > I remember that from one of your previous mails, that's why here I suggest to
> > use memcpy() with a fixed size in a loop.
> > That way for each iteration the compiler will replace memcpy() with instructions
> > to copy 32B in a way it thinks is optimal (same as for original macro, I think).
> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results are as follows. The numbers in brackets are with the code on master.
> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> 
> RTE>>ring_perf_elem_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 5
> MP/MC single enq/dequeue: 40 (35)
> SP/SC burst enq/dequeue (size: 8): 2
> MP/MC burst enq/dequeue (size: 8): 6
> SP/SC burst enq/dequeue (size: 32): 1 (2)
> MP/MC burst enq/dequeue (size: 32): 2
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 2.11
> MC empty dequeue: 1.41 (2.11)
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> 
> ### Testing using two NUMA nodes ###
> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> 
> On one of the Arm platforms
> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)

So it shows better numbers for one core, but worse on 2, right?

 
> On another Arm platform, all numbers are same or slightly better.
> 
> I can post the patch with this change if you want to run some benchmarks on your platform.

Sure, please do.
I'll try to run on my boxes.

> I have not used the same code you have suggested, instead I have used the same logic in a single macro with memcpy.
> 



* [dpdk-dev] [PATCH v5 0/3] lib/ring: APIs to support custom element size
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (9 preceding siblings ...)
  2019-10-09  2:47   ` [dpdk-dev] [PATCH v3 0/2] lib/ring: APIs to support custom element size Honnappa Nagarahalli
@ 2019-10-17 20:08   ` Honnappa Nagarahalli
  2019-10-17 20:08     ` [dpdk-dev] [PATCH v5 1/3] lib/ring: apis to support configurable " Honnappa Nagarahalli
                       ` (2 more replies)
  2019-10-21  0:22   ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
                     ` (4 subsequent siblings)
  15 siblings, 3 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-17 20:08 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

The current rte_ring hard-codes the type of the ring element to 'void *',
hence the size of the element is hard-coded to 32b/64b. Since the ring
element type is not an input to rte_ring APIs, it results in a couple
of issues:

1) If an application needs to store an element which is not 64b, it
   needs to write its own ring APIs similar to rte_event_ring APIs. This
   creates an additional burden on the programmers, who end up making
   work-arounds and often waste memory.
2) If there are multiple libraries that store elements of the same
   type, currently they would have to write their own rte_ring APIs. This
   results in code duplication.

This patch adds new APIs to support configurable ring element size.
The APIs support custom element sizes by allowing the ring element
size to be defined as a multiple of 32b.
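
As a rough illustration of how the ring memory scales with the element
size, the sketch below mirrors the calculation in
rte_ring_get_memsize_elem() (header size plus count * esize, rounded up to
a cache line). It is a standalone approximation: SKETCH_HDR_SIZE stands in
for sizeof(struct rte_ring), SKETCH_CACHE_LINE for RTE_CACHE_LINE_SIZE,
and sketch_ring_memsize() is a hypothetical name. The RTE_RING_SZ_MASK
upper limit is not modelled, and unlike 1/3 (which accepts only 4, 8 and
16) the sketch accepts any multiple of 4.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define SKETCH_CACHE_LINE 64u  /* assumed cache line size */
#define SKETCH_HDR_SIZE   384u /* placeholder for sizeof(struct rte_ring) */

/* Bytes needed for a ring of 'count' elements of 'esize' bytes each,
 * or -EINVAL if the arguments are not acceptable.
 */
static long
sketch_ring_memsize(unsigned int count, unsigned int esize)
{
	size_t sz;

	/* element size must be a non-zero multiple of 4 bytes (32b) */
	if (esize == 0 || esize % 4 != 0)
		return -EINVAL;

	/* count must be a non-zero power of 2 */
	if (count == 0 || (count & (count - 1)) != 0)
		return -EINVAL;

	sz = SKETCH_HDR_SIZE + (size_t)count * esize;

	/* round up to a multiple of the cache line size */
	sz = (sz + SKETCH_CACHE_LINE - 1) & ~((size_t)SKETCH_CACHE_LINE - 1);
	assert(sz % SKETCH_CACHE_LINE == 0);

	return (long)sz;
}
```

With these placeholder values, a 1024-entry ring of 4B elements needs a
quarter of the element storage of the same ring holding 16B elements.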

The aim is to achieve the same performance as the existing ring
implementation. The patch adds the same performance tests that are run
for existing APIs. This allows for performance comparison.

I also tested with memcpy. x86 shows significant improvements on bulk
and burst tests. On the Arm platform I used, there is a drop of
4% to 6% in a few tests. Maybe this is something that we can explore
later.
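
For readers who want the idea behind the memcpy variant without reading
3/3, a minimal sketch follows. It is not the code in the patch: the
function names are hypothetical, and it only shows the two ingredients
discussed on the list, namely a fixed-size 32B memcpy() in the main loop
with a 4B tail, and a two-part copy when the enqueue wraps past the end
of the ring.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy 'nwords' 32-bit words. The main loop uses a constant-size 32B
 * memcpy() so the compiler is free to pick the instructions; the
 * remainder is copied 4B at a time.
 */
static inline void
sketch_copy_words(uint32_t *dst, const uint32_t *src, uint32_t nwords)
{
	uint32_t i;

	for (i = 0; i < (nwords & ~(uint32_t)7); i += 8)
		memcpy(dst + i, src + i, 8 * sizeof(uint32_t));
	for (; i < nwords; i++)
		dst[i] = src[i];
}

/* Copy 'n' elements of 'esize' bytes from 'objs' into a ring with 'size'
 * slots (power of 2), starting at slot (prod_head & (size - 1)).
 * esize must be a multiple of 4; the ring storage is viewed as 32-bit
 * words.
 */
static inline void
sketch_enqueue_copy(uint32_t *ring, uint32_t size, uint32_t prod_head,
		const uint32_t *objs, uint32_t esize, uint32_t n)
{
	const uint32_t wpe = esize / sizeof(uint32_t); /* words per element */
	const uint32_t idx = prod_head & (size - 1);
	uint32_t first;

	assert(esize != 0 && esize % 4 == 0 && n <= size);

	/* elements that fit before the end of the ring storage */
	first = (idx + n <= size) ? n : size - idx;

	sketch_copy_words(ring + (size_t)idx * wpe, objs, first * wpe);
	/* wrap around: the remaining elements go to the start of the ring */
	sketch_copy_words(ring, objs + (size_t)first * wpe,
			(n - first) * wpe);
}
```

The dequeue side would be the same with source and destination swapped.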

Note that this version skips changes to other libraries as I would
like to get an agreement on the implementation from the community.
They will be added once there is agreement on the rte_ring changes.

v5
 - Use memcpy for chunks of 32B (Konstantin).
 - Both 'ring_perf_autotest' and 'ring_perf_elem_autotest' are available
   to compare the results easily.
 - Copying without memcpy is also available in 1/3, if anyone wants to
   experiment on their platform.
 - Added other platform owners to test on their respective platforms.

v4
 - Few fixes after more performance testing

v3
 - Removed macro-fest and used inline functions
   (Stephen, Bruce)

v2
 - Change Event Ring implementation to use ring templates
   (Jerin, Pavan)

Honnappa Nagarahalli (3):
  lib/ring: apis to support configurable element size
  test/ring: add test cases for configurable element size ring
  lib/ring: copy ring elements using memcpy partially

 app/test/Makefile                    |   1 +
 app/test/meson.build                 |   1 +
 app/test/test_ring_perf_elem.c       | 419 ++++++++++++++
 lib/librte_ring/Makefile             |   3 +-
 lib/librte_ring/meson.build          |   3 +
 lib/librte_ring/rte_ring.c           |  45 +-
 lib/librte_ring/rte_ring.h           |   1 +
 lib/librte_ring/rte_ring_elem.h      | 805 +++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   2 +
 9 files changed, 1271 insertions(+), 9 deletions(-)
 create mode 100644 app/test/test_ring_perf_elem.c
 create mode 100644 lib/librte_ring/rte_ring_elem.h

-- 
2.17.1



* [dpdk-dev] [PATCH v5 1/3] lib/ring: apis to support configurable element size
  2019-10-17 20:08   ` [dpdk-dev] [PATCH v5 0/3] lib/ring: APIs to support custom element size Honnappa Nagarahalli
@ 2019-10-17 20:08     ` Honnappa Nagarahalli
  2019-10-17 20:39       ` Stephen Hemminger
  2019-10-17 20:40       ` Stephen Hemminger
  2019-10-17 20:08     ` [dpdk-dev] [PATCH v5 2/3] test/ring: add test cases for configurable element size ring Honnappa Nagarahalli
  2019-10-17 20:08     ` [dpdk-dev] [PATCH v5 3/3] lib/ring: copy ring elements using memcpy partially Honnappa Nagarahalli
  2 siblings, 2 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-17 20:08 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Current APIs assume ring elements to be pointers. However, in many
use cases, the size can be different. Add new APIs to support
configurable ring element sizes.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_ring/Makefile             |   3 +-
 lib/librte_ring/meson.build          |   3 +
 lib/librte_ring/rte_ring.c           |  45 +-
 lib/librte_ring/rte_ring.h           |   1 +
 lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   2 +
 6 files changed, 991 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_elem.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 21a36770d..515a967bb 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_ring.a
 
-CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal
 
 EXPORT_MAP := rte_ring_version.map
@@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+					rte_ring_elem.h \
 					rte_ring_generic.h \
 					rte_ring_c11_mem.h
 
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index ab8b0b469..74219840a 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -6,3 +6,6 @@ sources = files('rte_ring.c')
 headers = files('rte_ring.h',
 		'rte_ring_c11_mem.h',
 		'rte_ring_generic.h')
+
+# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
+allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d9b308036..6fed3648b 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -33,6 +33,7 @@
 #include <rte_tailq.h>
 
 #include "rte_ring.h"
+#include "rte_ring_elem.h"
 
 TAILQ_HEAD(rte_ring_list, rte_tailq_entry);
 
@@ -46,23 +47,42 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_elem(unsigned count, unsigned esize)
 {
 	ssize_t sz;
 
+	/* Supported esize values are 4/8/16.
+	 * Others can be added on need basis.
+	 */
+	if ((esize != 4) && (esize != 8) && (esize != 16)) {
+		RTE_LOG(ERR, RING,
+			"Unsupported esize value. Supported values are 4, 8 and 16\n");
+
+		return -EINVAL;
+	}
+
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
 		RTE_LOG(ERR, RING,
-			"Requested size is invalid, must be power of 2, and "
-			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
+			"Requested number of elements is invalid, must be "
+			"power of 2, and do not exceed the limit %u\n",
+			RTE_RING_SZ_MASK);
+
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	sz = sizeof(struct rte_ring) + count * esize;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
 
+/* return the size of memory occupied by a ring */
+ssize_t
+rte_ring_get_memsize(unsigned count)
+{
+	return rte_ring_get_memsize_elem(count, sizeof(void *));
+}
+
 void
 rte_ring_reset(struct rte_ring *r)
 {
@@ -114,10 +134,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	return 0;
 }
 
-/* create the ring */
+/* create the ring for a given element size */
 struct rte_ring *
-rte_ring_create(const char *name, unsigned count, int socket_id,
-		unsigned flags)
+rte_ring_create_elem(const char *name, unsigned count, unsigned esize,
+		int socket_id, unsigned flags)
 {
 	char mz_name[RTE_MEMZONE_NAMESIZE];
 	struct rte_ring *r;
@@ -135,7 +155,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize_elem(count, esize);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -182,6 +202,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	return r;
 }
 
+/* create the ring */
+struct rte_ring *
+rte_ring_create(const char *name, unsigned count, int socket_id,
+		unsigned flags)
+{
+	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
+		flags);
+}
+
 /* free the ring */
 void
 rte_ring_free(struct rte_ring *r)
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 2a9f768a1..18fc5d845 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -216,6 +216,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  */
 struct rte_ring *rte_ring_create(const char *name, unsigned count,
 				 int socket_id, unsigned flags);
+
 /**
  * De-allocate all memory used by the ring.
  *
diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
new file mode 100644
index 000000000..860f059ad
--- /dev/null
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -0,0 +1,946 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 2019 Arm Limited
+ * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * All rights reserved.
+ * Derived from FreeBSD's bufring.h
+ * Used as BSD-3 Licensed with permission from Kip Macy.
+ */
+
+#ifndef _RTE_RING_ELEM_H_
+#define _RTE_RING_ELEM_H_
+
+/**
+ * @file
+ * RTE Ring with flexible element size
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#include "rte_ring.h"
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Calculate the memory size needed for a ring with given element size
+ *
+ * This function returns the number of bytes needed for a ring, given
+ * the number of elements in it and the size of the element. This value
+ * is the sum of the size of the structure rte_ring and the size of the
+ * memory needed for storing the elements. The value is aligned to a cache
+ * line size.
+ *
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported.
+ * @return
+ *   - The memory size needed for the ring on success.
+ *   - -EINVAL if count is not a power of 2 or esize is not supported.
+ */
+__rte_experimental
+ssize_t rte_ring_get_memsize_elem(unsigned count, unsigned esize);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Create a new ring named *name* that stores elements with given size.
+ *
+ * This function uses ``memzone_reserve()`` to allocate memory. Then it
+ * calls rte_ring_init() to initialize an empty ring.
+ *
+ * The new ring size is set to *count*, which must be a power of
+ * two. Water marking is disabled by default. The real usable ring size
+ * is *count-1* instead of *count* to differentiate a full ring from an
+ * empty ring.
+ *
+ * The ring is added in RTE_TAILQ_RING list.
+ *
+ * @param name
+ *   The name of the ring.
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported.
+ * @param socket_id
+ *   The *socket_id* argument is the socket identifier in case of
+ *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
+ *   constraint for the reserved zone.
+ * @param flags
+ *   An OR of the following:
+ *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
+ *      is "single-producer". Otherwise, it is "multi-producers".
+ *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
+ *      is "single-consumer". Otherwise, it is "multi-consumers".
+ * @return
+ *   On success, the pointer to the new allocated ring. NULL on error with
+ *    rte_errno set appropriately. Possible errno values include:
+ *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
+ *    - E_RTE_SECONDARY - function was called from a secondary process instance
+ *    - EINVAL - count provided is not a power of 2
+ *    - ENOSPC - the maximum number of memzones has already been allocated
+ *    - EEXIST - a memzone with the same name already exists
+ *    - ENOMEM - no appropriate memory area found in which to create memzone
+ */
+__rte_experimental
+struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
+				unsigned esize, int socket_id, unsigned flags);
+
+/* the actual enqueue of pointers on the ring.
+ * Placed here since identical code needed in both
+ * single and multi producer enqueue functions.
+ */
+#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
+	if (esize == 4) \
+		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
+	else if (esize == 8) \
+		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
+	else if (esize == 16) \
+		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
+} while (0)
+
+#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	uint32_t *ring = (uint32_t *)ring_start; \
+	uint32_t *obj = (uint32_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+			ring[idx + 2] = obj[i + 2]; \
+			ring[idx + 3] = obj[i + 3]; \
+			ring[idx + 4] = obj[i + 4]; \
+			ring[idx + 5] = obj[i + 5]; \
+			ring[idx + 6] = obj[i + 6]; \
+			ring[idx + 7] = obj[i + 7]; \
+		} \
+		switch (n & 0x7) { \
+		case 7: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 6: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 5: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 4: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 3: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	uint64_t *ring = (uint64_t *)ring_start; \
+	uint64_t *obj = (uint64_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+			ring[idx + 2] = obj[i + 2]; \
+			ring[idx + 3] = obj[i + 3]; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	__uint128_t *ring = (__uint128_t *)ring_start; \
+	__uint128_t *obj = (__uint128_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+		} \
+		switch (n & 0x1) { \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+/* the actual copy of pointers on the ring to obj_table.
+ * Placed here since identical code needed in both
+ * single and multi consumer dequeue functions.
+ */
+#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
+	if (esize == 4) \
+		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
+	else if (esize == 8) \
+		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
+	else if (esize == 16) \
+		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
+} while (0)
+
+#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	uint32_t *ring = (uint32_t *)ring_start; \
+	uint32_t *obj = (uint32_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8) {\
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+			obj[i + 2] = ring[idx + 2]; \
+			obj[i + 3] = ring[idx + 3]; \
+			obj[i + 4] = ring[idx + 4]; \
+			obj[i + 5] = ring[idx + 5]; \
+			obj[i + 6] = ring[idx + 6]; \
+			obj[i + 7] = ring[idx + 7]; \
+		} \
+		switch (n & 0x7) { \
+		case 7: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 6: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 5: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 4: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 3: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 2: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 1: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	uint64_t *ring = (uint64_t *)ring_start; \
+	uint64_t *obj = (uint64_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+			obj[i + 2] = ring[idx + 2]; \
+			obj[i + 3] = ring[idx + 3]; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 2: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 1: \
+			obj[i++] = ring[idx++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	__uint128_t *ring = (__uint128_t *)ring_start; \
+	__uint128_t *obj = (__uint128_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(unsigned)0x1)); i += 2, idx += 2) { \
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+		} \
+		switch (n & 0x1) { \
+		case 1: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+/* Between two loads, the CPU may reorder accesses on weakly ordered
+ * architectures (powerpc/arm).
+ * There are 2 choices for the users:
+ * 1. use the rmb() memory barrier
+ * 2. use one-direction load_acquire/store_release barriers, enabled by
+ *    CONFIG_RTE_USE_C11_MEM_MODEL=y
+ * The choice depends on performance test results.
+ * By default, the common functions are taken from rte_ring_generic.h
+ */
+#ifdef RTE_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
+#include "rte_ring_generic.h"
+#endif
+
+/**
+ * @internal Enqueue several objects on the ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items to the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible to the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of space in the ring after the enqueue operation
+ *   has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
+		unsigned int *free_space)
+{
+	uint32_t prod_head, prod_next;
+	uint32_t free_entries;
+
+	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
+			&prod_head, &prod_next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
+
+	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal Dequeue several objects from the ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param is_sc
+ *   Indicates whether to use single consumer or multi-consumer head update
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sc,
+		unsigned int *available)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
+			&cons_head, &cons_next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS_ELEM(r, &r[1], cons_head, obj_table, esize, n);
+
+	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_mp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_sp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version depending on the default behavior that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
+}
+
+/**
+ * Enqueue one object on a ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_mp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_mp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_sp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_sp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_mc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+				RTE_RING_QUEUE_FIXED, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table,
+ *   must be strictly positive.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_sc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_SC, available);
+}
+
+/**
+ * Dequeue several objects from a ring.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, r->cons.single, available);
+}
+
+/**
+ * Dequeue one object from a ring (multi-consumers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to the object that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue; no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_mc_dequeue_elem(struct rte_ring *r, void *obj_p,
+				unsigned int esize)
+{
+	return rte_ring_mc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL)  ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring (NOT multi-consumers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to the object that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_sc_dequeue_elem(struct rte_ring *r, void *obj_p,
+				unsigned int esize)
+{
+	return rte_ring_sc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to the object that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success, objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_dequeue_elem(struct rte_ring *r, void *obj_p, unsigned int esize)
+{
+	return rte_ring_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_mp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_sp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version depending on the default behavior that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe). When more
+ * objects are requested than are available, only the available objects are
+ * dequeued.
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - n: Actual number of objects dequeued, 0 if ring is empty
+ */
+static __rte_always_inline unsigned
+rte_ring_mc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe). When more
+ * objects are requested than are available, only the available objects are
+ * dequeued.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - n: Actual number of objects dequeued, 0 if ring is empty
+ */
+static __rte_always_inline unsigned
+rte_ring_sc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+}
+
+/**
+ * Dequeue multiple objects from a ring up to a maximum number.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - Number of objects dequeued
+ */
+static __rte_always_inline unsigned
+rte_ring_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+				RTE_RING_QUEUE_VARIABLE,
+				r->cons.single, available);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_RING_ELEM_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index 510c1386e..e410a7503 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -21,6 +21,8 @@ DPDK_2.2 {
 EXPERIMENTAL {
 	global:
 
+	rte_ring_create_elem;
+	rte_ring_get_memsize_elem;
 	rte_ring_reset;
 
 };
-- 
2.17.1
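
A note on the unrolled copy in DEQUEUE_PTRS_128: when a loop is unrolled by
two, its bound has to be n rounded down to an even count (n & ~1), with the odd
element handled afterwards; a bound of n >> 1 combined with a stride of 2 would
stop after roughly half the elements. The standalone sketch below is not DPDK
code. It models the same two-branch copy (contiguous vs. wrap-around) for a
hypothetical 16-byte struct elem128, and the name copy_out_128 is an assumption
made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 16-byte element; stands in for __uint128_t so the sketch
 * also builds on compilers without that extension.
 */
struct elem128 {
	uint64_t lo;
	uint64_t hi;
};

/* Copy n elements out of a power-of-two sized ring, starting at the
 * free-running index head. ring_size plays the role of (r)->size and
 * ring_size - 1 the role of (r)->mask in the patch.
 */
static void
copy_out_128(const struct elem128 *ring, uint32_t ring_size, uint32_t head,
		struct elem128 *obj, unsigned int n)
{
	unsigned int i;
	uint32_t idx = head & (ring_size - 1);

	assert((ring_size & (ring_size - 1)) == 0);
	assert(n <= ring_size);

	if (idx + n < ring_size) {
		/* Unrolled by 2: iterate up to n rounded down to even. */
		for (i = 0; i < (n & ~1u); i += 2, idx += 2) {
			obj[i] = ring[idx];
			obj[i + 1] = ring[idx + 1];
		}
		/* Copy the one leftover element when n is odd. */
		if (n & 1u)
			obj[i++] = ring[idx++];
	} else {
		/* Wrap-around: copy up to the end, then from slot 0. */
		for (i = 0; idx < ring_size; i++, idx++)
			obj[i] = ring[idx];
		for (idx = 0; i < n; i++, idx++)
			obj[i] = ring[idx];
	}
}
```

With an 8-slot ring, copy_out_128(ring, 8, 6, out, 5) takes the wrap-around
branch and reads slots 6, 7, 0, 1, 2 in that order.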



* [dpdk-dev] [PATCH v5 2/3] test/ring: add test cases for configurable element size ring
  2019-10-17 20:08   ` [dpdk-dev] [PATCH v5 0/3] lib/ring: APIs to support custom element size Honnappa Nagarahalli
  2019-10-17 20:08     ` [dpdk-dev] [PATCH v5 1/3] lib/ring: apis to support configurable " Honnappa Nagarahalli
@ 2019-10-17 20:08     ` Honnappa Nagarahalli
  2019-10-17 20:08     ` [dpdk-dev] [PATCH v5 3/3] lib/ring: copy ring elements using memcpy partially Honnappa Nagarahalli
  2 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-17 20:08 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Add test cases to test APIs for configurable element size ring.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
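
For readers of the test below: the bulk calls use RTE_RING_QUEUE_FIXED and
return either 0 or n, while the burst calls use RTE_RING_QUEUE_VARIABLE and
return however many objects fit. A minimal single-threaded sketch of that
contract follows. It is a toy model, not the rte_ring implementation: the
toy_* names, the 8-slot uint64_t storage and the "slots - 1" capacity are
assumptions for illustration, and there is no head/tail synchronisation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8 /* hypothetical ring size, must be a power of two */

enum toy_behavior { TOY_FIXED, TOY_VARIABLE };

struct toy_ring {
	uint32_t mask;     /* TOY_SLOTS - 1 */
	uint32_t capacity; /* usable slots */
	uint32_t prod;     /* free-running producer index */
	uint32_t cons;     /* free-running consumer index */
	uint64_t slots[TOY_SLOTS];
};

static void
toy_init(struct toy_ring *r)
{
	r->mask = TOY_SLOTS - 1;
	r->capacity = TOY_SLOTS - 1;
	r->prod = 0;
	r->cons = 0;
}

static unsigned int
toy_enqueue(struct toy_ring *r, const uint64_t *objs, unsigned int n,
		enum toy_behavior behavior, unsigned int *free_space)
{
	uint32_t free_entries = r->capacity + r->cons - r->prod;
	unsigned int i;

	/* FIXED: all or nothing. VARIABLE: clamp to what fits. */
	if (n > free_entries)
		n = (behavior == TOY_FIXED) ? 0 : free_entries;
	for (i = 0; i < n; i++)
		r->slots[(r->prod + i) & r->mask] = objs[i];
	r->prod += n;
	if (free_space != NULL)
		*free_space = free_entries - n;
	return n;
}

static unsigned int
toy_dequeue(struct toy_ring *r, uint64_t *objs, unsigned int n,
		enum toy_behavior behavior, unsigned int *available)
{
	uint32_t entries = r->prod - r->cons;
	unsigned int i;

	if (n > entries)
		n = (behavior == TOY_FIXED) ? 0 : entries;
	for (i = 0; i < n; i++)
		objs[i] = r->slots[(r->cons + i) & r->mask];
	r->cons += n;
	if (available != NULL)
		*available = entries - n;
	return n;
}

/* Single-object wrapper: maps the 0-or-1 bulk result to 0 / -ENOBUFS. */
static int
toy_enqueue_one(struct toy_ring *r, uint64_t obj)
{
	return toy_enqueue(r, &obj, 1, TOY_FIXED, NULL) ? 0 : -ENOBUFS;
}
```

In this model a FIXED request for 4 objects with only 2 free slots enqueues
nothing and reports 2 free slots, whereas the same request with VARIABLE
enqueues 2 and reports 0.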
 app/test/Makefile              |   1 +
 app/test/meson.build           |   1 +
 app/test/test_ring_perf_elem.c | 419 +++++++++++++++++++++++++++++++++
 3 files changed, 421 insertions(+)
 create mode 100644 app/test/test_ring_perf_elem.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 26ba6fe2b..e5cb27b75 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -78,6 +78,7 @@ SRCS-y += test_rand_perf.c
 
 SRCS-y += test_ring.c
 SRCS-y += test_ring_perf.c
+SRCS-y += test_ring_perf_elem.c
 SRCS-y += test_pmd_perf.c
 
 ifeq ($(CONFIG_RTE_LIBRTE_TABLE),y)
diff --git a/app/test/meson.build b/app/test/meson.build
index ec40943bd..995ee9bc7 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -101,6 +101,7 @@ test_sources = files('commands.c',
 	'test_reorder.c',
 	'test_ring.c',
 	'test_ring_perf.c',
+	'test_ring_perf_elem.c',
 	'test_rwlock.c',
 	'test_sched.c',
 	'test_service_cores.c',
diff --git a/app/test/test_ring_perf_elem.c b/app/test/test_ring_perf_elem.c
new file mode 100644
index 000000000..fc5b82d71
--- /dev/null
+++ b/app/test/test_ring_perf_elem.c
@@ -0,0 +1,419 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+
+#include <stdio.h>
+#include <inttypes.h>
+#include <rte_ring.h>
+#include <rte_ring_elem.h>
+#include <rte_cycles.h>
+#include <rte_launch.h>
+#include <rte_pause.h>
+
+#include "test.h"
+
+/*
+ * Ring
+ * ====
+ *
+ * Measures performance of various operations using rdtsc
+ *  * Empty ring dequeue
+ *  * Enqueue/dequeue of bursts in 1 thread
+ *  * Enqueue/dequeue of bursts in 2 threads
+ */
+
+#define RING_NAME "RING_PERF"
+#define RING_SIZE 4096
+#define MAX_BURST 64
+
+/*
+ * the sizes to enqueue and dequeue in testing
+ * (marked volatile so they won't be seen as compile-time constants)
+ */
+static const volatile unsigned bulk_sizes[] = { 8, 32 };
+
+struct lcore_pair {
+	unsigned c1, c2;
+};
+
+static volatile unsigned lcore_count;
+
+/**** Functions to analyse our core mask to get cores for different tests ***/
+
+static int
+get_two_hyperthreads(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned c1, c2, s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		/* inner loop just re-reads all id's. We could skip the
+		 * first few elements, but since number of cores is small
+		 * there is little point
+		 */
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+
+			c1 = rte_lcore_to_cpu_id(id1);
+			c2 = rte_lcore_to_cpu_id(id2);
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if ((c1 == c2) && (s1 == s2)) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+static int
+get_two_cores(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned c1, c2, s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+
+			c1 = rte_lcore_to_cpu_id(id1);
+			c2 = rte_lcore_to_cpu_id(id2);
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if ((c1 != c2) && (s1 == s2)) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+static int
+get_two_sockets(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if (s1 != s2) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+/* Get cycle counts for dequeuing from an empty ring. Should be 2 or 3 cycles */
+static void
+test_empty_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 26;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned i = 0;
+	uint32_t burst[MAX_BURST];
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		rte_ring_sc_dequeue_bulk_elem(r, burst, 8, bulk_sizes[0], NULL);
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		rte_ring_mc_dequeue_bulk_elem(r, burst, 8, bulk_sizes[0], NULL);
+	const uint64_t mc_end = rte_rdtsc();
+
+	printf("SC empty dequeue: %.2F\n",
+			(double)(sc_end-sc_start) / iterations);
+	printf("MC empty dequeue: %.2F\n",
+			(double)(mc_end-mc_start) / iterations);
+}
+
+/*
+ * The separate enqueue and dequeue threads each take one input and return two
+ * outputs. Input = burst size, output = cycle averages for sp/sc and mp/mc
+ */
+struct thread_params {
+	struct rte_ring *r;
+	unsigned size;        /* input value, the burst size */
+	double spsc, mpmc;    /* output value, the single or multi timings */
+};
+
+/*
+ * Function that uses rdtsc to measure timing for ring enqueue. Needs a paired
+ * thread running the dequeue_bulk function
+ */
+static int
+enqueue_bulk(void *p)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	struct thread_params *params = p;
+	struct rte_ring *r = params->r;
+	const unsigned size = params->size;
+	unsigned i;
+	uint32_t burst[MAX_BURST] = {0};
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
+		while (lcore_count != 2)
+			rte_pause();
+
+	const uint64_t sp_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_sp_enqueue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t sp_end = rte_rdtsc();
+
+	const uint64_t mp_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_mp_enqueue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t mp_end = rte_rdtsc();
+
+	params->spsc = ((double)(sp_end - sp_start))/(iterations*size);
+	params->mpmc = ((double)(mp_end - mp_start))/(iterations*size);
+	return 0;
+}
+
+/*
+ * Function that uses rdtsc to measure timing for ring dequeue. Needs a paired
+ * thread running the enqueue_bulk function
+ */
+static int
+dequeue_bulk(void *p)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	struct thread_params *params = p;
+	struct rte_ring *r = params->r;
+	const unsigned size = params->size;
+	unsigned i;
+	uint32_t burst[MAX_BURST] = {0};
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
+		while (lcore_count != 2)
+			rte_pause();
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_sc_dequeue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_mc_dequeue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t mc_end = rte_rdtsc();
+
+	params->spsc = ((double)(sc_end - sc_start))/(iterations*size);
+	params->mpmc = ((double)(mc_end - mc_start))/(iterations*size);
+	return 0;
+}
+
+/*
+ * Function that calls the enqueue and dequeue bulk functions on pairs of cores.
+ * Used to measure ring perf between hyperthreads, cores and sockets.
+ */
+static void
+run_on_core_pair(struct lcore_pair *cores, struct rte_ring *r,
+		lcore_function_t f1, lcore_function_t f2)
+{
+	struct thread_params param1 = {0}, param2 = {0};
+	unsigned i;
+	for (i = 0; i < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); i++) {
+		lcore_count = 0;
+		param1.size = param2.size = bulk_sizes[i];
+		param1.r = param2.r = r;
+		if (cores->c1 == rte_get_master_lcore()) {
+			rte_eal_remote_launch(f2, &param2, cores->c2);
+			f1(&param1);
+			rte_eal_wait_lcore(cores->c2);
+		} else {
+			rte_eal_remote_launch(f1, &param1, cores->c1);
+			rte_eal_remote_launch(f2, &param2, cores->c2);
+			rte_eal_wait_lcore(cores->c1);
+			rte_eal_wait_lcore(cores->c2);
+		}
+		printf("SP/SC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[i], param1.spsc + param2.spsc);
+		printf("MP/MC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[i], param1.mpmc + param2.mpmc);
+	}
+}
+
+/*
+ * Test function that determines how long an enqueue + dequeue of a single item
+ * takes on a single lcore. Result is for comparison with the bulk enq+deq.
+ */
+static void
+test_single_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 24;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned i = 0;
+	uint32_t burst[2];
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++) {
+		rte_ring_sp_enqueue_elem(r, burst, 8);
+		rte_ring_sc_dequeue_elem(r, burst, 8);
+	}
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++) {
+		rte_ring_mp_enqueue_elem(r, burst, 8);
+		rte_ring_mc_dequeue_elem(r, burst, 8);
+	}
+	const uint64_t mc_end = rte_rdtsc();
+
+	printf("SP/SC single enq/dequeue: %"PRIu64"\n",
+			(sc_end-sc_start) >> iter_shift);
+	printf("MP/MC single enq/dequeue: %"PRIu64"\n",
+			(mc_end-mc_start) >> iter_shift);
+}
+
+/*
+ * Test that does both enqueue and dequeue on a core using the burst() API calls
+ * instead of the bulk() calls used in other tests. Results should be the same
+ * as for the bulk function called on a single lcore.
+ */
+static void
+test_burst_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned sz, i = 0;
+	uint32_t burst[MAX_BURST] = {0};
+
+	for (sz = 0; sz < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); sz++) {
+		const uint64_t sc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_sp_enqueue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_sc_dequeue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t sc_end = rte_rdtsc();
+
+		const uint64_t mc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_mp_enqueue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_mc_dequeue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t mc_end = rte_rdtsc();
+
+		uint64_t mc_avg = ((mc_end-mc_start) >> iter_shift) /
+					bulk_sizes[sz];
+		uint64_t sc_avg = ((sc_end-sc_start) >> iter_shift) /
+					bulk_sizes[sz];
+
+		printf("SP/SC burst enq/dequeue (size: %u): %"PRIu64"\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC burst enq/dequeue (size: %u): %"PRIu64"\n",
+				bulk_sizes[sz], mc_avg);
+	}
+}
+
+/* Times enqueue and dequeue on a single lcore */
+static void
+test_bulk_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned sz, i = 0;
+	uint32_t burst[MAX_BURST] = {0};
+
+	for (sz = 0; sz < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); sz++) {
+		const uint64_t sc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_sp_enqueue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_sc_dequeue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t sc_end = rte_rdtsc();
+
+		const uint64_t mc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_mp_enqueue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_mc_dequeue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t mc_end = rte_rdtsc();
+
+		double sc_avg = ((double)(sc_end-sc_start) /
+				(iterations * bulk_sizes[sz]));
+		double mc_avg = ((double)(mc_end-mc_start) /
+				(iterations * bulk_sizes[sz]));
+
+		printf("SP/SC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], mc_avg);
+	}
+}
+
+static int
+test_ring_perf_elem(void)
+{
+	struct lcore_pair cores;
+	struct rte_ring *r = NULL;
+
+	r = rte_ring_create_elem(RING_NAME, RING_SIZE, 8, rte_socket_id(), 0);
+	if (r == NULL)
+		return -1;
+
+	printf("### Testing single element and burst enq/deq ###\n");
+	test_single_enqueue_dequeue(r);
+	test_burst_enqueue_dequeue(r);
+
+	printf("\n### Testing empty dequeue ###\n");
+	test_empty_dequeue(r);
+
+	printf("\n### Testing using a single lcore ###\n");
+	test_bulk_enqueue_dequeue(r);
+
+	if (get_two_hyperthreads(&cores) == 0) {
+		printf("\n### Testing using two hyperthreads ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	if (get_two_cores(&cores) == 0) {
+		printf("\n### Testing using two physical cores ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	if (get_two_sockets(&cores) == 0) {
+		printf("\n### Testing using two NUMA nodes ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	rte_ring_free(r);
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(ring_perf_elem_autotest, test_ring_perf_elem);
-- 
2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* [dpdk-dev] [PATCH v5 3/3] lib/ring: copy ring elements using memcpy partially
  2019-10-17 20:08   ` [dpdk-dev] [PATCH v5 0/3] lib/ring: APIs to support custom element size Honnappa Nagarahalli
  2019-10-17 20:08     ` [dpdk-dev] [PATCH v5 1/3] lib/ring: apis to support configurable " Honnappa Nagarahalli
  2019-10-17 20:08     ` [dpdk-dev] [PATCH v5 2/3] test/ring: add test cases for configurable element size ring Honnappa Nagarahalli
@ 2019-10-17 20:08     ` Honnappa Nagarahalli
  2 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-17 20:08 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Copy of ring elements uses memcpy for 32B chunks. The remaining
bytes are copied using assignments.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
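Note for reviewers: the copy scheme described in the commit message
(fixed 32B memcpy chunks, then plain assignments for the tail) can be
sketched as a standalone function. This is only a minimal sketch and is
not the macro in this patch: copy_words() and enqueue_elems() are
hypothetical names. It assumes esize is a multiple of 4 and that idx, n
and ring_size are counted in elements, so the index is scaled by the
number of 32-bit words per element.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper, not part of this patch: copy 'num' 32-bit words,
 * 32 bytes at a time, then finish the tail with plain assignments.
 */
static inline void
copy_words(uint32_t *dst, const uint32_t *src, uint32_t num)
{
	uint32_t i;

	/* Fixed-size memcpy lets the compiler pick its own 32B copy sequence. */
	for (i = 0; i < (num & ~(uint32_t)0x7); i += 8)
		memcpy(dst + i, src + i, 8 * sizeof(uint32_t));

	/* Remaining 0..7 words. */
	for (; i < num; i++)
		dst[i] = src[i];
}

/*
 * Hypothetical enqueue-side copy. 'ring_size', 'idx' and 'n' are in
 * elements; 'esize' is the element size in bytes (multiple of 4).
 */
static inline void
enqueue_elems(uint32_t *ring, uint32_t ring_size, uint32_t idx,
		const void *obj_table, uint32_t esize, uint32_t n)
{
	const uint32_t *obj = obj_table;
	const uint32_t scale = esize / sizeof(uint32_t); /* words per element */

	assert(esize % sizeof(uint32_t) == 0);
	assert(idx < ring_size && n <= ring_size);

	if (idx + n <= ring_size) {
		copy_words(ring + idx * scale, obj, n * scale);
	} else {
		/* Wrap-around: fill to the end, then continue from slot 0. */
		const uint32_t first = ring_size - idx;

		copy_words(ring + idx * scale, obj, first * scale);
		copy_words(ring, obj + first * scale, (n - first) * scale);
	}
}
```

The dequeue side would be the mirror image, with source and destination
swapped.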
 lib/librte_ring/rte_ring_elem.h | 163 +++-----------------------------
 1 file changed, 11 insertions(+), 152 deletions(-)

diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
index 860f059ad..92e92f150 100644
--- a/lib/librte_ring/rte_ring_elem.h
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -24,6 +24,7 @@ extern "C" {
 #include <stdint.h>
 #include <sys/queue.h>
 #include <errno.h>
+#include <string.h>
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_memory.h>
@@ -108,35 +109,16 @@ __rte_experimental
 struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
 				unsigned esize, int socket_id, unsigned flags);
 
-/* the actual enqueue of pointers on the ring.
- * Placed here since identical code needed in both
- * single and multi producer enqueue functions.
- */
-#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
-	if (esize == 4) \
-		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
-	else if (esize == 8) \
-		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
-	else if (esize == 16) \
-		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
-} while (0)
-
-#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
+#define ENQUEUE_PTRS_GEN(r, ring_start, prod_head, obj_table, esize, n) do { \
 	unsigned int i; \
 	const uint32_t size = (r)->size; \
 	uint32_t idx = prod_head & (r)->mask; \
 	uint32_t *ring = (uint32_t *)ring_start; \
 	uint32_t *obj = (uint32_t *)obj_table; \
+	uint32_t sz = n * (esize / sizeof(uint32_t)); \
 	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n & ((~(unsigned)0x7))); i += 8, idx += 8) { \
-			ring[idx] = obj[i]; \
-			ring[idx + 1] = obj[i + 1]; \
-			ring[idx + 2] = obj[i + 2]; \
-			ring[idx + 3] = obj[i + 3]; \
-			ring[idx + 4] = obj[i + 4]; \
-			ring[idx + 5] = obj[i + 5]; \
-			ring[idx + 6] = obj[i + 6]; \
-			ring[idx + 7] = obj[i + 7]; \
+		for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
+			memcpy(ring + idx, obj + i, 8 * sizeof(uint32_t)); \
 		} \
 		switch (n & 0x7) { \
 		case 7: \
@@ -162,87 +144,16 @@ struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
 	} \
 } while (0)
 
-#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
-	unsigned int i; \
-	const uint32_t size = (r)->size; \
-	uint32_t idx = prod_head & (r)->mask; \
-	uint64_t *ring = (uint64_t *)ring_start; \
-	uint64_t *obj = (uint64_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n & ((~(unsigned)0x3))); i += 4, idx += 4) { \
-			ring[idx] = obj[i]; \
-			ring[idx + 1] = obj[i + 1]; \
-			ring[idx + 2] = obj[i + 2]; \
-			ring[idx + 3] = obj[i + 3]; \
-		} \
-		switch (n & 0x3) { \
-		case 3: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
-		case 2: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
-		case 1: \
-			ring[idx++] = obj[i++]; \
-		} \
-	} else { \
-		for (i = 0; idx < size; i++, idx++)\
-			ring[idx] = obj[i]; \
-		for (idx = 0; i < n; i++, idx++) \
-			ring[idx] = obj[i]; \
-	} \
-} while (0)
-
-#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
-	unsigned int i; \
-	const uint32_t size = (r)->size; \
-	uint32_t idx = prod_head & (r)->mask; \
-	__uint128_t *ring = (__uint128_t *)ring_start; \
-	__uint128_t *obj = (__uint128_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
-			ring[idx] = obj[i]; \
-			ring[idx + 1] = obj[i + 1]; \
-		} \
-		switch (n & 0x1) { \
-		case 1: \
-			ring[idx++] = obj[i++]; \
-		} \
-	} else { \
-		for (i = 0; idx < size; i++, idx++)\
-			ring[idx] = obj[i]; \
-		for (idx = 0; i < n; i++, idx++) \
-			ring[idx] = obj[i]; \
-	} \
-} while (0)
-
-/* the actual copy of pointers on the ring to obj_table.
- * Placed here since identical code needed in both
- * single and multi consumer dequeue functions.
- */
-#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
-	if (esize == 4) \
-		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
-	else if (esize == 8) \
-		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
-	else if (esize == 16) \
-		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
-} while (0)
-
-#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
+#define DEQUEUE_PTRS_GEN(r, ring_start, cons_head, obj_table, esize, n) do { \
 	unsigned int i; \
 	uint32_t idx = cons_head & (r)->mask; \
 	const uint32_t size = (r)->size; \
 	uint32_t *ring = (uint32_t *)ring_start; \
 	uint32_t *obj = (uint32_t *)obj_table; \
+	uint32_t sz = n * (esize / sizeof(uint32_t)); \
 	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n & (~(unsigned)0x7)); i += 8, idx += 8) {\
-			obj[i] = ring[idx]; \
-			obj[i + 1] = ring[idx + 1]; \
-			obj[i + 2] = ring[idx + 2]; \
-			obj[i + 3] = ring[idx + 3]; \
-			obj[i + 4] = ring[idx + 4]; \
-			obj[i + 5] = ring[idx + 5]; \
-			obj[i + 6] = ring[idx + 6]; \
-			obj[i + 7] = ring[idx + 7]; \
+		for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
+			memcpy(obj + i, ring + idx, 8 * sizeof(uint32_t)); \
 		} \
 		switch (n & 0x7) { \
 		case 7: \
@@ -268,58 +179,6 @@ struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
 	} \
 } while (0)
 
-#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
-	unsigned int i; \
-	uint32_t idx = cons_head & (r)->mask; \
-	const uint32_t size = (r)->size; \
-	uint64_t *ring = (uint64_t *)ring_start; \
-	uint64_t *obj = (uint64_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n & (~(unsigned)0x3)); i += 4, idx += 4) {\
-			obj[i] = ring[idx]; \
-			obj[i + 1] = ring[idx + 1]; \
-			obj[i + 2] = ring[idx + 2]; \
-			obj[i + 3] = ring[idx + 3]; \
-		} \
-		switch (n & 0x3) { \
-		case 3: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
-		case 2: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
-		case 1: \
-			obj[i++] = ring[idx++]; \
-		} \
-	} else { \
-		for (i = 0; idx < size; i++, idx++) \
-			obj[i] = ring[idx]; \
-		for (idx = 0; i < n; i++, idx++) \
-			obj[i] = ring[idx]; \
-	} \
-} while (0)
-
-#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
-	unsigned int i; \
-	uint32_t idx = cons_head & (r)->mask; \
-	const uint32_t size = (r)->size; \
-	__uint128_t *ring = (__uint128_t *)ring_start; \
-	__uint128_t *obj = (__uint128_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
-			obj[i] = ring[idx]; \
-			obj[i + 1] = ring[idx + 1]; \
-		} \
-		switch (n & 0x1) { \
-		case 1: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
-		} \
-	} else { \
-		for (i = 0; idx < size; i++, idx++) \
-			obj[i] = ring[idx]; \
-		for (idx = 0; i < n; i++, idx++) \
-			obj[i] = ring[idx]; \
-	} \
-} while (0)
-
 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
  * There are 2 choices for the users
@@ -373,7 +232,7 @@ __rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
 	if (n == 0)
 		goto end;
 
-	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
+	ENQUEUE_PTRS_GEN(r, &r[1], prod_head, obj_table, esize, n);
 
 	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
 end:
@@ -420,7 +279,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 	if (n == 0)
 		goto end;
 
-	DEQUEUE_PTRS_ELEM(r, &r[1], cons_head, obj_table, esize, n);
+	DEQUEUE_PTRS_GEN(r, &r[1], cons_head, obj_table, esize, n);
 
 	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
 
-- 
2.17.1


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-17 11:51                 ` Ananyev, Konstantin
@ 2019-10-17 20:16                   ` Honnappa Nagarahalli
  2019-10-17 23:17                     ` David Christensen
  0 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-17 20:16 UTC (permalink / raw)
  To: Ananyev, Konstantin, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula, David Christensen
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd, nd

<snip>

+ David Christensen for Power architecture

> > >
> > > > It
> > > > would mean extra work for the users.
> > > >
> > > > > 2. A lot of code duplication with these 3 copies of
> > > > > ENQUEUE/DEQUEUE macros.
> > > > >
> > > > > Looking at ENQUEUE/DEQUEUE macros, I can see that main loop
> > > > > always does 32B copy per iteration.
> > > > Yes, I tried to keep it the same as the existing one (originally,
> > > > I guess the intention was to allow for 256b vector instructions to
> > > > be
> > > > generated)
> > > >
> > > > > So wonder can we make a generic function that would do 32B copy
> > > > > per iteration in a main loop, and copy tail  by 4B chunks?
> > > > > That would avoid copy duplication and will allow user to have
> > > > > any elem size (multiple of 4B) he wants.
> > > > > Something like that (note didn't test it, just a rough idea):
> > > > >
> > > > >  static inline void
> > > > > copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
> > > > > uint32_t
> > > > > esize) {
> > > > >         uint32_t i, sz;
> > > > >
> > > > >         sz = (num * esize) / sizeof(uint32_t);
> > > > If 'num' is a compile time constant, 'sz' will be a compile time constant.
> > > Otherwise, this will result in a multiplication operation.
> > >
> > > Not always.
> > > If esize is compile time constant, then for esize as power of 2
> > > (4,8,16,...), it would be just one shift.
> > > For other constant values it could be a 'mul' or in many cases just
> > > 2 shifts plus 'add' (if compiler is smart enough).
> > > I.E. let say for 24B elem is would be either num * 6 or (num << 2) +
> > > (num << 1).
> > With num * 15 it has to be (num << 3) + (num << 2) + (num << 1) + num
> > Not sure if the compiler will do this.
> 
> For 15, it can be just (num << 4) - num
> 
> >
> > > I suppose for non-power of 2 elems it might be ok to get such small perf hit.
> > Agree, should be ok not to focus on right now.
> >
> > >
> > > >I have tried
> > > > to avoid the multiplication operation and try to use shift and
> > > >mask
> > > operations (just like how the rest of the ring code does).
> > > >
> > > > >
> > > > >         for (i = 0; i < (sz & ~7); i += 8)
> > > > >                 memcpy(du32 + i, su32 + i, 8 *
> > > > > sizeof(uint32_t));
> > > > I had used memcpy to start with (for the entire copy operation),
> > > > performance is not the same for 64b elements when compared with
> > > > the
> > > existing ring APIs (some cases more and some cases less).
> > >
> > > I remember that from one of your previous mails, that's why here I
> > > suggest to use in a loop memcpy() with fixed size.
> > > That way for each iteration complier will replace memcpy() with
> > > instructions to copy 32B in a way he thinks is optimal (same as for original
> macro, I think).
> > I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results are as
> follows. The numbers in brackets are with the code on master.
> > gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> >
> > RTE>>ring_perf_elem_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 5
> > MP/MC single enq/dequeue: 40 (35)
> > SP/SC burst enq/dequeue (size: 8): 2
> > MP/MC burst enq/dequeue (size: 8): 6
> > SP/SC burst enq/dequeue (size: 32): 1 (2)
> > MP/MC burst enq/dequeue (size: 32): 2
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 2.11
> > MC empty dequeue: 1.41 (2.11)
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> >
> > ### Testing using two NUMA nodes ###
> > SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> >
> > On one of the Arm platform
> > MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)
> 
> So it shows better numbers for one core, but worse on 2, right?
> 
> 
> > On another Arm platform, all numbers are same or slightly better.
> >
> > I can post the patch with this change if you want to run some benchmarks on
> your platform.
> 
> Sure, please do.
> I'll try to run on my boxes.
Sent v5, please check. Other platform owners should run this as well.

> 
> > I have not used the same code you have suggested, instead I have used the
> same logic in a single macro with memcpy.
> >


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/3] lib/ring: apis to support configurable element size
  2019-10-17 20:08     ` [dpdk-dev] [PATCH v5 1/3] lib/ring: apis to support configurable " Honnappa Nagarahalli
@ 2019-10-17 20:39       ` Stephen Hemminger
  2019-10-17 20:40       ` Stephen Hemminger
  1 sibling, 0 replies; 173+ messages in thread
From: Stephen Hemminger @ 2019-10-17 20:39 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal, dev,
	dharmik.thakkar, ruifeng.wang, gavin.hu

On Thu, 17 Oct 2019 15:08:05 -0500
Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:

> +	if ((esize != 4) && (esize != 8) && (esize != 16)) {
> +		RTE_LOG(ERR, RING,
> +			"Unsupported esize value. Supported values are 4, 8 and 16\n");
> +
> +		return -EINVAL;
> +	}
> +
>  	/* count must be a power of 2 */
>  	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {

Minor nit, you don't need as many parens in conditionals.

	if (esize != 4 && esize != 8 && esize != 16) {

and
	if (!POWEROF2(count) || count > RTE_RING_SZ_MASK) {
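
For illustration, the two checks with the reduced parentheses can be
put into a small standalone function. This is only a sketch:
check_ring_args() and SKETCH_RING_SZ_MASK are hypothetical names, the
mask value is an assumption standing in for RTE_RING_SZ_MASK, and
unlike a plain power-of-2 test it also rejects a count of 0.

```c
#include <errno.h>
#include <stdint.h>

/* Stand-in for RTE_RING_SZ_MASK; the value used here is an assumption. */
#define SKETCH_RING_SZ_MASK 0x7fffffffU

/*
 * Hypothetical helper: returns 0 if the arguments are acceptable,
 * -EINVAL otherwise.
 */
static int
check_ring_args(unsigned int esize, unsigned int count)
{
	/* && binds tighter than ||, and != tighter than both. */
	if (esize != 4 && esize != 8 && esize != 16)
		return -EINVAL;

	/* count must be a non-zero power of 2 within the size limit */
	if (count == 0 || (count & (count - 1)) != 0 ||
			count > SKETCH_RING_SZ_MASK)
		return -EINVAL;

	return 0;
}
```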

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v5 1/3] lib/ring: apis to support configurable element size
  2019-10-17 20:08     ` [dpdk-dev] [PATCH v5 1/3] lib/ring: apis to support configurable " Honnappa Nagarahalli
  2019-10-17 20:39       ` Stephen Hemminger
@ 2019-10-17 20:40       ` Stephen Hemminger
  1 sibling, 0 replies; 173+ messages in thread
From: Stephen Hemminger @ 2019-10-17 20:40 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal, dev,
	dharmik.thakkar, ruifeng.wang, gavin.hu

On Thu, 17 Oct 2019 15:08:05 -0500
Honnappa Nagarahalli <honnappa.nagarahalli@arm.com> wrote:

>  	/* count must be a power of 2 */
>  	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
>  		RTE_LOG(ERR, RING,
> -			"Requested size is invalid, must be power of 2, and "
> -			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
> +			"Requested number of elements is invalid, must be "
> +			"power of 2, and do not exceed the limit %u\n",

Error messages often go to syslog. Please don't use multi-line messages; syslog doesn't handle them.
Better to be less wordy.
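
As a sketch of what a shorter, single-line message could look like
(the wording and the helper name format_count_error() are only
suggestions, not something taken from the patch):

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: format the error as one short line with a single
 * trailing newline. Returns the number of characters written, excluding
 * the terminating NUL.
 */
static int
format_count_error(char *buf, size_t len, unsigned int limit)
{
	return snprintf(buf, len,
		"Invalid count: must be a power of 2 and at most %u\n",
		limit);
}
```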

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-17 20:16                   ` Honnappa Nagarahalli
@ 2019-10-17 23:17                     ` David Christensen
  2019-10-18  3:18                       ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: David Christensen @ 2019-10-17 23:17 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Ananyev, Konstantin, olivier.matz,
	sthemmin, jerinj, Richardson, Bruce, david.marchand,
	pbhagavatula
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd

>>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results are as
>> follows. The numbers in brackets are with the code on master.
>>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
>>>
>>> RTE>>ring_perf_elem_autotest
>>> ### Testing single element and burst enq/deq ###
>>> SP/SC single enq/dequeue: 5
>>> MP/MC single enq/dequeue: 40 (35)
>>> SP/SC burst enq/dequeue (size: 8): 2
>>> MP/MC burst enq/dequeue (size: 8): 6
>>> SP/SC burst enq/dequeue (size: 32): 1 (2)
>>> MP/MC burst enq/dequeue (size: 32): 2
>>>
>>> ### Testing empty dequeue ###
>>> SC empty dequeue: 2.11
>>> MC empty dequeue: 1.41 (2.11)
>>>
>>> ### Testing using a single lcore ###
>>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
>>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
>>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
>>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
>>>
>>> ### Testing using two physical cores ###
>>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
>>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
>>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
>>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
>>>
>>> ### Testing using two NUMA nodes ###
>>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
>>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
>>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
>>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
>>>
>>> On one of the Arm platform
>>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)

Tried this on a Power9 platform (3.6GHz), with two NUMA nodes and 16
cores/node (SMT=4).  Applied all 3 patches in v5; test results are as
follows:

RTE>>ring_perf_elem_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 42
MP/MC single enq/dequeue: 59
SP/SC burst enq/dequeue (size: 8): 5
MP/MC burst enq/dequeue (size: 8): 7
SP/SC burst enq/dequeue (size: 32): 2
MP/MC burst enq/dequeue (size: 32): 2

### Testing empty dequeue ###
SC empty dequeue: 7.81
MC empty dequeue: 7.81

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 5.76
MP/MC bulk enq/dequeue (size: 8): 7.66
SP/SC bulk enq/dequeue (size: 32): 2.10
MP/MC bulk enq/dequeue (size: 32): 2.57

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 13.13
MP/MC bulk enq/dequeue (size: 8): 13.98
SP/SC bulk enq/dequeue (size: 32): 3.41
MP/MC bulk enq/dequeue (size: 32): 4.45

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 11.00
MP/MC bulk enq/dequeue (size: 8): 10.95
SP/SC bulk enq/dequeue (size: 32): 3.08
MP/MC bulk enq/dequeue (size: 32): 3.40

### Testing using two NUMA nodes ###
SP/SC bulk enq/dequeue (size: 8): 63.41
MP/MC bulk enq/dequeue (size: 8): 62.70
SP/SC bulk enq/dequeue (size: 32): 15.39
MP/MC bulk enq/dequeue (size: 32): 22.96

Dave

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-17 23:17                     ` David Christensen
@ 2019-10-18  3:18                       ` Honnappa Nagarahalli
  2019-10-18  8:04                         ` Jerin Jacob
  2019-10-18 17:23                         ` David Christensen
  0 siblings, 2 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-18  3:18 UTC (permalink / raw)
  To: David Christensen, Ananyev, Konstantin, olivier.matz, sthemmin,
	jerinj, Richardson, Bruce, david.marchand, pbhagavatula
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, Honnappa Nagarahalli, nd, nd

<snip>

> Subject: Re: [PATCH v4 1/2] lib/ring: apis to support configurable element
> size
> 
> >>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results
> >>> are as
> >> follows. The numbers in brackets are with the code on master.
> >>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> >>>
> >>> RTE>>ring_perf_elem_autotest
> >>> ### Testing single element and burst enq/deq ###
> >>> SP/SC single enq/dequeue: 5
> >>> MP/MC single enq/dequeue: 40 (35)
> >>> SP/SC burst enq/dequeue (size: 8): 2
> >>> MP/MC burst enq/dequeue (size: 8): 6
> >>> SP/SC burst enq/dequeue (size: 32): 1 (2)
> >>> MP/MC burst enq/dequeue (size: 32): 2
> >>>
> >>> ### Testing empty dequeue ###
> >>> SC empty dequeue: 2.11
> >>> MC empty dequeue: 1.41 (2.11)
> >>>
> >>> ### Testing using a single lcore ###
> >>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> >>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> >>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> >>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> >>>
> >>> ### Testing using two physical cores ###
> >>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> >>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> >>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> >>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> >>>
> >>> ### Testing using two NUMA nodes ###
> >>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> >>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> >>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> >>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> >>>
> >>> On one of the Arm platform
> >>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)
> 
> Tried this on a Power9 platform (3.6GHz), with two numa nodes and 16
> cores/node (SMT=4).  Applied all 3 patches in v5, test results are as
> follows:
> 
> RTE>>ring_perf_elem_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 42
> MP/MC single enq/dequeue: 59
> SP/SC burst enq/dequeue (size: 8): 5
> MP/MC burst enq/dequeue (size: 8): 7
> SP/SC burst enq/dequeue (size: 32): 2
> MP/MC burst enq/dequeue (size: 32): 2
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 7.81
> MC empty dequeue: 7.81
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 5.76
> MP/MC bulk enq/dequeue (size: 8): 7.66
> SP/SC bulk enq/dequeue (size: 32): 2.10
> MP/MC bulk enq/dequeue (size: 32): 2.57
> 
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 13.13
> MP/MC bulk enq/dequeue (size: 8): 13.98
> SP/SC bulk enq/dequeue (size: 32): 3.41
> MP/MC bulk enq/dequeue (size: 32): 4.45
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 11.00
> MP/MC bulk enq/dequeue (size: 8): 10.95
> SP/SC bulk enq/dequeue (size: 32): 3.08
> MP/MC bulk enq/dequeue (size: 32): 3.40
> 
> ### Testing using two NUMA nodes ###
> SP/SC bulk enq/dequeue (size: 8): 63.41
> MP/MC bulk enq/dequeue (size: 8): 62.70
> SP/SC bulk enq/dequeue (size: 32): 15.39
> MP/MC bulk enq/dequeue (size: 32): 22.96
> 
Thanks for running this. There is another test 'ring_perf_autotest' which provides the numbers with the original implementation. The goal is to make sure the numbers with the original implementation are the same as these. Can you please run that as well?
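
One way to compare the two runs line by line is to compute the relative
change for each result. The following is only a minimal sketch;
pct_change() and report() are hypothetical helpers and are not part of
the test application.

```c
#include <stdio.h>

/* Hypothetical helper: percentage change of 'elem' relative to 'orig'. */
static double
pct_change(double orig, double elem)
{
	return (elem - orig) / orig * 100.0;
}

/* Print one comparison row: name, original, new, signed % change. */
static void
report(const char *name, double orig, double elem)
{
	printf("%-44s %8.2f %8.2f %+8.1f%%\n",
		name, orig, elem, pct_change(orig, elem));
}
```

For example, report("SP/SC bulk enq/dequeue (size: 8)", 15.33, 73.81)
would cover the x86 two-core case quoted earlier in this thread, where
the new code takes about 4.8 times as many cycles as master.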

> Dave

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-18  3:18                       ` Honnappa Nagarahalli
@ 2019-10-18  8:04                         ` Jerin Jacob
  2019-10-18 16:11                           ` Jerin Jacob
  2019-10-18 16:44                           ` Ananyev, Konstantin
  2019-10-18 17:23                         ` David Christensen
  1 sibling, 2 replies; 173+ messages in thread
From: Jerin Jacob @ 2019-10-18  8:04 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: David Christensen, Ananyev, Konstantin, olivier.matz, sthemmin,
	jerinj, Richardson, Bruce, david.marchand, pbhagavatula, dev,
	Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd

On Fri, Oct 18, 2019 at 8:48 AM Honnappa Nagarahalli
<Honnappa.Nagarahalli@arm.com> wrote:
>
> <snip>
>
> > Subject: Re: [PATCH v4 1/2] lib/ring: apis to support configurable element
> > size
> >
> > >>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results
> > >>> are as
> > >> follows. The numbers in brackets are with the code on master.
> > >>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> > >>>
> > >>> RTE>>ring_perf_elem_autotest
> > >>> ### Testing single element and burst enq/deq ###
> > >>> SP/SC single enq/dequeue: 5
> > >>> MP/MC single enq/dequeue: 40 (35)
> > >>> SP/SC burst enq/dequeue (size: 8): 2
> > >>> MP/MC burst enq/dequeue (size: 8): 6
> > >>> SP/SC burst enq/dequeue (size: 32): 1 (2)
> > >>> MP/MC burst enq/dequeue (size: 32): 2
> > >>>
> > >>> ### Testing empty dequeue ###
> > >>> SC empty dequeue: 2.11
> > >>> MC empty dequeue: 1.41 (2.11)
> > >>>
> > >>> ### Testing using a single lcore ###
> > >>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > >>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > >>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > >>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> > >>>
> > >>> ### Testing using two physical cores ###
> > >>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > >>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > >>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > >>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> > >>>
> > >>> ### Testing using two NUMA nodes ###
> > >>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > >>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > >>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > >>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> > >>>
> > >>> On one of the Arm platform
> > >>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)
> >
> > Tried this on a Power9 platform (3.6GHz), with two numa nodes and 16
> > cores/node (SMT=4).  Applied all 3 patches in v5, test results are as
> > follows:
> >
> > RTE>>ring_perf_elem_autotest
> > ### Testing single element and burst enq/deq ### SP/SC single enq/dequeue:
> > 42 MP/MC single enq/dequeue: 59 SP/SC burst enq/dequeue (size: 8): 5
> > MP/MC burst enq/dequeue (size: 8): 7 SP/SC burst enq/dequeue (size: 32): 2
> > MP/MC burst enq/dequeue (size: 32): 2
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 7.81
> > MC empty dequeue: 7.81
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 5.76
> > MP/MC bulk enq/dequeue (size: 8): 7.66
> > SP/SC bulk enq/dequeue (size: 32): 2.10
> > MP/MC bulk enq/dequeue (size: 32): 2.57
> >
> > ### Testing using two hyperthreads ###
> > SP/SC bulk enq/dequeue (size: 8): 13.13
> > MP/MC bulk enq/dequeue (size: 8): 13.98
> > SP/SC bulk enq/dequeue (size: 32): 3.41
> > MP/MC bulk enq/dequeue (size: 32): 4.45
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 11.00
> > MP/MC bulk enq/dequeue (size: 8): 10.95
> > SP/SC bulk enq/dequeue (size: 32): 3.08
> > MP/MC bulk enq/dequeue (size: 32): 3.40
> >
> > ### Testing using two NUMA nodes ###
> > SP/SC bulk enq/dequeue (size: 8): 63.41
> > MP/MC bulk enq/dequeue (size: 8): 62.70
> > SP/SC bulk enq/dequeue (size: 32): 15.39
> > MP/MC bulk enq/dequeue (size: 32): 22.96
> >
> Thanks for running this. There is another test 'ring_perf_autotest' which provides the numbers with the original implementation. The goal is to make sure the numbers with the original implementation are the same as these. Can you please run that as well?

Honnappa,

Your earlier perf report shows cycle counts below 1. That
is because it uses the 50 or 100MHz clock in EL0.
Please check with the PMU counter. See "ARM64 profiling" in

http://doc.dpdk.org/guides/prog_guide/profile_app.html


Here are the octeontx2 values. There is a regression in the two-core cases,
as you reported earlier on x86.


RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 288
MP/MC single enq/dequeue: 452
SP/SC burst enq/dequeue (size: 8): 39
MP/MC burst enq/dequeue (size: 8): 61
SP/SC burst enq/dequeue (size: 32): 13
MP/MC burst enq/dequeue (size: 32): 21

### Testing empty dequeue ###
SC empty dequeue: 6.33
MC empty dequeue: 6.67

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 38.35
MP/MC bulk enq/dequeue (size: 8): 67.36
SP/SC bulk enq/dequeue (size: 32): 13.10
MP/MC bulk enq/dequeue (size: 32): 21.64

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 75.94
MP/MC bulk enq/dequeue (size: 8): 107.66
SP/SC bulk enq/dequeue (size: 32): 24.51
MP/MC bulk enq/dequeue (size: 32): 33.23
Test OK
RTE>>

---- after applying v5 of the patch ------

RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 289
MP/MC single enq/dequeue: 452
SP/SC burst enq/dequeue (size: 8): 40
MP/MC burst enq/dequeue (size: 8): 64
SP/SC burst enq/dequeue (size: 32): 13
MP/MC burst enq/dequeue (size: 32): 22

### Testing empty dequeue ###
SC empty dequeue: 6.33
MC empty dequeue: 6.67

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 39.73
MP/MC bulk enq/dequeue (size: 8): 69.13
SP/SC bulk enq/dequeue (size: 32): 13.44
MP/MC bulk enq/dequeue (size: 32): 22.00

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 76.02
MP/MC bulk enq/dequeue (size: 8): 112.50
SP/SC bulk enq/dequeue (size: 32): 24.71
MP/MC bulk enq/dequeue (size: 32): 33.34
Test OK
RTE>>

RTE>>ring_perf_elem_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 290
MP/MC single enq/dequeue: 503
SP/SC burst enq/dequeue (size: 8): 39
MP/MC burst enq/dequeue (size: 8): 63
SP/SC burst enq/dequeue (size: 32): 11
MP/MC burst enq/dequeue (size: 32): 19

### Testing empty dequeue ###
SC empty dequeue: 6.33
MC empty dequeue: 6.67

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 38.92
MP/MC bulk enq/dequeue (size: 8): 62.54
SP/SC bulk enq/dequeue (size: 32): 11.46
MP/MC bulk enq/dequeue (size: 32): 19.89

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 87.55
MP/MC bulk enq/dequeue (size: 8): 99.10
SP/SC bulk enq/dequeue (size: 32): 26.63
MP/MC bulk enq/dequeue (size: 32): 29.91
Test OK
RTE>>
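
For reading the runs above side by side, the relative change between two
cycle counts can be worked out as below. This is only a sketch: pct_change
is a hypothetical helper, not part of the test app, and it assumes the
comparison of interest is ring_perf_autotest before the patch against
ring_perf_elem_autotest with v5 applied.

```c
/* Relative change in percent between a baseline cycle count and a new one.
 * A positive result means more cycles per operation, i.e. slower. */
static double pct_change(double base, double val)
{
	return (val - base) / base * 100.0;
}
```

For example, pct_change(75.94, 87.55) takes the two physical cores SP/SC
(size: 8) figures from the first and last runs above and comes out at
roughly 15, while pct_change(107.66, 99.10) for the MP/MC (size: 8) case
is negative, i.e. fewer cycles.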



> > Dave


* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-18  8:04                         ` Jerin Jacob
@ 2019-10-18 16:11                           ` Jerin Jacob
  2019-10-21  0:27                             ` Honnappa Nagarahalli
  2019-10-18 16:44                           ` Ananyev, Konstantin
  1 sibling, 1 reply; 173+ messages in thread
From: Jerin Jacob @ 2019-10-18 16:11 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: David Christensen, Ananyev, Konstantin, olivier.matz, sthemmin,
	jerinj, Richardson, Bruce, david.marchand, pbhagavatula, dev,
	Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd

On Fri, Oct 18, 2019 at 1:34 PM Jerin Jacob <jerinjacobk@gmail.com> wrote:
>
> On Fri, Oct 18, 2019 at 8:48 AM Honnappa Nagarahalli
> <Honnappa.Nagarahalli@arm.com> wrote:
> >
> > <snip>
> >
> > > Subject: Re: [PATCH v4 1/2] lib/ring: apis to support configurable element
> > > size
> > >
> > > >>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results
> > > >>> are as
> > > >> follows. The numbers in brackets are with the code on master.
> > > >>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> > > >>>
> > > >>> RTE>>ring_perf_elem_autotest
> > > >>> ### Testing single element and burst enq/deq ###
> > > >>> SP/SC single enq/dequeue: 5
> > > >>> MP/MC single enq/dequeue: 40 (35)
> > > >>> SP/SC burst enq/dequeue (size: 8): 2
> > > >>> MP/MC burst enq/dequeue (size: 8): 6
> > > >>> SP/SC burst enq/dequeue (size: 32): 1 (2)
> > > >>> MP/MC burst enq/dequeue (size: 32): 2
> > > >>>
> > > >>> ### Testing empty dequeue ###
> > > >>> SC empty dequeue: 2.11
> > > >>> MC empty dequeue: 1.41 (2.11)
> > > >>>
> > > >>> ### Testing using a single lcore ###
> > > >>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > > >>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > > >>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > > >>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> > > >>>
> > > >>> ### Testing using two physical cores ###
> > > >>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > > >>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > > >>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > > >>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> > > >>>
> > > >>> ### Testing using two NUMA nodes ###
> > > >>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > > >>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > > >>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > > >>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> > > >>>
> > > >>> On one of the Arm platform
> > > >>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)
> > >
> > > Tried this on a Power9 platform (3.6GHz), with two numa nodes and 16
> > > cores/node (SMT=4).  Applied all 3 patches in v5, test results are as
> > > follows:
> > >
> > > RTE>>ring_perf_elem_autotest
> > > ### Testing single element and burst enq/deq ###
> > > SP/SC single enq/dequeue: 42
> > > MP/MC single enq/dequeue: 59
> > > SP/SC burst enq/dequeue (size: 8): 5
> > > MP/MC burst enq/dequeue (size: 8): 7
> > > SP/SC burst enq/dequeue (size: 32): 2
> > > MP/MC burst enq/dequeue (size: 32): 2
> > >
> > > ### Testing empty dequeue ###
> > > SC empty dequeue: 7.81
> > > MC empty dequeue: 7.81
> > >
> > > ### Testing using a single lcore ###
> > > SP/SC bulk enq/dequeue (size: 8): 5.76
> > > MP/MC bulk enq/dequeue (size: 8): 7.66
> > > SP/SC bulk enq/dequeue (size: 32): 2.10
> > > MP/MC bulk enq/dequeue (size: 32): 2.57
> > >
> > > ### Testing using two hyperthreads ###
> > > SP/SC bulk enq/dequeue (size: 8): 13.13
> > > MP/MC bulk enq/dequeue (size: 8): 13.98
> > > SP/SC bulk enq/dequeue (size: 32): 3.41
> > > MP/MC bulk enq/dequeue (size: 32): 4.45
> > >
> > > ### Testing using two physical cores ###
> > > SP/SC bulk enq/dequeue (size: 8): 11.00
> > > MP/MC bulk enq/dequeue (size: 8): 10.95
> > > SP/SC bulk enq/dequeue (size: 32): 3.08
> > > MP/MC bulk enq/dequeue (size: 32): 3.40
> > >
> > > ### Testing using two NUMA nodes ###
> > > SP/SC bulk enq/dequeue (size: 8): 63.41
> > > MP/MC bulk enq/dequeue (size: 8): 62.70
> > > SP/SC bulk enq/dequeue (size: 32): 15.39
> > > MP/MC bulk enq/dequeue (size: 32): 22.96
> > >
> > Thanks for running this. There is another test 'ring_perf_autotest' which provides the numbers with the original implementation. The goal is to make sure the numbers with the original implementation are the same as these. Can you please run that as well?
>
> Honnappa,
>
> Your earlier perf report shows the cycles are in less than 1. That's
> is due to it is using 50 or 100MHz clock in EL0.
> Please check with PMU counter. See "ARM64 profiling" in
>
> http://doc.dpdk.org/guides/prog_guide/profile_app.html
>
>
> Here is the octeontx2 values. There is a regression in two core cases
> as you reported earlier in x86.
>
>
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 288
> MP/MC single enq/dequeue: 452
> SP/SC burst enq/dequeue (size: 8): 39
> MP/MC burst enq/dequeue (size: 8): 61
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 21
>
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.67
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 38.35
> MP/MC bulk enq/dequeue (size: 8): 67.36
> SP/SC bulk enq/dequeue (size: 32): 13.10
> MP/MC bulk enq/dequeue (size: 32): 21.64
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 75.94
> MP/MC bulk enq/dequeue (size: 8): 107.66
> SP/SC bulk enq/dequeue (size: 32): 24.51
> MP/MC bulk enq/dequeue (size: 32): 33.23
> Test OK
> RTE>>
>
> ---- after applying v5 of the patch ------
>
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 289
> MP/MC single enq/dequeue: 452
> SP/SC burst enq/dequeue (size: 8): 40
> MP/MC burst enq/dequeue (size: 8): 64
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 22
>
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.67
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 39.73
> MP/MC bulk enq/dequeue (size: 8): 69.13
> SP/SC bulk enq/dequeue (size: 32): 13.44
> MP/MC bulk enq/dequeue (size: 32): 22.00
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 76.02
> MP/MC bulk enq/dequeue (size: 8): 112.50
> SP/SC bulk enq/dequeue (size: 32): 24.71
> MP/MC bulk enq/dequeue (size: 32): 33.34
> Test OK
> RTE>>
>
> RTE>>ring_perf_elem_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 290
> MP/MC single enq/dequeue: 503
> SP/SC burst enq/dequeue (size: 8): 39
> MP/MC burst enq/dequeue (size: 8): 63
> SP/SC burst enq/dequeue (size: 32): 11
> MP/MC burst enq/dequeue (size: 32): 19
>
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.67
>
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 38.92
> MP/MC bulk enq/dequeue (size: 8): 62.54
> SP/SC bulk enq/dequeue (size: 32): 11.46
> MP/MC bulk enq/dequeue (size: 32): 19.89
>
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 87.55
> MP/MC bulk enq/dequeue (size: 8): 99.10
> SP/SC bulk enq/dequeue (size: 32): 26.63
> MP/MC bulk enq/dequeue (size: 32): 29.91
> Test OK
> RTE>>

It looks like dropping patch 3/3 and keeping only 1/3 and 2/3 gives better
results in some cases:


RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 288
MP/MC single enq/dequeue: 439
SP/SC burst enq/dequeue (size: 8): 39
MP/MC burst enq/dequeue (size: 8): 61
SP/SC burst enq/dequeue (size: 32): 13
MP/MC burst enq/dequeue (size: 32): 22

### Testing empty dequeue ###
SC empty dequeue: 6.33
MC empty dequeue: 6.67

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 38.35
MP/MC bulk enq/dequeue (size: 8): 67.48
SP/SC bulk enq/dequeue (size: 32): 13.40
MP/MC bulk enq/dequeue (size: 32): 22.03

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 75.94
MP/MC bulk enq/dequeue (size: 8): 105.84
SP/SC bulk enq/dequeue (size: 32): 25.11
MP/MC bulk enq/dequeue (size: 32): 33.48
Test OK
RTE>>


RTE>>ring_perf_elem_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 288
MP/MC single enq/dequeue: 452
SP/SC burst enq/dequeue (size: 8): 39
MP/MC burst enq/dequeue (size: 8): 61
SP/SC burst enq/dequeue (size: 32): 13
MP/MC burst enq/dequeue (size: 32): 22

### Testing empty dequeue ###
SC empty dequeue: 6.33
MC empty dequeue: 6.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 38.35
MP/MC bulk enq/dequeue (size: 8): 67.46
SP/SC bulk enq/dequeue (size: 32): 13.42
MP/MC bulk enq/dequeue (size: 32): 22.01

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 76.04
MP/MC bulk enq/dequeue (size: 8): 104.88
SP/SC bulk enq/dequeue (size: 32): 24.75
MP/MC bulk enq/dequeue (size: 32): 34.66
Test OK
RTE>>


>
>
>
> > > Dave


* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-18  8:04                         ` Jerin Jacob
  2019-10-18 16:11                           ` Jerin Jacob
@ 2019-10-18 16:44                           ` Ananyev, Konstantin
  2019-10-18 19:03                             ` Honnappa Nagarahalli
  2019-10-21  0:36                             ` Honnappa Nagarahalli
  1 sibling, 2 replies; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-18 16:44 UTC (permalink / raw)
  To: Jerin Jacob, Honnappa Nagarahalli
  Cc: David Christensen, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula, dev, Dharmik Thakkar,
	Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd


Hi everyone,


> > > >>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results
> > > >>> are as
> > > >> follows. The numbers in brackets are with the code on master.
> > > >>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> > > >>>
> > > >>> RTE>>ring_perf_elem_autotest
> > > >>> ### Testing single element and burst enq/deq ###
> > > >>> SP/SC single enq/dequeue: 5
> > > >>> MP/MC single enq/dequeue: 40 (35)
> > > >>> SP/SC burst enq/dequeue (size: 8): 2
> > > >>> MP/MC burst enq/dequeue (size: 8): 6
> > > >>> SP/SC burst enq/dequeue (size: 32): 1 (2)
> > > >>> MP/MC burst enq/dequeue (size: 32): 2
> > > >>>
> > > >>> ### Testing empty dequeue ###
> > > >>> SC empty dequeue: 2.11
> > > >>> MC empty dequeue: 1.41 (2.11)
> > > >>>
> > > >>> ### Testing using a single lcore ###
> > > >>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > > >>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > > >>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > > >>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> > > >>>
> > > >>> ### Testing using two physical cores ###
> > > >>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > > >>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > > >>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > > >>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> > > >>>
> > > >>> ### Testing using two NUMA nodes ###
> > > >>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > > >>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > > >>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > > >>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> > > >>>
> > > >>> On one of the Arm platform
> > > >>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)
> > >
> > > Tried this on a Power9 platform (3.6GHz), with two numa nodes and 16
> > > cores/node (SMT=4).  Applied all 3 patches in v5, test results are as
> > > follows:
> > >
> > > RTE>>ring_perf_elem_autotest
> > > ### Testing single element and burst enq/deq ###
> > > SP/SC single enq/dequeue: 42
> > > MP/MC single enq/dequeue: 59
> > > SP/SC burst enq/dequeue (size: 8): 5
> > > MP/MC burst enq/dequeue (size: 8): 7
> > > SP/SC burst enq/dequeue (size: 32): 2
> > > MP/MC burst enq/dequeue (size: 32): 2
> > >
> > > ### Testing empty dequeue ###
> > > SC empty dequeue: 7.81
> > > MC empty dequeue: 7.81
> > >
> > > ### Testing using a single lcore ###
> > > SP/SC bulk enq/dequeue (size: 8): 5.76
> > > MP/MC bulk enq/dequeue (size: 8): 7.66
> > > SP/SC bulk enq/dequeue (size: 32): 2.10
> > > MP/MC bulk enq/dequeue (size: 32): 2.57
> > >
> > > ### Testing using two hyperthreads ###
> > > SP/SC bulk enq/dequeue (size: 8): 13.13
> > > MP/MC bulk enq/dequeue (size: 8): 13.98
> > > SP/SC bulk enq/dequeue (size: 32): 3.41
> > > MP/MC bulk enq/dequeue (size: 32): 4.45
> > >
> > > ### Testing using two physical cores ###
> > > SP/SC bulk enq/dequeue (size: 8): 11.00
> > > MP/MC bulk enq/dequeue (size: 8): 10.95
> > > SP/SC bulk enq/dequeue (size: 32): 3.08
> > > MP/MC bulk enq/dequeue (size: 32): 3.40
> > >
> > > ### Testing using two NUMA nodes ###
> > > SP/SC bulk enq/dequeue (size: 8): 63.41
> > > MP/MC bulk enq/dequeue (size: 8): 62.70
> > > SP/SC bulk enq/dequeue (size: 32): 15.39
> > > MP/MC bulk enq/dequeue (size: 32): 22.96
> > >
> > Thanks for running this. There is another test 'ring_perf_autotest' which provides the numbers with the original implementation. The goal is to make sure the numbers with the original implementation are the same as these. Can you please run that as well?
> 
> Honnappa,
> 
> Your earlier perf report shows the cycles are in less than 1. That's
> is due to it is using 50 or 100MHz clock in EL0.
> Please check with PMU counter. See "ARM64 profiling" in
> 
> http://doc.dpdk.org/guides/prog_guide/profile_app.html
> 
> 
> Here is the octeontx2 values. There is a regression in two core cases
> as you reported earlier in x86.
> 
> 
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 288
> MP/MC single enq/dequeue: 452
> SP/SC burst enq/dequeue (size: 8): 39
> MP/MC burst enq/dequeue (size: 8): 61
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 21
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.67
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 38.35
> MP/MC bulk enq/dequeue (size: 8): 67.36
> SP/SC bulk enq/dequeue (size: 32): 13.10
> MP/MC bulk enq/dequeue (size: 32): 21.64
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 75.94
> MP/MC bulk enq/dequeue (size: 8): 107.66
> SP/SC bulk enq/dequeue (size: 32): 24.51
> MP/MC bulk enq/dequeue (size: 32): 33.23
> Test OK
> RTE>>
> 
> ---- after applying v5 of the patch ------
> 
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 289
> MP/MC single enq/dequeue: 452
> SP/SC burst enq/dequeue (size: 8): 40
> MP/MC burst enq/dequeue (size: 8): 64
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 22
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.67
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 39.73
> MP/MC bulk enq/dequeue (size: 8): 69.13
> SP/SC bulk enq/dequeue (size: 32): 13.44
> MP/MC bulk enq/dequeue (size: 32): 22.00
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 76.02
> MP/MC bulk enq/dequeue (size: 8): 112.50
> SP/SC bulk enq/dequeue (size: 32): 24.71
> MP/MC bulk enq/dequeue (size: 32): 33.34
> Test OK
> RTE>>
> 
> RTE>>ring_perf_elem_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 290
> MP/MC single enq/dequeue: 503
> SP/SC burst enq/dequeue (size: 8): 39
> MP/MC burst enq/dequeue (size: 8): 63
> SP/SC burst enq/dequeue (size: 32): 11
> MP/MC burst enq/dequeue (size: 32): 19
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.67
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 38.92
> MP/MC bulk enq/dequeue (size: 8): 62.54
> SP/SC bulk enq/dequeue (size: 32): 11.46
> MP/MC bulk enq/dequeue (size: 32): 19.89
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 87.55
> MP/MC bulk enq/dequeue (size: 8): 99.10
> SP/SC bulk enq/dequeue (size: 32): 26.63
> MP/MC bulk enq/dequeue (size: 32): 29.91
> Test OK
> RTE>>
> 

As far as I can see, there is a copy&paste bug in patch #3
(that is probably why it produced some weird numbers for me at first).
With the fix applied (see patch below), things look pretty good on my box.
As far as I can see, there are only 3 results noticeably lower:
   SP/SC (size=8) over 2 physical cores on the same NUMA socket
   MP/MC (size=8) over 2 physical cores on different NUMA sockets
All the others seem about the same or better.
Anyway, I went ahead and reworked the code a bit (as I suggested before)
to get rid of these huge ENQUEUE/DEQUEUE macros.
Results are very close to the fixed patch #3 version (patch is also attached).
Though I suggest people hold off on re-running the perf tests until we make
the ring functional test run for the _elem_ functions too.
I started to work on that, but I am not sure I'll finish today (most likely Monday).
Perf results from my box, plus the patches, are below.
Konstantin

perf results
==========

Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
  
A - ring_perf_autotest
B - ring_perf_elem_autotest + patch #3 + fix
C - B + update

### Testing using a single lcore ###	A	B	C
SP/SC bulk enq/dequeue (size: 8): 	4.06	3.06	3.22
MP/MC bulk enq/dequeue (size: 8): 	10.05	9.04	9.38
SP/SC bulk enq/dequeue (size: 32): 	2.93	1.91	1.84
MP/MC bulk enq/dequeue (size: 32): 	4.12	3.39	3.35

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 	9.24	8.92	8.89
MP/MC bulk enq/dequeue (size: 8): 	15.47	15.39	16.02
SP/SC bulk enq/dequeue (size: 32): 	5.78	3.87	3.86
MP/MC bulk enq/dequeue (size: 32): 	6.41	4.57	4.45

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 	24.14	29.89	27.05
MP/MC bulk enq/dequeue (size: 8): 	68.61	70.55	69.85
SP/SC bulk enq/dequeue (size: 32): 	12.11	12.99	13.04
MP/MC bulk enq/dequeue (size: 32): 	22.14	17.86	18.25

### Testing using two NUMA nodes ###
SP/SC bulk enq/dequeue (size: 8): 	48.78	31.98	33.57
MP/MC bulk enq/dequeue (size: 8): 	167.53	197.29	192.13
SP/SC bulk enq/dequeue (size: 32): 	31.28	21.68	21.61
MP/MC bulk enq/dequeue (size: 32): 	53.45	49.94	48.81
 
fix patch
=======
 
From a2be5a9b136333a56d466ef042c655e522ca7012 Mon Sep 17 00:00:00 2001
From: Konstantin Ananyev <konstantin.ananyev@intel.com>
Date: Fri, 18 Oct 2019 15:50:43 +0100
Subject: [PATCH] fix1

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_ring/rte_ring_elem.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
index 92e92f150..5e1819069 100644
--- a/lib/librte_ring/rte_ring_elem.h
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -118,7 +118,7 @@ struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
        uint32_t sz = n * (esize / sizeof(uint32_t)); \
        if (likely(idx + n < size)) { \
                for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
-                       memcpy (ring + i, obj + i, 8 * sizeof (uint32_t)); \
+                       memcpy (ring + idx, obj + i, 8 * sizeof (uint32_t)); \
                } \
                switch (n & 0x7) { \
                case 7: \
@@ -153,7 +153,7 @@ struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
        uint32_t sz = n * (esize / sizeof(uint32_t)); \
        if (likely(idx + n < size)) { \
                for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
-                       memcpy (obj + i, ring + i, 8 * sizeof (uint32_t)); \
+                       memcpy (obj + i, ring + idx, 8 * sizeof (uint32_t)); \
                } \
                switch (n & 0x7) { \
                case 7: \
--
2.17.1
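
To illustrate what the fix above changes: in the enqueue direction the
loop index i walks the caller's object table from 0, while idx is the
slot in the ring where the head currently points, so the ring-side
pointer has to advance from idx. A minimal standalone sketch follows.
copy_in_at is a hypothetical helper, not a function from
rte_ring_elem.h, and it leaves out the wrap-around branch, so the caller
is assumed to guarantee that idx + n fits in the ring.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for illustration only: copy n 32-bit words from obj
 * into ring, starting at ring slot idx. No wrap-around handling. */
static void copy_in_at(uint32_t *ring, uint32_t idx,
		const uint32_t *obj, uint32_t n)
{
	uint32_t i;

	/* Blocks of 8 words: destination is ring + idx, source is obj + i.
	 * Using ring + i here would always overwrite the start of the ring. */
	for (i = 0; i + 8 <= n; i += 8, idx += 8)
		memcpy(ring + idx, obj + i, 8 * sizeof(uint32_t));

	/* Remaining 0..7 words, one at a time. */
	for (; i < n; i++, idx++)
		ring[idx] = obj[i];
}
```

With a zeroed ring of 32 words and n = 16, idx = 8, the data lands in
slots 8..23 and slots 0..7 stay untouched, which is what the dequeue side
(with source and destination swapped) then expects to read back.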

update patch (remove macros)
=========================

From 18b388e877b97e243f807f27a323e876b30869dd Mon Sep 17 00:00:00 2001
From: Konstantin Ananyev <konstantin.ananyev@intel.com>
Date: Fri, 18 Oct 2019 17:35:43 +0100
Subject: [PATCH] update1

Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_ring/rte_ring_elem.h | 141 ++++++++++++++++----------------
 1 file changed, 70 insertions(+), 71 deletions(-)

diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
index 5e1819069..eb706b12f 100644
--- a/lib/librte_ring/rte_ring_elem.h
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -109,75 +109,74 @@ __rte_experimental
 struct rte_ring *rte_ring_create_elem(const char *name, unsigned count,
                                unsigned esize, int socket_id, unsigned flags);

-#define ENQUEUE_PTRS_GEN(r, ring_start, prod_head, obj_table, esize, n) do { \
-       unsigned int i; \
-       const uint32_t size = (r)->size; \
-       uint32_t idx = prod_head & (r)->mask; \
-       uint32_t *ring = (uint32_t *)ring_start; \
-       uint32_t *obj = (uint32_t *)obj_table; \
-       uint32_t sz = n * (esize / sizeof(uint32_t)); \
-       if (likely(idx + n < size)) { \
-               for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
-                       memcpy (ring + idx, obj + i, 8 * sizeof (uint32_t)); \
-               } \
-               switch (n & 0x7) { \
-               case 7: \
-                       ring[idx++] = obj[i++]; /* fallthrough */ \
-               case 6: \
-                       ring[idx++] = obj[i++]; /* fallthrough */ \
-               case 5: \
-                       ring[idx++] = obj[i++]; /* fallthrough */ \
-               case 4: \
-                       ring[idx++] = obj[i++]; /* fallthrough */ \
-               case 3: \
-                       ring[idx++] = obj[i++]; /* fallthrough */ \
-               case 2: \
-                       ring[idx++] = obj[i++]; /* fallthrough */ \
-               case 1: \
-                       ring[idx++] = obj[i++]; /* fallthrough */ \
-               } \
-       } else { \
-               for (i = 0; idx < size; i++, idx++)\
-                       ring[idx] = obj[i]; \
-               for (idx = 0; i < n; i++, idx++) \
-                       ring[idx] = obj[i]; \
-       } \
-} while (0)
-
-#define DEQUEUE_PTRS_GEN(r, ring_start, cons_head, obj_table, esize, n) do { \
-       unsigned int i; \
-       uint32_t idx = cons_head & (r)->mask; \
-       const uint32_t size = (r)->size; \
-       uint32_t *ring = (uint32_t *)ring_start; \
-       uint32_t *obj = (uint32_t *)obj_table; \
-       uint32_t sz = n * (esize / sizeof(uint32_t)); \
-       if (likely(idx + n < size)) { \
-               for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
-                       memcpy (obj + i, ring + idx, 8 * sizeof (uint32_t)); \
-               } \
-               switch (n & 0x7) { \
-               case 7: \
-                       obj[i++] = ring[idx++]; /* fallthrough */ \
-               case 6: \
-                       obj[i++] = ring[idx++]; /* fallthrough */ \
-               case 5: \
-                       obj[i++] = ring[idx++]; /* fallthrough */ \
-               case 4: \
-                       obj[i++] = ring[idx++]; /* fallthrough */ \
-               case 3: \
-                       obj[i++] = ring[idx++]; /* fallthrough */ \
-               case 2: \
-                       obj[i++] = ring[idx++]; /* fallthrough */ \
-               case 1: \
-                       obj[i++] = ring[idx++]; /* fallthrough */ \
-               } \
-       } else { \
-               for (i = 0; idx < size; i++, idx++) \
-                       obj[i] = ring[idx]; \
-               for (idx = 0; i < n; i++, idx++) \
-                       obj[i] = ring[idx]; \
-       } \
-} while (0)
+static __rte_always_inline void
+copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num, uint32_t esize)
+{
+       uint32_t i, sz;
+
+       sz = (num * esize) / sizeof(uint32_t);
+
+       for (i = 0; i < (sz & ~7); i += 8)
+               memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
+
+       switch (sz & 7) {
+       case 7: du32[sz - 7] = su32[sz - 7]; /* fallthrough */
+       case 6: du32[sz - 6] = su32[sz - 6]; /* fallthrough */
+       case 5: du32[sz - 5] = su32[sz - 5]; /* fallthrough */
+       case 4: du32[sz - 4] = su32[sz - 4]; /* fallthrough */
+       case 3: du32[sz - 3] = su32[sz - 3]; /* fallthrough */
+       case 2: du32[sz - 2] = su32[sz - 2]; /* fallthrough */
+       case 1: du32[sz - 1] = su32[sz - 1]; /* fallthrough */
+       }
+}
+
+static __rte_always_inline void
+enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
+               void *obj_table, uint32_t num, uint32_t esize)
+{
+       uint32_t idx, n;
+       uint32_t *du32;
+       const uint32_t *su32;
+
+       const uint32_t size = r->size;
+
+       idx = prod_head & (r)->mask;
+
+       du32 = (uint32_t *)ring_start + idx;
+       su32 = obj_table;
+
+       if (idx + num < size)
+               copy_elems(du32, su32, num, esize);
+       else {
+               n = size - idx;
+               copy_elems(du32, su32, n, esize);
+               copy_elems(ring_start, su32 + n, num - n, esize);
+       }
+}
+
+static __rte_always_inline void
+dequeue_elems(struct rte_ring *r, void *ring_start, uint32_t cons_head,
+               void *obj_table, uint32_t num, uint32_t esize)
+{
+       uint32_t idx, n;
+       uint32_t *du32;
+       const uint32_t *su32;
+
+       const uint32_t size = r->size;
+
+       idx = cons_head & (r)->mask;
+
+       su32 = (uint32_t *)ring_start + idx;
+       du32 = obj_table;
+
+       if (idx + num < size)
+               copy_elems(du32, su32, num, esize);
+       else {
+               n = size - idx;
+               copy_elems(du32, su32, n, esize);
+               copy_elems(du32 + n, ring_start, num - n, esize);
+       }
+}

 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
@@ -232,7 +231,7 @@ __rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
        if (n == 0)
                goto end;

-       ENQUEUE_PTRS_GEN(r, &r[1], prod_head, obj_table, esize, n);
+       enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);

        update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
 end:
@@ -279,7 +278,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
        if (n == 0)
                goto end;

-       DEQUEUE_PTRS_GEN(r, &r[1], cons_head, obj_table, esize, n);
+       dequeue_elems(r, &r[1], cons_head, obj_table, n, esize);

        update_tail(&r->cons, cons_head, cons_next, is_sc, 0);

--
2.17.1
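
One thing that may be worth double-checking in the rework above, going
only by a reading of the diff and not by testing it: idx, num and n count
ring elements, while du32/su32 are uint32_t pointers, so for esize larger
than 4 the pointer offsets would seem to need scaling by
esize / sizeof(uint32_t). A minimal standalone sketch of the wrap-around
copy with that scaling is below. The names copy_words and enqueue_sketch
are hypothetical and the ring is a plain array here, not struct rte_ring.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: copy nwords 32-bit words from src to dst. */
static void copy_words(uint32_t *dst, const uint32_t *src, uint32_t nwords)
{
	memcpy(dst, src, (size_t)nwords * sizeof(uint32_t));
}

/* Hypothetical sketch: store num elements of esize bytes (assumed to be a
 * multiple of 4) into a ring holding `size` elements, starting at element
 * index idx and wrapping to the start of the ring when the end is reached. */
static void enqueue_sketch(uint32_t *ring, uint32_t size, uint32_t idx,
		const void *obj_table, uint32_t num, uint32_t esize)
{
	const uint32_t scale = esize / sizeof(uint32_t); /* words per element */
	const uint32_t *src = obj_table;

	if (idx + num <= size) {
		/* No wrap: one contiguous copy. */
		copy_words(ring + idx * scale, src, num * scale);
	} else {
		/* Wrap: n elements up to the end, the rest from slot 0. */
		uint32_t n = size - idx;

		copy_words(ring + idx * scale, src, n * scale);
		copy_words(ring, src + n * scale, (num - n) * scale);
	}
}
```

For esize == 4 the scale factor is 1 and this reduces to the offsets used
in the patch, which would be consistent with the perf tests not showing
the difference if they only exercise 4-byte elements (an assumption, the
test source is not shown here).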




* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-18  3:18                       ` Honnappa Nagarahalli
  2019-10-18  8:04                         ` Jerin Jacob
@ 2019-10-18 17:23                         ` David Christensen
  1 sibling, 0 replies; 173+ messages in thread
From: David Christensen @ 2019-10-18 17:23 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Ananyev, Konstantin, olivier.matz,
	sthemmin, jerinj, Richardson, Bruce, david.marchand,
	pbhagavatula
  Cc: dev, Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd


>> Tried this on a Power9 platform (3.6GHz), with two numa nodes and 16
>> cores/node (SMT=4).  Applied all 3 patches in v5, test results are as
>> follows:
>>
>> RTE>>ring_perf_elem_autotest
>> ### Testing single element and burst enq/deq ###
>> SP/SC single enq/dequeue: 42
>> MP/MC single enq/dequeue: 59
>> SP/SC burst enq/dequeue (size: 8): 5
>> MP/MC burst enq/dequeue (size: 8): 7
>> SP/SC burst enq/dequeue (size: 32): 2
>> MP/MC burst enq/dequeue (size: 32): 2
>>
>> ### Testing empty dequeue ###
>> SC empty dequeue: 7.81
>> MC empty dequeue: 7.81
>>
>> ### Testing using a single lcore ###
>> SP/SC bulk enq/dequeue (size: 8): 5.76
>> MP/MC bulk enq/dequeue (size: 8): 7.66
>> SP/SC bulk enq/dequeue (size: 32): 2.10
>> MP/MC bulk enq/dequeue (size: 32): 2.57
>>
>> ### Testing using two hyperthreads ###
>> SP/SC bulk enq/dequeue (size: 8): 13.13
>> MP/MC bulk enq/dequeue (size: 8): 13.98
>> SP/SC bulk enq/dequeue (size: 32): 3.41
>> MP/MC bulk enq/dequeue (size: 32): 4.45
>>
>> ### Testing using two physical cores ###
>> SP/SC bulk enq/dequeue (size: 8): 11.00
>> MP/MC bulk enq/dequeue (size: 8): 10.95
>> SP/SC bulk enq/dequeue (size: 32): 3.08
>> MP/MC bulk enq/dequeue (size: 32): 3.40
>>
>> ### Testing using two NUMA nodes ###
>> SP/SC bulk enq/dequeue (size: 8): 63.41
>> MP/MC bulk enq/dequeue (size: 8): 62.70
>> SP/SC bulk enq/dequeue (size: 32): 15.39
>> MP/MC bulk enq/dequeue (size: 32): 22.96
>>
> Thanks for running this. There is another test 'ring_perf_autotest' which provides the numbers with the original implementation. The goal is to make sure the numbers with the original implementation are the same as these. Can you please run that as well?
> 
RTE>>ring_perf_autotest
### Testing single element and burst enq/deq ###
SP/SC single enq/dequeue: 42
MP/MC single enq/dequeue: 59
SP/SC burst enq/dequeue (size: 8): 6
MP/MC burst enq/dequeue (size: 8): 8
SP/SC burst enq/dequeue (size: 32): 2
MP/MC burst enq/dequeue (size: 32): 3

### Testing empty dequeue ###
SC empty dequeue: 7.81
MC empty dequeue: 7.81

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 6.91
MP/MC bulk enq/dequeue (size: 8): 8.87
SP/SC bulk enq/dequeue (size: 32): 2.55
MP/MC bulk enq/dequeue (size: 32): 3.04

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 11.70
MP/MC bulk enq/dequeue (size: 8): 13.56
SP/SC bulk enq/dequeue (size: 32): 3.48
MP/MC bulk enq/dequeue (size: 32): 3.95

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 10.86
MP/MC bulk enq/dequeue (size: 8): 11.11
SP/SC bulk enq/dequeue (size: 32): 2.97
MP/MC bulk enq/dequeue (size: 32): 3.43

### Testing using two NUMA nodes ###
SP/SC bulk enq/dequeue (size: 8): 48.07
MP/MC bulk enq/dequeue (size: 8): 67.38
SP/SC bulk enq/dequeue (size: 32): 13.04
MP/MC bulk enq/dequeue (size: 32): 27.10
Test OK
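
To compare the two runs, the relative change of each ring_perf_elem_autotest
figure against the matching ring_perf_autotest figure can be computed as
(elem - orig) / orig. The following is only a minimal sketch; rel_change_pct
is a hypothetical helper name and not part of the DPDK test suite. As an
example, the two-NUMA-node SP/SC size 8 case (63.41 vs 48.07) comes out
roughly 32% higher with the elem APIs in this run.

```c
/*
 * Hypothetical helper (not part of DPDK): relative change of 'value'
 * against 'base', expressed in percent. A positive result means 'value'
 * costs more cycles than 'base'.
 */
static double
rel_change_pct(double base, double value)
{
	return (value - base) / base * 100.0;
}
```

Calling rel_change_pct(48.07, 63.41) gives the change for the case above;
the same call can be repeated for every line of the two reports.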

Dave


* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-18 16:44                           ` Ananyev, Konstantin
@ 2019-10-18 19:03                             ` Honnappa Nagarahalli
  2019-10-21  0:36                             ` Honnappa Nagarahalli
  1 sibling, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-18 19:03 UTC (permalink / raw)
  To: Ananyev, Konstantin, Jerin Jacob
  Cc: David Christensen, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula, dev, Dharmik Thakkar,
	Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd, Honnappa Nagarahalli, nd

<snip>

> Subject: RE: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable
> element size
> 
> 
> Hi everyone,
> 
> 
> > > > >>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the
> > > > >>> results are as
> > > > >> follows. The numbers in brackets are with the code on master.
> > > > >>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> > > > >>>
> > > > >>> RTE>>ring_perf_elem_autotest
> > > > >>> ### Testing single element and burst enq/deq ###
> > > > >>> SP/SC single enq/dequeue: 5
> > > > >>> MP/MC single enq/dequeue: 40 (35)
> > > > >>> SP/SC burst enq/dequeue (size: 8): 2
> > > > >>> MP/MC burst enq/dequeue (size: 8): 6
> > > > >>> SP/SC burst enq/dequeue (size: 32): 1 (2)
> > > > >>> MP/MC burst enq/dequeue (size: 32): 2
> > > > >>>
> > > > >>> ### Testing empty dequeue ###
> > > > >>> SC empty dequeue: 2.11
> > > > >>> MC empty dequeue: 1.41 (2.11)
> > > > >>>
> > > > >>> ### Testing using a single lcore ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> > > > >>>
> > > > >>> ### Testing using two physical cores ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> > > > >>>
> > > > >>> ### Testing using two NUMA nodes ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> > > > >>>
> > > > >>> On one of the Arm platforms
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)
> > > >
> > > > Tried this on a Power9 platform (3.6GHz), with two numa nodes and
> > > > 16 cores/node (SMT=4).  Applied all 3 patches in v5, test results
> > > > are as
> > > > follows:
> > > >
> > > > RTE>>ring_perf_elem_autotest
> > > > ### Testing single element and burst enq/deq ###
> > > > SP/SC single enq/dequeue: 42
> > > > MP/MC single enq/dequeue: 59
> > > > SP/SC burst enq/dequeue (size: 8): 5
> > > > MP/MC burst enq/dequeue (size: 8): 7
> > > > SP/SC burst enq/dequeue (size: 32): 2
> > > > MP/MC burst enq/dequeue (size: 32): 2
> > > >
> > > > ### Testing empty dequeue ###
> > > > SC empty dequeue: 7.81
> > > > MC empty dequeue: 7.81
> > > >
> > > > ### Testing using a single lcore ###
> > > > SP/SC bulk enq/dequeue (size: 8): 5.76
> > > > MP/MC bulk enq/dequeue (size: 8): 7.66
> > > > SP/SC bulk enq/dequeue (size: 32): 2.10
> > > > MP/MC bulk enq/dequeue (size: 32): 2.57
> > > >
> > > > ### Testing using two hyperthreads ###
> > > > SP/SC bulk enq/dequeue (size: 8): 13.13
> > > > MP/MC bulk enq/dequeue (size: 8): 13.98
> > > > SP/SC bulk enq/dequeue (size: 32): 3.41
> > > > MP/MC bulk enq/dequeue (size: 32): 4.45
> > > >
> > > > ### Testing using two physical cores ###
> > > > SP/SC bulk enq/dequeue (size: 8): 11.00
> > > > MP/MC bulk enq/dequeue (size: 8): 10.95
> > > > SP/SC bulk enq/dequeue (size: 32): 3.08
> > > > MP/MC bulk enq/dequeue (size: 32): 3.40
> > > >
> > > > ### Testing using two NUMA nodes ###
> > > > SP/SC bulk enq/dequeue (size: 8): 63.41
> > > > MP/MC bulk enq/dequeue (size: 8): 62.70
> > > > SP/SC bulk enq/dequeue (size: 32): 15.39
> > > > MP/MC bulk enq/dequeue (size: 32): 22.96
> > > >
> > > Thanks for running this. There is another test 'ring_perf_autotest'
> > > which provides the numbers with the original implementation. The
> > > goal is to make sure the numbers with the original implementation
> > > are the same as these. Can you please run that as well?
> >
> > Honnappa,
> >
> > Your earlier perf report shows cycle counts of less than 1. That
> > is because it is using a 50 or 100MHz clock in EL0.
> > Please check with PMU counter. See "ARM64 profiling" in
> >
> > http://doc.dpdk.org/guides/prog_guide/profile_app.html
> >
> >
> > Here are the octeontx2 values. There is a regression in the two-core
> > cases, as you reported earlier on x86.
> >
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 288
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 61
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 21
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.35
> > MP/MC bulk enq/dequeue (size: 8): 67.36
> > SP/SC bulk enq/dequeue (size: 32): 13.10
> > MP/MC bulk enq/dequeue (size: 32): 21.64
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 75.94
> > MP/MC bulk enq/dequeue (size: 8): 107.66
> > SP/SC bulk enq/dequeue (size: 32): 24.51
> > MP/MC bulk enq/dequeue (size: 32): 33.23
> > Test OK
> > RTE>>
> >
> > ---- after applying v5 of the patch ------
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 289
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 40
> > MP/MC burst enq/dequeue (size: 8): 64
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 22
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 39.73
> > MP/MC bulk enq/dequeue (size: 8): 69.13
> > SP/SC bulk enq/dequeue (size: 32): 13.44
> > MP/MC bulk enq/dequeue (size: 32): 22.00
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 76.02
> > MP/MC bulk enq/dequeue (size: 8): 112.50
> > SP/SC bulk enq/dequeue (size: 32): 24.71
> > MP/MC bulk enq/dequeue (size: 32): 33.34
> > Test OK
> > RTE>>
> >
> > RTE>>ring_perf_elem_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 290
> > MP/MC single enq/dequeue: 503
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 63
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 19
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.92
> > MP/MC bulk enq/dequeue (size: 8): 62.54
> > SP/SC bulk enq/dequeue (size: 32): 11.46
> > MP/MC bulk enq/dequeue (size: 32): 19.89
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 87.55
> > MP/MC bulk enq/dequeue (size: 8): 99.10
> > SP/SC bulk enq/dequeue (size: 32): 26.63
> > MP/MC bulk enq/dequeue (size: 32): 29.91
> > Test OK
> > RTE>>
> >
> 
> As far as I can see, there is a copy&paste bug in patch #3 (that is probably
> why it produced some weird numbers for me at first).
Apologies for this. In hindsight, I should have added the unit tests.

> After fix applied (see patch below), things look pretty good on my box.
> As far as I can see, there are only 3 results noticeably lower:
>    SP/SC (size=8) over 2 physical cores same numa socket
>    MP/MC (size=8) over 2 physical cores on different numa sockets.
Is this ok for you?

> All others seem about the same or better.
> Anyway, I went ahead and reworked the code a bit (as I suggested before) to
> get rid of these huge ENQUEUE/DEQUEUE macros.
> Results are very close to the fixed patch #3 version (patch is also attached).
> Though I suggest people hold off on re-running perf tests until we make the
> ring functional test run for the _elem_ functions too.
> I started to work on that, but I am not sure I'll finish today (most likely Monday).
> Perf results from my box, plus patches below.
> Konstantin
> 
> perf results
> ==========
> 
> Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> 
> A - ring_perf_autotest
> B - ring_perf_elem_autotest + patch #3 + fix
> C - B + update
> 
> ### Testing using a single lcore ###	A	B	C
> SP/SC bulk enq/dequeue (size: 8): 	4.06	3.06	3.22
> MP/MC bulk enq/dequeue (size: 8): 	10.05	9.04	9.38
> SP/SC bulk enq/dequeue (size: 32): 	2.93	1.91	1.84
> MP/MC bulk enq/dequeue (size: 32): 	4.12	3.39	3.35
> 
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 	9.24	8.92	8.89
> MP/MC bulk enq/dequeue (size: 8): 	15.47	15.39	16.02
> SP/SC bulk enq/dequeue (size: 32): 	5.78	3.87	3.86
> MP/MC bulk enq/dequeue (size: 32): 	6.41	4.57	4.45
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 	24.14	29.89	27.05
> MP/MC bulk enq/dequeue (size: 8): 	68.61	70.55	69.85
> SP/SC bulk enq/dequeue (size: 32): 	12.11	12.99	13.04
> MP/MC bulk enq/dequeue (size: 32): 	22.14	17.86	18.25
> 
> ### Testing using two NUMA nodes ###
> SP/SC bulk enq/dequeue (size: 8): 	48.78	31.98	33.57
> MP/MC bulk enq/dequeue (size: 8): 	167.53	197.29	192.13
> SP/SC bulk enq/dequeue (size: 32): 	31.28	21.68	21.61
> MP/MC bulk enq/dequeue (size: 32): 	53.45	49.94	48.81
> 
> fix patch
> =======
> 
> From a2be5a9b136333a56d466ef042c655e522ca7012 Mon Sep 17 00:00:00
> 2001
> From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Date: Fri, 18 Oct 2019 15:50:43 +0100
> Subject: [PATCH] fix1
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_ring/rte_ring_elem.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> index 92e92f150..5e1819069 100644
> --- a/lib/librte_ring/rte_ring_elem.h
> +++ b/lib/librte_ring/rte_ring_elem.h
> @@ -118,7 +118,7 @@ struct rte_ring *rte_ring_create_elem(const char
> *name, unsigned count,
>         uint32_t sz = n * (esize / sizeof(uint32_t)); \
>         if (likely(idx + n < size)) { \
>                 for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -                       memcpy (ring + i, obj + i, 8 * sizeof (uint32_t)); \
> +                       memcpy (ring + idx, obj + i, 8 * sizeof (uint32_t)); \
>                 } \
>                 switch (n & 0x7) { \
>                 case 7: \
> @@ -153,7 +153,7 @@ struct rte_ring *rte_ring_create_elem(const char
> *name, unsigned count,
>         uint32_t sz = n * (esize / sizeof(uint32_t)); \
>         if (likely(idx + n < size)) { \
>                 for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -                       memcpy (obj + i, ring + i, 8 * sizeof (uint32_t)); \
> +                       memcpy (obj + i, ring + idx, 8 * sizeof (uint32_t)); \
>                 } \
>                 switch (n & 0x7) { \
>                 case 7: \
> --
> 2.17.1
> 
> update patch (remove macros)
> =========================
> 
> From 18b388e877b97e243f807f27a323e876b30869dd Mon Sep 17 00:00:00
> 2001
> From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Date: Fri, 18 Oct 2019 17:35:43 +0100
> Subject: [PATCH] update1
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_ring/rte_ring_elem.h | 141 ++++++++++++++++----------------
>  1 file changed, 70 insertions(+), 71 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> index 5e1819069..eb706b12f 100644
> --- a/lib/librte_ring/rte_ring_elem.h
> +++ b/lib/librte_ring/rte_ring_elem.h
> @@ -109,75 +109,74 @@ __rte_experimental  struct rte_ring
> *rte_ring_create_elem(const char *name, unsigned count,
>                                 unsigned esize, int socket_id, unsigned flags);
> 
> -#define ENQUEUE_PTRS_GEN(r, ring_start, prod_head, obj_table, esize, n)
> do { \
> -       unsigned int i; \
> -       const uint32_t size = (r)->size; \
> -       uint32_t idx = prod_head & (r)->mask; \
> -       uint32_t *ring = (uint32_t *)ring_start; \
> -       uint32_t *obj = (uint32_t *)obj_table; \
> -       uint32_t sz = n * (esize / sizeof(uint32_t)); \
> -       if (likely(idx + n < size)) { \
> -               for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -                       memcpy (ring + idx, obj + i, 8 * sizeof (uint32_t)); \
> -               } \
> -               switch (n & 0x7) { \
> -               case 7: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 6: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 5: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 4: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 3: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 2: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 1: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               } \
> -       } else { \
> -               for (i = 0; idx < size; i++, idx++)\
> -                       ring[idx] = obj[i]; \
> -               for (idx = 0; i < n; i++, idx++) \
> -                       ring[idx] = obj[i]; \
> -       } \
> -} while (0)
> -
> -#define DEQUEUE_PTRS_GEN(r, ring_start, cons_head, obj_table, esize, n)
> do { \
> -       unsigned int i; \
> -       uint32_t idx = cons_head & (r)->mask; \
> -       const uint32_t size = (r)->size; \
> -       uint32_t *ring = (uint32_t *)ring_start; \
> -       uint32_t *obj = (uint32_t *)obj_table; \
> -       uint32_t sz = n * (esize / sizeof(uint32_t)); \
> -       if (likely(idx + n < size)) { \
> -               for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -                       memcpy (obj + i, ring + idx, 8 * sizeof (uint32_t)); \
> -               } \
> -               switch (n & 0x7) { \
> -               case 7: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 6: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 5: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 4: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 3: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 2: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 1: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               } \
> -       } else { \
> -               for (i = 0; idx < size; i++, idx++) \
> -                       obj[i] = ring[idx]; \
> -               for (idx = 0; i < n; i++, idx++) \
> -                       obj[i] = ring[idx]; \
> -       } \
> -} while (0)
> +static __rte_always_inline void
> +copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
> +uint32_t esize) {
> +       uint32_t i, sz;
> +
> +       sz = (num * esize) / sizeof(uint32_t);
> +
> +       for (i = 0; i < (sz & ~7); i += 8)
> +               memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> +
> +       switch (sz & 7) {
> +       case 7: du32[sz - 7] = su32[sz - 7]; /* fallthrough */
> +       case 6: du32[sz - 6] = su32[sz - 6]; /* fallthrough */
> +       case 5: du32[sz - 5] = su32[sz - 5]; /* fallthrough */
> +       case 4: du32[sz - 4] = su32[sz - 4]; /* fallthrough */
> +       case 3: du32[sz - 3] = su32[sz - 3]; /* fallthrough */
> +       case 2: du32[sz - 2] = su32[sz - 2]; /* fallthrough */
> +       case 1: du32[sz - 1] = su32[sz - 1]; /* fallthrough */
> +       }
> +}
> +
> +static __rte_always_inline void
> +enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
> +               void *obj_table, uint32_t num, uint32_t esize) {
> +       uint32_t idx, n;
> +       uint32_t *du32;
> +       const uint32_t *su32;
> +
> +       const uint32_t size = r->size;
> +
> +       idx = prod_head & (r)->mask;
> +
> +       du32 = (uint32_t *)ring_start + idx;
> +       su32 = obj_table;
> +
> +       if (idx + num < size)
> +               copy_elems(du32, su32, num, esize);
> +       else {
> +               n = size - idx;
> +               copy_elems(du32, su32, n, esize);
> +               copy_elems(ring_start, su32 + n, num - n, esize);
> +       }
> +}
> +
> +static __rte_always_inline void
> +dequeue_elems(struct rte_ring *r, void *ring_start, uint32_t cons_head,
> +               void *obj_table, uint32_t num, uint32_t esize) {
> +       uint32_t idx, n;
> +       uint32_t *du32;
> +       const uint32_t *su32;
> +
> +       const uint32_t size = r->size;
> +
> +       idx = cons_head & (r)->mask;
> +
> +       su32 = (uint32_t *)ring_start + idx;
> +       du32 = obj_table;
> +
> +       if (idx + num < size)
> +               copy_elems(du32, su32, num, esize);
> +       else {
> +               n = size - idx;
> +               copy_elems(du32, su32, n, esize);
> +               copy_elems(du32 + n, ring_start, num - n, esize);
> +       }
> +}
> 
>  /* Between load and load. there might be cpu reorder in weak model
>   * (powerpc/arm).
> @@ -232,7 +231,7 @@ __rte_ring_do_enqueue_elem(struct rte_ring *r, void
> * const obj_table,
>         if (n == 0)
>                 goto end;
> 
> -       ENQUEUE_PTRS_GEN(r, &r[1], prod_head, obj_table, esize, n);
> +       enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);
> 
>         update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
>  end:
> @@ -279,7 +278,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void
> *obj_table,
>         if (n == 0)
>                 goto end;
> 
> -       DEQUEUE_PTRS_GEN(r, &r[1], cons_head, obj_table, esize, n);
> +       dequeue_elems(r, &r[1], cons_head, obj_table, n, esize);
> 
>         update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
> 
> --
> 2.17.1
> 
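
The wrap-around copy that enqueue_elems performs in the quoted update patch
can be sketched on its own as below. This is only an illustration under
stated assumptions, not the code from the patch: sketch_enqueue_elems and its
parameters are hypothetical names, the ring storage is passed in directly
instead of through struct rte_ring, esize is assumed to be a multiple of 4,
idx < size and num <= size. The sketch scales the slot index by the number of
32-bit words per element (esize / 4) before doing pointer arithmetic; whether
the quoted patch needs the same scaling for esize > 4 is my reading and is
not confirmed here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical sketch: copy 'num' elements of 'esize' bytes each from 'src'
 * into a ring of 'size' slots, starting at slot 'idx' and wrapping to slot 0
 * when the end of the ring storage is reached.
 */
static void
sketch_enqueue_elems(uint32_t *ring, uint32_t size, uint32_t idx,
		const uint32_t *src, uint32_t num, uint32_t esize)
{
	/* number of 32-bit words that make up one element */
	const uint32_t words = esize / sizeof(uint32_t);
	/* number of free slots before the end of the ring storage */
	const uint32_t first = size - idx;

	if (num <= first) {
		/* no wrap-around: one contiguous copy */
		memcpy(ring + (size_t)idx * words, src, (size_t)num * esize);
	} else {
		/* fill up to the end of the ring, then continue at slot 0 */
		memcpy(ring + (size_t)idx * words, src, (size_t)first * esize);
		memcpy(ring, src + (size_t)first * words,
			(size_t)(num - first) * esize);
	}
}
```

The dequeue direction would be the mirror image, with the ring as the source
and the object table as the destination.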



* [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (10 preceding siblings ...)
  2019-10-17 20:08   ` [dpdk-dev] [PATCH v5 0/3] lib/ring: APIs to support custom element size Honnappa Nagarahalli
@ 2019-10-21  0:22   ` Honnappa Nagarahalli
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 1/6] test/ring: use division for cycle count calculation Honnappa Nagarahalli
                       ` (6 more replies)
  2019-12-20  4:45   ` [dpdk-dev] [PATCH v7 00/17] " Honnappa Nagarahalli
                     ` (3 subsequent siblings)
  15 siblings, 7 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-21  0:22 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

The current rte_ring hard-codes the type of the ring element to 'void *',
hence the size of the element is hard-coded to 32b/64b. Since the ring
element type is not an input to rte_ring APIs, it results in a couple
of issues:

1) If an application needs to store an element that is not 64b, it
   needs to write its own ring APIs similar to rte_event_ring APIs. This
   creates an additional burden on the programmers, who end up making
   work-arounds and often waste memory.
2) If there are multiple libraries that store elements of the same
   type, currently they would have to write their own rte_ring APIs. This
   results in code duplication.

This patch adds new APIs to support configurable ring element size.
The APIs support custom element sizes by allowing the ring element
to be defined as a multiple of 32b.

The aim is to achieve the same performance as the existing ring
implementation. The patch adds the same performance tests that are run
for the existing APIs. This allows for performance comparison.

I also tested with memcpy. x86 shows significant improvements on bulk
and burst tests. On the Arm platform I used, there is a drop of
4% to 6% in a few tests. Maybe this is something that we can explore
later.

Note that this version skips changes to other libraries as I would
like to get an agreement on the implementation from the community.
They will be added once there is agreement on the rte_ring changes.

v6
 - Labelled as RFC to better indicate the status
 - Added unit tests to test the rte_ring_xxx_elem APIs
 - Corrected 'macro based partial memcpy' (5/6) patch
 - Added Konstantin's method after correction (6/6)
 - Check Patch shows significant warnings and errors, mainly due to
   copying code from existing test cases. None of them are harmful.
   I will fix them once we have an agreement.

v5
 - Use memcpy for chunks of 32B (Konstantin).
 - Both 'ring_perf_autotest' and 'ring_perf_elem_autotest' are available
   to compare the results easily.
 - Copying without memcpy is also available in 1/3, if anyone wants to
   experiment on their platform.
 - Added other platform owners to test on their respective platforms.

v4
 - A few fixes after more performance testing

v3
 - Removed macro-fest and used inline functions
   (Stephen, Bruce)

v2
 - Change Event Ring implementation to use ring templates
   (Jerin, Pavan)

Honnappa Nagarahalli (6):
  test/ring: use division for cycle count calculation
  lib/ring: apis to support configurable element size
  test/ring: add functional tests for configurable element size ring
  test/ring: add perf tests for configurable element size ring
  lib/ring: copy ring elements using memcpy partially
  lib/ring: improved copy function to copy ring elements

 app/test/Makefile                    |   2 +
 app/test/meson.build                 |   2 +
 app/test/test_ring_elem.c            | 859 +++++++++++++++++++++++++++
 app/test/test_ring_perf.c            |  22 +-
 app/test/test_ring_perf_elem.c       | 419 +++++++++++++
 lib/librte_ring/Makefile             |   3 +-
 lib/librte_ring/meson.build          |   4 +
 lib/librte_ring/rte_ring.c           |  34 +-
 lib/librte_ring/rte_ring.h           |   1 +
 lib/librte_ring/rte_ring_elem.h      | 818 +++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   2 +
 11 files changed, 2147 insertions(+), 19 deletions(-)
 create mode 100644 app/test/test_ring_elem.c
 create mode 100644 app/test/test_ring_perf_elem.c
 create mode 100644 lib/librte_ring/rte_ring_elem.h

-- 
2.17.1



* [dpdk-dev] [RFC v6 1/6] test/ring: use division for cycle count calculation
  2019-10-21  0:22   ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
@ 2019-10-21  0:22     ` Honnappa Nagarahalli
  2019-10-23  9:49       ` Olivier Matz
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 2/6] lib/ring: apis to support configurable element size Honnappa Nagarahalli
                       ` (5 subsequent siblings)
  6 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-21  0:22 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Use floating-point division instead of a right shift to calculate a
more accurate cycle count.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
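The difference between the two calculations can be illustrated with a small
standalone sketch. The helper names below are hypothetical and are not part
of test_ring_perf.c; the sketch assumes the iteration count equals
1 << iter_shift, as the old shift-based code requires.

```c
#include <stdint.h>

/*
 * Hypothetical helper: average cycles per iteration using a right shift.
 * The result is an integer, so the fractional part is lost.
 */
static uint64_t
avg_cycles_shift(uint64_t start, uint64_t end, unsigned int iter_shift)
{
	return (end - start) >> iter_shift;
}

/*
 * Hypothetical helper: average cycles per iteration using floating-point
 * division, which keeps the fractional part.
 */
static double
avg_cycles_div(uint64_t start, uint64_t end, uint64_t iterations)
{
	return (double)(end - start) / iterations;
}
```

For 100 elapsed cycles over 8 iterations, the shift version yields 12 while
the division version yields 12.5; the gap matters most for tests that cost
only a few cycles per operation.
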
 app/test/test_ring_perf.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c
index b6ad703bb..e3e17f251 100644
--- a/app/test/test_ring_perf.c
+++ b/app/test/test_ring_perf.c
@@ -284,10 +284,10 @@ test_single_enqueue_dequeue(struct rte_ring *r)
 	}
 	const uint64_t mc_end = rte_rdtsc();
 
-	printf("SP/SC single enq/dequeue: %"PRIu64"\n",
-			(sc_end-sc_start) >> iter_shift);
-	printf("MP/MC single enq/dequeue: %"PRIu64"\n",
-			(mc_end-mc_start) >> iter_shift);
+	printf("SP/SC single enq/dequeue: %.2F\n",
+			((double)(sc_end-sc_start)) / iterations);
+	printf("MP/MC single enq/dequeue: %.2F\n",
+			((double)(mc_end-mc_start)) / iterations);
 }
 
 /*
@@ -322,13 +322,15 @@ test_burst_enqueue_dequeue(struct rte_ring *r)
 		}
 		const uint64_t mc_end = rte_rdtsc();
 
-		uint64_t mc_avg = ((mc_end-mc_start) >> iter_shift) / bulk_sizes[sz];
-		uint64_t sc_avg = ((sc_end-sc_start) >> iter_shift) / bulk_sizes[sz];
+		double mc_avg = ((double)(mc_end-mc_start) / iterations) /
+					bulk_sizes[sz];
+		double sc_avg = ((double)(sc_end-sc_start) / iterations) /
+					bulk_sizes[sz];
 
-		printf("SP/SC burst enq/dequeue (size: %u): %"PRIu64"\n", bulk_sizes[sz],
-				sc_avg);
-		printf("MP/MC burst enq/dequeue (size: %u): %"PRIu64"\n", bulk_sizes[sz],
-				mc_avg);
+		printf("SP/SC burst enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC burst enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], mc_avg);
 	}
 }
 
-- 
2.17.1



* [dpdk-dev] [RFC v6 2/6] lib/ring: apis to support configurable element size
  2019-10-21  0:22   ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 1/6] test/ring: use division for cycle count calculation Honnappa Nagarahalli
@ 2019-10-21  0:22     ` Honnappa Nagarahalli
  2019-10-23  9:59       ` Olivier Matz
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 3/6] test/ring: add functional tests for configurable element size ring Honnappa Nagarahalli
                       ` (4 subsequent siblings)
  6 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-21  0:22 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Current APIs assume ring elements to be pointers. However, in many
use cases, the size can be different. Add new APIs to support
configurable ring element sizes.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
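The size calculation done by rte_ring_get_memsize_elem can be sketched in
isolation as follows. This is a simplified illustration, not the library
code: SKETCH_HDR_SIZE stands in for sizeof(struct rte_ring),
SKETCH_CACHE_LINE for RTE_CACHE_LINE_SIZE, and both values are assumptions.
The upper bound check against RTE_RING_SZ_MASK is omitted, and the sketch
rejects a count of 0, which is my own choice.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>	/* ssize_t */

#define SKETCH_HDR_SIZE   128u	/* assumed stand-in for sizeof(struct rte_ring) */
#define SKETCH_CACHE_LINE 64u	/* assumed cache line size */

/*
 * Hypothetical sketch: memory needed for a ring with 'count' slots of
 * 'esize' bytes each, rounded up to a whole number of cache lines.
 * Returns -EINVAL for an unsupported esize or a non-power-of-2 count.
 */
static ssize_t
sketch_ring_memsize(unsigned int count, unsigned int esize)
{
	size_t sz;

	/* the patch supports element sizes of 4, 8 and 16 bytes */
	if (esize != 4 && esize != 8 && esize != 16)
		return -EINVAL;

	/* count must be a power of 2 */
	if (count == 0 || (count & (count - 1)) != 0)
		return -EINVAL;

	sz = SKETCH_HDR_SIZE + (size_t)count * esize;
	/* round up to the next multiple of the cache line size */
	sz = (sz + SKETCH_CACHE_LINE - 1) & ~((size_t)SKETCH_CACHE_LINE - 1);
	return (ssize_t)sz;
}
```

With these assumed constants, a ring of 1024 slots of 4 bytes needs
128 + 4096 = 4224 bytes, compared to 128 + 8192 = 8320 bytes when every slot
holds a 64-bit pointer.
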
 lib/librte_ring/Makefile             |   3 +-
 lib/librte_ring/meson.build          |   4 +
 lib/librte_ring/rte_ring.c           |  44 +-
 lib/librte_ring/rte_ring.h           |   1 +
 lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |   2 +
 6 files changed, 991 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_elem.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 21a36770d..515a967bb 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_ring.a
 
-CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal
 
 EXPORT_MAP := rte_ring_version.map
@@ -18,6 +18,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+					rte_ring_elem.h \
 					rte_ring_generic.h \
 					rte_ring_c11_mem.h
 
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index ab8b0b469..7ebaba919 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -4,5 +4,9 @@
 version = 2
 sources = files('rte_ring.c')
 headers = files('rte_ring.h',
+		'rte_ring_elem.h',
 		'rte_ring_c11_mem.h',
 		'rte_ring_generic.h')
+
+# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
+allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d9b308036..e95285259 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -33,6 +33,7 @@
 #include <rte_tailq.h>
 
 #include "rte_ring.h"
+#include "rte_ring_elem.h"
 
 TAILQ_HEAD(rte_ring_list, rte_tailq_entry);
 
@@ -46,23 +47,41 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_elem(unsigned count, unsigned esize)
 {
 	ssize_t sz;
 
+	/* Supported esize values are 4/8/16.
+	 * Others can be added on need basis.
+	 */
+	if (esize != 4 && esize != 8 && esize != 16) {
+		RTE_LOG(ERR, RING,
+			"Unsupported esize value. Supported values are 4, 8 and 16\n");
+
+		return -EINVAL;
+	}
+
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
 		RTE_LOG(ERR, RING,
-			"Requested size is invalid, must be power of 2, and "
-			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
+			"Requested number of elements is invalid, must be power of 2, and not exceed %u\n",
+			RTE_RING_SZ_MASK);
+
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	sz = sizeof(struct rte_ring) + count * esize;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
 
+/* return the size of memory occupied by a ring */
+ssize_t
+rte_ring_get_memsize(unsigned count)
+{
+	return rte_ring_get_memsize_elem(count, sizeof(void *));
+}
+
 void
 rte_ring_reset(struct rte_ring *r)
 {
@@ -114,10 +133,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	return 0;
 }
 
-/* create the ring */
+/* create the ring for a given element size */
 struct rte_ring *
-rte_ring_create(const char *name, unsigned count, int socket_id,
-		unsigned flags)
+rte_ring_create_elem(const char *name, unsigned count, unsigned esize,
+		int socket_id, unsigned flags)
 {
 	char mz_name[RTE_MEMZONE_NAMESIZE];
 	struct rte_ring *r;
@@ -135,7 +154,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize_elem(count, esize);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -182,6 +201,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	return r;
 }
 
+/* create the ring */
+struct rte_ring *
+rte_ring_create(const char *name, unsigned count, int socket_id,
+		unsigned flags)
+{
+	return rte_ring_create_elem(name, count, sizeof(void *), socket_id,
+		flags);
+}
+
 /* free the ring */
 void
 rte_ring_free(struct rte_ring *r)
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 2a9f768a1..18fc5d845 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -216,6 +216,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  */
 struct rte_ring *rte_ring_create(const char *name, unsigned count,
 				 int socket_id, unsigned flags);
+
 /**
  * De-allocate all memory used by the ring.
  *
diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
new file mode 100644
index 000000000..7e9914567
--- /dev/null
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -0,0 +1,946 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 2019 Arm Limited
+ * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * All rights reserved.
+ * Derived from FreeBSD's bufring.h
+ * Used as BSD-3 Licensed with permission from Kip Macy.
+ */
+
+#ifndef _RTE_RING_ELEM_H_
+#define _RTE_RING_ELEM_H_
+
+/**
+ * @file
+ * RTE Ring with flexible element size
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#include "rte_ring.h"
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Calculate the memory size needed for a ring with given element size
+ *
+ * This function returns the number of bytes needed for a ring, given
+ * the number of elements in it and the size of the element. This value
+ * is the sum of the size of the structure rte_ring and the size of the
+ * memory needed for storing the elements. The value is aligned to a cache
+ * line size.
+ *
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported.
+ * @return
+ *   - The memory size needed for the ring on success.
+ *   - -EINVAL if count is not a power of 2, or esize is not 4, 8 or 16.
+ */
+__rte_experimental
+ssize_t rte_ring_get_memsize_elem(unsigned int count, unsigned int esize);
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Create a new ring named *name* that stores elements with given size.
+ *
+ * This function uses ``memzone_reserve()`` to allocate memory. Then it
+ * calls rte_ring_init() to initialize an empty ring.
+ *
+ * The new ring size is set to *count*, which must be a power of
+ * two. Water marking is disabled by default. The real usable ring size
+ * is *count-1* instead of *count* to differentiate a full ring from an
+ * empty ring.
+ *
+ * The ring is added in RTE_TAILQ_RING list.
+ *
+ * @param name
+ *   The name of the ring.
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported.
+ * @param socket_id
+ *   The *socket_id* argument is the socket identifier in case of
+ *   NUMA. The value can be *SOCKET_ID_ANY* if there is no NUMA
+ *   constraint for the reserved zone.
+ * @param flags
+ *   An OR of the following:
+ *    - RING_F_SP_ENQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_enqueue()`` or ``rte_ring_enqueue_bulk()``
+ *      is "single-producer". Otherwise, it is "multi-producers".
+ *    - RING_F_SC_DEQ: If this flag is set, the default behavior when
+ *      using ``rte_ring_dequeue()`` or ``rte_ring_dequeue_bulk()``
+ *      is "single-consumer". Otherwise, it is "multi-consumers".
+ * @return
+ *   On success, the pointer to the new allocated ring. NULL on error with
+ *    rte_errno set appropriately. Possible errno values include:
+ *    - E_RTE_NO_CONFIG - function could not get pointer to rte_config structure
+ *    - E_RTE_SECONDARY - function was called from a secondary process instance
+ *    - EINVAL - count is not a power of 2, or esize is not supported
+ *    - ENOSPC - the maximum number of memzones has already been allocated
+ *    - EEXIST - a memzone with the same name already exists
+ *    - ENOMEM - no appropriate memory area found in which to create memzone
+ */
+__rte_experimental
+struct rte_ring *rte_ring_create_elem(const char *name, unsigned int count,
+			unsigned int esize, int socket_id, unsigned int flags);
+
+/* the actual enqueue of elements on the ring.
+ * Placed here since identical code needed in both
+ * single and multi producer enqueue functions.
+ */
+#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
+	if (esize == 4) \
+		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
+	else if (esize == 8) \
+		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
+	else if (esize == 16) \
+		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
+} while (0)
+
+#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	uint32_t *ring = (uint32_t *)ring_start; \
+	uint32_t *obj = (uint32_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(uint32_t)0x7))); i += 8, idx += 8) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+			ring[idx + 2] = obj[i + 2]; \
+			ring[idx + 3] = obj[i + 3]; \
+			ring[idx + 4] = obj[i + 4]; \
+			ring[idx + 5] = obj[i + 5]; \
+			ring[idx + 6] = obj[i + 6]; \
+			ring[idx + 7] = obj[i + 7]; \
+		} \
+		switch (n & 0x7) { \
+		case 7: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 6: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 5: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 4: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 3: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	uint64_t *ring = (uint64_t *)ring_start; \
+	uint64_t *obj = (uint64_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & ((~(uint32_t)0x3))); i += 4, idx += 4) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+			ring[idx + 2] = obj[i + 2]; \
+			ring[idx + 3] = obj[i + 3]; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 2: \
+			ring[idx++] = obj[i++]; /* fallthrough */ \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
+	unsigned int i; \
+	const uint32_t size = (r)->size; \
+	uint32_t idx = prod_head & (r)->mask; \
+	__uint128_t *ring = (__uint128_t *)ring_start; \
+	__uint128_t *obj = (__uint128_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(uint32_t)0x1)); i += 2, idx += 2) { \
+			ring[idx] = obj[i]; \
+			ring[idx + 1] = obj[i + 1]; \
+		} \
+		switch (n & 0x1) { \
+		case 1: \
+			ring[idx++] = obj[i++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++)\
+			ring[idx] = obj[i]; \
+		for (idx = 0; i < n; i++, idx++) \
+			ring[idx] = obj[i]; \
+	} \
+} while (0)
+
+/* the actual copy of elements from the ring to obj_table.
+ * Placed here since identical code needed in both
+ * single and multi consumer dequeue functions.
+ */
+#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
+	if (esize == 4) \
+		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
+	else if (esize == 8) \
+		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
+	else if (esize == 16) \
+		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
+} while (0)
+
+#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	uint32_t *ring = (uint32_t *)ring_start; \
+	uint32_t *obj = (uint32_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(uint32_t)0x7)); i += 8, idx += 8) {\
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+			obj[i + 2] = ring[idx + 2]; \
+			obj[i + 3] = ring[idx + 3]; \
+			obj[i + 4] = ring[idx + 4]; \
+			obj[i + 5] = ring[idx + 5]; \
+			obj[i + 6] = ring[idx + 6]; \
+			obj[i + 7] = ring[idx + 7]; \
+		} \
+		switch (n & 0x7) { \
+		case 7: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 6: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 5: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 4: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 3: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 2: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 1: \
+			obj[i++] = ring[idx++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	uint64_t *ring = (uint64_t *)ring_start; \
+	uint64_t *obj = (uint64_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(uint32_t)0x3)); i += 4, idx += 4) {\
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+			obj[i + 2] = ring[idx + 2]; \
+			obj[i + 3] = ring[idx + 3]; \
+		} \
+		switch (n & 0x3) { \
+		case 3: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 2: \
+			obj[i++] = ring[idx++]; /* fallthrough */ \
+		case 1: \
+			obj[i++] = ring[idx++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
+	unsigned int i; \
+	uint32_t idx = cons_head & (r)->mask; \
+	const uint32_t size = (r)->size; \
+	__uint128_t *ring = (__uint128_t *)ring_start; \
+	__uint128_t *obj = (__uint128_t *)obj_table; \
+	if (likely(idx + n < size)) { \
+		for (i = 0; i < (n & (~(uint32_t)0x1)); i += 2, idx += 2) { \
+			obj[i] = ring[idx]; \
+			obj[i + 1] = ring[idx + 1]; \
+		} \
+		switch (n & 0x1) { \
+		case 1: \
+			obj[i++] = ring[idx++]; \
+		} \
+	} else { \
+		for (i = 0; idx < size; i++, idx++) \
+			obj[i] = ring[idx]; \
+		for (idx = 0; i < n; i++, idx++) \
+			obj[i] = ring[idx]; \
+	} \
+} while (0)
+
+/* Between two loads, the CPU may reorder accesses on weakly ordered
+ * architectures (powerpc/arm).
+ * There are 2 choices for the users:
+ * 1. use the rmb() memory barrier
+ * 2. use the one-direction load_acquire/store_release barrier, enabled by
+ * CONFIG_RTE_USE_C11_MEM_MODEL=y
+ * Which one to use depends on performance test results.
+ * By default, the common functions come from rte_ring_generic.h
+ */
+#ifdef RTE_USE_C11_MEM_MODEL
+#include "rte_ring_c11_mem.h"
+#else
+#include "rte_ring_generic.h"
+#endif
+
+/**
+ * @internal Enqueue several objects on the ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Enqueue a fixed number of items on the ring
+ *   RTE_RING_QUEUE_VARIABLE: Enqueue as many items as possible on the ring
+ * @param is_sp
+ *   Indicates whether to use single producer or multi-producer head update
+ * @param free_space
+ *   returns the amount of space after the enqueue operation has finished
+ * @return
+ *   Actual number of objects enqueued.
+ *   If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sp,
+		unsigned int *free_space)
+{
+	uint32_t prod_head, prod_next;
+	uint32_t free_entries;
+
+	n = __rte_ring_move_prod_head(r, is_sp, n, behavior,
+			&prod_head, &prod_next, &free_entries);
+	if (n == 0)
+		goto end;
+
+	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
+
+	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
+end:
+	if (free_space != NULL)
+		*free_space = free_entries - n;
+	return n;
+}
+
+/**
+ * @internal Dequeue several objects from the ring
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to pull from the ring.
+ * @param behavior
+ *   RTE_RING_QUEUE_FIXED:    Dequeue a fixed number of items from a ring
+ *   RTE_RING_QUEUE_VARIABLE: Dequeue as many items as possible from ring
+ * @param is_sc
+ *   Indicates whether to use single consumer or multi-consumer head update
+ * @param available
+ *   returns the number of remaining ring entries after the dequeue has finished
+ * @return
+ *   - Actual number of objects dequeued.
+ *     If behavior == RTE_RING_QUEUE_FIXED, this will be 0 or n only.
+ */
+static __rte_always_inline unsigned int
+__rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n,
+		enum rte_ring_queue_behavior behavior, unsigned int is_sc,
+		unsigned int *available)
+{
+	uint32_t cons_head, cons_next;
+	uint32_t entries;
+
+	n = __rte_ring_move_cons_head(r, (int)is_sc, n, behavior,
+			&cons_head, &cons_next, &entries);
+	if (n == 0)
+		goto end;
+
+	DEQUEUE_PTRS_ELEM(r, &r[1], cons_head, obj_table, esize, n);
+
+	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
+
+end:
+	if (available != NULL)
+		*available = entries - n;
+	return n;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_mp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_sp_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version depending on the default behavior that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   The number of objects enqueued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_enqueue_bulk_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, r->prod.single, free_space);
+}
+
+/**
+ * Enqueue one object on a ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_mp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_mp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_sp_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_sp_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Enqueue one object on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj
+ *   A pointer to the object to be added.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects enqueued.
+ *   - -ENOBUFS: Not enough room in the ring to enqueue; no object is enqueued.
+ */
+static __rte_always_inline int
+rte_ring_enqueue_elem(struct rte_ring *r, void *obj, unsigned int esize)
+{
+	return rte_ring_enqueue_bulk_elem(r, obj, esize, 1, NULL) ? 0 :
+								-ENOBUFS;
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_mc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+				RTE_RING_QUEUE_FIXED, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table,
+ *   must be strictly positive.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_sc_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, __IS_SC, available);
+}
+
+/**
+ * Dequeue several objects from a ring.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   The number of objects dequeued, either 0 or n
+ */
+static __rte_always_inline unsigned int
+rte_ring_dequeue_bulk_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_FIXED, r->cons.single, available);
+}
+
+/**
+ * Dequeue one object from a ring (multi-consumers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to the object (esize bytes) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue; no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_mc_dequeue_elem(struct rte_ring *r, void *obj_p,
+				unsigned int esize)
+{
+	return rte_ring_mc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL)  ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring (NOT multi-consumers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to the object (esize bytes) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success; objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_sc_dequeue_elem(struct rte_ring *r, void *obj_p,
+				unsigned int esize)
+{
+	return rte_ring_sc_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Dequeue one object from a ring.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_p
+ *   A pointer to the object (esize bytes) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @return
+ *   - 0: Success, objects dequeued.
+ *   - -ENOENT: Not enough entries in the ring to dequeue, no object is
+ *     dequeued.
+ */
+static __rte_always_inline int
+rte_ring_dequeue_elem(struct rte_ring *r, void *obj_p, unsigned int esize)
+{
+	return rte_ring_dequeue_bulk_elem(r, obj_p, esize, 1, NULL) ? 0 :
+								-ENOENT;
+}
+
+/**
+ * Enqueue several objects on the ring (multi-producers safe).
+ *
+ * This function uses a "compare and set" instruction to move the
+ * producer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_mp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring (NOT multi-producers safe).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_sp_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SP, free_space);
+}
+
+/**
+ * Enqueue several objects on a ring.
+ *
+ * This function calls the multi-producer or the single-producer
+ * version depending on the default behavior that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each).
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to add in the ring from the obj_table.
+ * @param free_space
+ *   if non-NULL, returns the amount of space in the ring after the
+ *   enqueue operation has finished.
+ * @return
+ *   - n: Actual number of objects enqueued.
+ */
+static __rte_always_inline unsigned
+rte_ring_enqueue_burst_elem(struct rte_ring *r, void * const obj_table,
+		unsigned int esize, unsigned int n, unsigned int *free_space)
+{
+	return __rte_ring_do_enqueue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, r->prod.single, free_space);
+}
+
+/**
+ * Dequeue several objects from a ring (multi-consumers safe). When more
+ * objects are requested than are available, only the available objects
+ * are dequeued.
+ *
+ * This function uses a "compare and set" instruction to move the
+ * consumer index atomically.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects (esize bytes each) that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - n: Actual number of objects dequeued, 0 if ring is empty
+ */
+static __rte_always_inline unsigned
+rte_ring_mc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_MC, available);
+}
+
+/**
+ * Dequeue several objects from a ring (NOT multi-consumers safe). When more
+ * objects are requested than are available, only the available objects
+ * are dequeued.
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes long, that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - n: Actual number of objects dequeued, 0 if ring is empty
+ */
+static __rte_always_inline unsigned
+rte_ring_sc_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+			RTE_RING_QUEUE_VARIABLE, __IS_SC, available);
+}
+
+/**
+ * Dequeue multiple objects from a ring up to a maximum number.
+ *
+ * This function calls the multi-consumers or the single-consumer
+ * version, depending on the default behaviour that was specified at
+ * ring creation time (see flags).
+ *
+ * @param r
+ *   A pointer to the ring structure.
+ * @param obj_table
+ *   A pointer to a table of objects, each esize bytes long, that will be filled.
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ *   Currently, sizes 4, 8 and 16 are supported. This should be the same
+ *   as passed while creating the ring, otherwise the results are undefined.
+ * @param n
+ *   The number of objects to dequeue from the ring to the obj_table.
+ * @param available
+ *   If non-NULL, returns the number of remaining ring entries after the
+ *   dequeue has finished.
+ * @return
+ *   - Number of objects dequeued
+ */
+static __rte_always_inline unsigned
+rte_ring_dequeue_burst_elem(struct rte_ring *r, void *obj_table,
+		unsigned int esize, unsigned int n, unsigned int *available)
+{
+	return __rte_ring_do_dequeue_elem(r, obj_table, esize, n,
+				RTE_RING_QUEUE_VARIABLE,
+				r->cons.single, available);
+}
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_RING_ELEM_H_ */
diff --git a/lib/librte_ring/rte_ring_version.map b/lib/librte_ring/rte_ring_version.map
index 510c1386e..e410a7503 100644
--- a/lib/librte_ring/rte_ring_version.map
+++ b/lib/librte_ring/rte_ring_version.map
@@ -21,6 +21,8 @@ DPDK_2.2 {
 EXPERIMENTAL {
 	global:
 
+	rte_ring_create_elem;
+	rte_ring_get_memsize_elem;
 	rte_ring_reset;
 
 };
-- 
2.17.1



* [dpdk-dev] [RFC v6 3/6] test/ring: add functional tests for configurable element size ring
  2019-10-21  0:22   ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 1/6] test/ring: use division for cycle count calculation Honnappa Nagarahalli
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 2/6] lib/ring: apis to support configurable element size Honnappa Nagarahalli
@ 2019-10-21  0:22     ` Honnappa Nagarahalli
  2019-10-23 10:01       ` Olivier Matz
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 4/6] test/ring: add perf " Honnappa Nagarahalli
                       ` (3 subsequent siblings)
  6 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-21  0:22 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Add functional tests for rte_ring_xxx_elem APIs. At this point these
are derived mainly from existing rte_ring_xxx test cases.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
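Note (not part of the commit message): the tests below rely on three
behaviours of the ring: a ring created with N slots holds at most N - 1
elements, the *_bulk_elem calls move either all n elements or none, and the
*_burst_elem calls move as many as fit. A minimal standalone sketch of those
semantics follows. It is an illustration only and not the rte_ring
implementation: every toy_* name is hypothetical, the ring is
single-threaded, and the SP/MP and SC/MC distinction is ignored.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy model only; all toy_* names are hypothetical, not DPDK APIs. */
#define TOY_RING_SIZE 8u	/* power of two; usable capacity is size - 1 */
#define TOY_MAX_ESIZE 16u	/* largest element size the model accepts */

enum toy_behavior { TOY_FIXED, TOY_VARIABLE };	/* bulk vs. burst */

struct toy_ring {
	uint32_t head;		/* free-running producer index */
	uint32_t tail;		/* free-running consumer index */
	uint32_t esize;		/* element size in bytes */
	uint8_t slots[TOY_RING_SIZE * TOY_MAX_ESIZE];
};

static void
toy_ring_init(struct toy_ring *r, uint32_t esize)
{
	/* mirrors the documented rule: esize must be a multiple of 4 */
	assert(esize != 0 && esize % 4 == 0 && esize <= TOY_MAX_ESIZE);
	memset(r, 0, sizeof(*r));
	r->esize = esize;
}

static uint32_t
toy_ring_count(const struct toy_ring *r)
{
	return r->head - r->tail;	/* unsigned wrap-around is intended */
}

/* Copy up to n elements in; returns the number actually enqueued. */
static uint32_t
toy_enqueue(struct toy_ring *r, const void *objs, uint32_t n,
		enum toy_behavior b)
{
	uint32_t free_slots = (TOY_RING_SIZE - 1) - toy_ring_count(r);
	uint32_t i;

	if (n > free_slots)
		n = (b == TOY_FIXED) ? 0 : free_slots;
	for (i = 0; i < n; i++) {
		uint32_t idx = (r->head + i) & (TOY_RING_SIZE - 1);

		memcpy(&r->slots[idx * r->esize],
			(const uint8_t *)objs + i * r->esize, r->esize);
	}
	r->head += n;
	return n;
}

/* Copy up to n elements out; returns the number actually dequeued. */
static uint32_t
toy_dequeue(struct toy_ring *r, void *objs, uint32_t n, enum toy_behavior b)
{
	uint32_t avail = toy_ring_count(r);
	uint32_t i;

	if (n > avail)
		n = (b == TOY_FIXED) ? 0 : avail;
	for (i = 0; i < n; i++) {
		uint32_t idx = (r->tail + i) & (TOY_RING_SIZE - 1);

		memcpy((uint8_t *)objs + i * r->esize,
			&r->slots[idx * r->esize], r->esize);
	}
	r->tail += n;
	return n;
}
```

With an 8-slot toy ring of 4-byte elements, a TOY_FIXED enqueue of 10
elements moves nothing, while the same request with TOY_VARIABLE moves 7.
This is the pattern the burst tests check with "ret != MAX_BULK - 3" once
the ring is nearly full.
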
 app/test/Makefile         |   1 +
 app/test/meson.build      |   1 +
 app/test/test_ring_elem.c | 859 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 861 insertions(+)
 create mode 100644 app/test/test_ring_elem.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 26ba6fe2b..483865b4a 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -77,6 +77,7 @@ SRCS-y += test_external_mem.c
 SRCS-y += test_rand_perf.c
 
 SRCS-y += test_ring.c
+SRCS-y += test_ring_elem.c
 SRCS-y += test_ring_perf.c
 SRCS-y += test_pmd_perf.c
 
diff --git a/app/test/meson.build b/app/test/meson.build
index ec40943bd..1ca25c00a 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -100,6 +100,7 @@ test_sources = files('commands.c',
 	'test_red.c',
 	'test_reorder.c',
 	'test_ring.c',
+	'test_ring_elem.c',
 	'test_ring_perf.c',
 	'test_rwlock.c',
 	'test_sched.c',
diff --git a/app/test/test_ring_elem.c b/app/test/test_ring_elem.c
new file mode 100644
index 000000000..54ae35a71
--- /dev/null
+++ b/app/test/test_ring_elem.c
@@ -0,0 +1,859 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+#include <string.h>
+#include <stdarg.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <stdint.h>
+#include <inttypes.h>
+#include <errno.h>
+#include <sys/queue.h>
+
+#include <rte_common.h>
+#include <rte_log.h>
+#include <rte_memory.h>
+#include <rte_launch.h>
+#include <rte_cycles.h>
+#include <rte_eal.h>
+#include <rte_per_lcore.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_malloc.h>
+#include <rte_ring.h>
+#include <rte_ring_elem.h>
+#include <rte_random.h>
+#include <rte_errno.h>
+#include <rte_hexdump.h>
+
+#include "test.h"
+
+/*
+ * Ring
+ * ====
+ *
+ * #. Basic tests: done on one core:
+ *
+ *    - Using single producer/single consumer functions:
+ *
+ *      - Enqueue one object, two objects, MAX_BULK objects
+ *      - Dequeue one object, two objects, MAX_BULK objects
+ *      - Check that dequeued pointers are correct
+ *
+ *    - Using multi producers/multi consumers functions:
+ *
+ *      - Enqueue one object, two objects, MAX_BULK objects
+ *      - Dequeue one object, two objects, MAX_BULK objects
+ *      - Check that dequeued pointers are correct
+ *
+ * #. Performance tests.
+ *
+ * Tests done in test_ring_perf.c
+ */
+
+#define RING_SIZE 4096
+#define MAX_BULK 32
+
+static rte_atomic32_t synchro;
+
+#define	TEST_RING_VERIFY(exp)						\
+	if (!(exp)) {							\
+		printf("error at %s:%d\tcondition " #exp " failed\n",	\
+		    __func__, __LINE__);				\
+		rte_ring_dump(stdout, r);				\
+		return -1;						\
+	}
+
+#define	TEST_RING_FULL_EMTPY_ITER	8
+
+/*
+ * helper routine for test_ring_basic
+ */
+static int
+test_ring_basic_full_empty(struct rte_ring *r, void * const src, void *dst)
+{
+	unsigned i, rand;
+	const unsigned rsz = RING_SIZE - 1;
+
+	printf("Basic full/empty test\n");
+
+	for (i = 0; TEST_RING_FULL_EMTPY_ITER != i; i++) {
+
+		/* random shift in the ring */
+		rand = RTE_MAX(rte_rand() % RING_SIZE, 1UL);
+		printf("%s: iteration %u, random shift: %u;\n",
+		    __func__, i, rand);
+		TEST_RING_VERIFY(rte_ring_enqueue_bulk_elem(r, src, 8, rand,
+				NULL) != 0);
+		TEST_RING_VERIFY(rte_ring_dequeue_bulk_elem(r, dst, 8, rand,
+				NULL) == rand);
+
+		/* fill the ring */
+		TEST_RING_VERIFY(rte_ring_enqueue_bulk_elem(r, src, 8, rsz, NULL) != 0);
+		TEST_RING_VERIFY(0 == rte_ring_free_count(r));
+		TEST_RING_VERIFY(rsz == rte_ring_count(r));
+		TEST_RING_VERIFY(rte_ring_full(r));
+		TEST_RING_VERIFY(0 == rte_ring_empty(r));
+
+		/* empty the ring */
+		TEST_RING_VERIFY(rte_ring_dequeue_bulk_elem(r, dst, 8, rsz,
+				NULL) == rsz);
+		TEST_RING_VERIFY(rsz == rte_ring_free_count(r));
+		TEST_RING_VERIFY(0 == rte_ring_count(r));
+		TEST_RING_VERIFY(0 == rte_ring_full(r));
+		TEST_RING_VERIFY(rte_ring_empty(r));
+
+		/* check data */
+		TEST_RING_VERIFY(0 == memcmp(src, dst, rsz * 8));
+		rte_ring_dump(stdout, r);
+	}
+	return 0;
+}
+
+static int
+test_ring_basic(struct rte_ring *r)
+{
+	void **src = NULL, **cur_src = NULL, **dst = NULL, **cur_dst = NULL;
+	int ret;
+	unsigned i, num_elems;
+
+	/* alloc dummy object pointers */
+	src = malloc(RING_SIZE*2*sizeof(void *));
+	if (src == NULL)
+		goto fail;
+
+	for (i = 0; i < RING_SIZE*2 ; i++) {
+		src[i] = (void *)(unsigned long)i;
+	}
+	cur_src = src;
+
+	/* alloc some room for copied objects */
+	dst = malloc(RING_SIZE*2*sizeof(void *));
+	if (dst == NULL)
+		goto fail;
+
+	memset(dst, 0, RING_SIZE*2*sizeof(void *));
+	cur_dst = dst;
+
+	printf("enqueue 1 obj\n");
+	ret = rte_ring_sp_enqueue_bulk_elem(r, cur_src, 8, 1, NULL);
+	cur_src += 1;
+	if (ret == 0)
+		goto fail;
+
+	printf("enqueue 2 objs\n");
+	ret = rte_ring_sp_enqueue_bulk_elem(r, cur_src, 8, 2, NULL);
+	cur_src += 2;
+	if (ret == 0)
+		goto fail;
+
+	printf("enqueue MAX_BULK objs\n");
+	ret = rte_ring_sp_enqueue_bulk_elem(r, cur_src, 8, MAX_BULK, NULL);
+	cur_src += MAX_BULK;
+	if (ret == 0)
+		goto fail;
+
+	printf("dequeue 1 obj\n");
+	ret = rte_ring_sc_dequeue_bulk_elem(r, cur_dst, 8, 1, NULL);
+	cur_dst += 1;
+	if (ret == 0)
+		goto fail;
+
+	printf("dequeue 2 objs\n");
+	ret = rte_ring_sc_dequeue_bulk_elem(r, cur_dst, 8, 2, NULL);
+	cur_dst += 2;
+	if (ret == 0)
+		goto fail;
+
+	printf("dequeue MAX_BULK objs\n");
+	ret = rte_ring_sc_dequeue_bulk_elem(r, cur_dst, 8, MAX_BULK, NULL);
+	cur_dst += MAX_BULK;
+	if (ret == 0)
+		goto fail;
+
+	/* check data */
+	if (memcmp(src, dst, (cur_dst - dst) * sizeof(*dst))) {
+		rte_hexdump(stdout, "src", src, (cur_src - src) * sizeof(*src));
+		rte_hexdump(stdout, "dst", dst, (cur_dst - dst) * sizeof(*dst));
+		printf("data after dequeue is not the same\n");
+		goto fail;
+	}
+	cur_src = src;
+	cur_dst = dst;
+
+	printf("enqueue 1 obj\n");
+	ret = rte_ring_mp_enqueue_bulk_elem(r, cur_src, 8, 1, NULL);
+	cur_src += 1;
+	if (ret == 0)
+		goto fail;
+
+	printf("enqueue 2 objs\n");
+	ret = rte_ring_mp_enqueue_bulk_elem(r, cur_src, 8, 2, NULL);
+	cur_src += 2;
+	if (ret == 0)
+		goto fail;
+
+	printf("enqueue MAX_BULK objs\n");
+	ret = rte_ring_mp_enqueue_bulk_elem(r, cur_src, 8, MAX_BULK, NULL);
+	cur_src += MAX_BULK;
+	if (ret == 0)
+		goto fail;
+
+	printf("dequeue 1 obj\n");
+	ret = rte_ring_mc_dequeue_bulk_elem(r, cur_dst, 8, 1, NULL);
+	cur_dst += 1;
+	if (ret == 0)
+		goto fail;
+
+	printf("dequeue 2 objs\n");
+	ret = rte_ring_mc_dequeue_bulk_elem(r, cur_dst, 8, 2, NULL);
+	cur_dst += 2;
+	if (ret == 0)
+		goto fail;
+
+	printf("dequeue MAX_BULK objs\n");
+	ret = rte_ring_mc_dequeue_bulk_elem(r, cur_dst, 8, MAX_BULK, NULL);
+	cur_dst += MAX_BULK;
+	if (ret == 0)
+		goto fail;
+
+	/* check data */
+	if (memcmp(src, dst, (cur_dst - dst) * sizeof(*dst))) {
+		rte_hexdump(stdout, "src", src, (cur_src - src) * sizeof(*src));
+		rte_hexdump(stdout, "dst", dst, (cur_dst - dst) * sizeof(*dst));
+		printf("data after dequeue is not the same\n");
+		goto fail;
+	}
+	cur_src = src;
+	cur_dst = dst;
+
+	printf("fill and empty the ring\n");
+	for (i = 0; i<RING_SIZE/MAX_BULK; i++) {
+		ret = rte_ring_mp_enqueue_bulk_elem(r, cur_src, 8, MAX_BULK, NULL);
+		cur_src += MAX_BULK;
+		if (ret == 0)
+			goto fail;
+		ret = rte_ring_mc_dequeue_bulk_elem(r, cur_dst, 8, MAX_BULK, NULL);
+		cur_dst += MAX_BULK;
+		if (ret == 0)
+			goto fail;
+	}
+
+	/* check data */
+	if (memcmp(src, dst, (cur_dst - dst) * sizeof(*dst))) {
+		rte_hexdump(stdout, "src", src, (cur_src - src) * sizeof(*src));
+		rte_hexdump(stdout, "dst", dst, (cur_dst - dst) * sizeof(*dst));
+		printf("data after dequeue is not the same\n");
+		goto fail;
+	}
+
+	if (test_ring_basic_full_empty(r, src, dst) != 0)
+		goto fail;
+
+	cur_src = src;
+	cur_dst = dst;
+
+	printf("test default bulk enqueue / dequeue\n");
+	num_elems = 16;
+
+	cur_src = src;
+	cur_dst = dst;
+
+	ret = rte_ring_enqueue_bulk_elem(r, cur_src, 8, num_elems, NULL);
+	cur_src += num_elems;
+	if (ret == 0) {
+		printf("Cannot enqueue\n");
+		goto fail;
+	}
+	ret = rte_ring_enqueue_bulk_elem(r, cur_src, 8, num_elems, NULL);
+	cur_src += num_elems;
+	if (ret == 0) {
+		printf("Cannot enqueue\n");
+		goto fail;
+	}
+	ret = rte_ring_dequeue_bulk_elem(r, cur_dst, 8, num_elems, NULL);
+	cur_dst += num_elems;
+	if (ret == 0) {
+		printf("Cannot dequeue\n");
+		goto fail;
+	}
+	ret = rte_ring_dequeue_bulk_elem(r, cur_dst, 8, num_elems, NULL);
+	cur_dst += num_elems;
+	if (ret == 0) {
+		printf("Cannot dequeue2\n");
+		goto fail;
+	}
+
+	/* check data */
+	if (memcmp(src, dst, (cur_dst - dst) * sizeof(*dst))) {
+		rte_hexdump(stdout, "src", src, (cur_src - src) * sizeof(*src));
+		rte_hexdump(stdout, "dst", dst, (cur_dst - dst) * sizeof(*dst));
+		printf("data after dequeue is not the same\n");
+		goto fail;
+	}
+
+	cur_src = src;
+	cur_dst = dst;
+
+	ret = rte_ring_mp_enqueue_elem(r, cur_src, 8);
+	if (ret != 0)
+		goto fail;
+
+	ret = rte_ring_mc_dequeue_elem(r, cur_dst, 8);
+	if (ret != 0)
+		goto fail;
+
+	free(src);
+	free(dst);
+	return 0;
+
+ fail:
+	free(src);
+	free(dst);
+	return -1;
+}
+
+static int
+test_ring_burst_basic(struct rte_ring *r)
+{
+	void **src = NULL, **cur_src = NULL, **dst = NULL, **cur_dst = NULL;
+	int ret;
+	unsigned i;
+
+	/* alloc dummy object pointers */
+	src = malloc(RING_SIZE*2*sizeof(void *));
+	if (src == NULL)
+		goto fail;
+
+	for (i = 0; i < RING_SIZE*2 ; i++) {
+		src[i] = (void *)(unsigned long)i;
+	}
+	cur_src = src;
+
+	/* alloc some room for copied objects */
+	dst = malloc(RING_SIZE*2*sizeof(void *));
+	if (dst == NULL)
+		goto fail;
+
+	memset(dst, 0, RING_SIZE*2*sizeof(void *));
+	cur_dst = dst;
+
+	printf("Test SP & SC basic functions \n");
+	printf("enqueue 1 obj\n");
+	ret = rte_ring_sp_enqueue_burst_elem(r, cur_src, 8, 1, NULL);
+	cur_src += 1;
+	if (ret != 1)
+		goto fail;
+
+	printf("enqueue 2 objs\n");
+	ret = rte_ring_sp_enqueue_burst_elem(r, cur_src, 8, 2, NULL);
+	cur_src += 2;
+	if (ret != 2)
+		goto fail;
+
+	printf("enqueue MAX_BULK objs\n");
+	ret = rte_ring_sp_enqueue_burst_elem(r, cur_src, 8, MAX_BULK, NULL);
+	cur_src += MAX_BULK;
+	if (ret != MAX_BULK)
+		goto fail;
+
+	printf("dequeue 1 obj\n");
+	ret = rte_ring_sc_dequeue_burst_elem(r, cur_dst, 8, 1, NULL);
+	cur_dst += 1;
+	if (ret != 1)
+		goto fail;
+
+	printf("dequeue 2 objs\n");
+	ret = rte_ring_sc_dequeue_burst_elem(r, cur_dst, 8, 2, NULL);
+	cur_dst += 2;
+	if (ret != 2)
+		goto fail;
+
+	printf("dequeue MAX_BULK objs\n");
+	ret = rte_ring_sc_dequeue_burst_elem(r, cur_dst, 8, MAX_BULK, NULL);
+	cur_dst += MAX_BULK;
+	if (ret != MAX_BULK)
+		goto fail;
+
+	/* check data */
+	if (memcmp(src, dst, (cur_dst - dst) * sizeof(*dst))) {
+		rte_hexdump(stdout, "src", src, (cur_src - src) * sizeof(*src));
+		rte_hexdump(stdout, "dst", dst, (cur_dst - dst) * sizeof(*dst));
+		printf("data after dequeue is not the same\n");
+		goto fail;
+	}
+
+	cur_src = src;
+	cur_dst = dst;
+
+	printf("Test enqueue without enough memory space \n");
+	for (i = 0; i < (RING_SIZE/MAX_BULK - 1); i++) {
+		ret = rte_ring_sp_enqueue_burst_elem(r, cur_src, 8, MAX_BULK, NULL);
+		cur_src += MAX_BULK;
+		if (ret != MAX_BULK)
+			goto fail;
+	}
+
+	printf("Enqueue 2 objects, free entries = MAX_BULK - 2  \n");
+	ret = rte_ring_sp_enqueue_burst_elem(r, cur_src, 8, 2, NULL);
+	cur_src += 2;
+	if (ret != 2)
+		goto fail;
+
+	printf("Enqueue the remaining entries = MAX_BULK - 2  \n");
+	/* Always one free entry left */
+	ret = rte_ring_sp_enqueue_burst_elem(r, cur_src, 8, MAX_BULK, NULL);
+	cur_src += MAX_BULK - 3;
+	if (ret != MAX_BULK - 3)
+		goto fail;
+
+	printf("Test if ring is full  \n");
+	if (rte_ring_full(r) != 1)
+		goto fail;
+
+	printf("Test enqueue for a full entry  \n");
+	ret = rte_ring_sp_enqueue_burst_elem(r, cur_src, 8, MAX_BULK, NULL);
+	if (ret != 0)
+		goto fail;
+
+	printf("Test dequeue without enough objects \n");
+	for (i = 0; i<RING_SIZE/MAX_BULK - 1; i++) {
+		ret = rte_ring_sc_dequeue_burst_elem(r, cur_dst, 8, MAX_BULK, NULL);
+		cur_dst += MAX_BULK;
+		if (ret != MAX_BULK)
+			goto fail;
+	}
+
+	/* Available memory space for the exact MAX_BULK entries */
+	ret = rte_ring_sc_dequeue_burst_elem(r, cur_dst, 8, 2, NULL);
+	cur_dst += 2;
+	if (ret != 2)
+		goto fail;
+
+	ret = rte_ring_sc_dequeue_burst_elem(r, cur_dst, 8, MAX_BULK, NULL);
+	cur_dst += MAX_BULK - 3;
+	if (ret != MAX_BULK - 3)
+		goto fail;
+
+	printf("Test if ring is empty \n");
+	/* Check if ring is empty */
+	if (1 != rte_ring_empty(r))
+		goto fail;
+
+	/* check data */
+	if (memcmp(src, dst, (cur_dst - dst) * sizeof(*dst))) {
+		rte_hexdump(stdout, "src", src, (cur_src - src) * sizeof(*src));
+		rte_hexdump(stdout, "dst", dst, (cur_dst - dst) * sizeof(*dst));
+		printf("data after dequeue is not the same\n");
+		goto fail;
+	}
+
+	cur_src = src;
+	cur_dst = dst;
+
+	printf("Test MP & MC basic functions \n");
+
+	printf("enqueue 1 obj\n");
+	ret = rte_ring_mp_enqueue_burst_elem(r, cur_src, 8, 1, NULL);
+	cur_src += 1;
+	if (ret != 1)
+		goto fail;
+
+	printf("enqueue 2 objs\n");
+	ret = rte_ring_mp_enqueue_burst_elem(r, cur_src, 8, 2, NULL);
+	cur_src += 2;
+	if (ret != 2)
+		goto fail;
+
+	printf("enqueue MAX_BULK objs\n");
+	ret = rte_ring_mp_enqueue_burst_elem(r, cur_src, 8, MAX_BULK, NULL);
+	cur_src += MAX_BULK;
+	if (ret != MAX_BULK)
+		goto fail;
+
+	printf("dequeue 1 obj\n");
+	ret = rte_ring_mc_dequeue_burst_elem(r, cur_dst, 8, 1, NULL);
+	cur_dst += 1;
+	if (ret != 1)
+		goto fail;
+
+	printf("dequeue 2 objs\n");
+	ret = rte_ring_mc_dequeue_burst_elem(r, cur_dst, 8, 2, NULL);
+	cur_dst += 2;
+	if (ret != 2)
+		goto fail;
+
+	printf("dequeue MAX_BULK objs\n");
+	ret = rte_ring_mc_dequeue_burst_elem(r, cur_dst, 8, MAX_BULK, NULL);
+	cur_dst += MAX_BULK;
+	if (ret != MAX_BULK)
+		goto fail;
+
+	/* check data */
+	if (memcmp(src, dst, (cur_dst - dst) * sizeof(*dst))) {
+		rte_hexdump(stdout, "src", src, (cur_src - src) * sizeof(*src));
+		rte_hexdump(stdout, "dst", dst, (cur_dst - dst) * sizeof(*dst));
+		printf("data after dequeue is not the same\n");
+		goto fail;
+	}
+
+	cur_src = src;
+	cur_dst = dst;
+
+	printf("fill and empty the ring\n");
+	for (i = 0; i < RING_SIZE/MAX_BULK; i++) {
+		ret = rte_ring_mp_enqueue_burst_elem(r, cur_src, 8, MAX_BULK, NULL);
+		cur_src += MAX_BULK;
+		if (ret != MAX_BULK)
+			goto fail;
+		ret = rte_ring_mc_dequeue_burst_elem(r, cur_dst, 8, MAX_BULK, NULL);
+		cur_dst += MAX_BULK;
+		if (ret != MAX_BULK)
+			goto fail;
+	}
+
+	/* check data */
+	if (memcmp(src, dst, (cur_dst - dst) * sizeof(*dst))) {
+		rte_hexdump(stdout, "src", src, (cur_src - src) * sizeof(*src));
+		rte_hexdump(stdout, "dst", dst, (cur_dst - dst) * sizeof(*dst));
+		printf("data after dequeue is not the same\n");
+		goto fail;
+	}
+
+	cur_src = src;
+	cur_dst = dst;
+
+	printf("Test enqueue without enough memory space \n");
+	for (i = 0; i < RING_SIZE/MAX_BULK - 1; i++) {
+		ret = rte_ring_mp_enqueue_burst_elem(r, cur_src, 8, MAX_BULK, NULL);
+		cur_src += MAX_BULK;
+		if (ret != MAX_BULK)
+			goto fail;
+	}
+
+	/* Available memory space for the exact MAX_BULK objects */
+	ret = rte_ring_mp_enqueue_burst_elem(r, cur_src, 8, 2, NULL);
+	cur_src += 2;
+	if (ret != 2)
+		goto fail;
+
+	ret = rte_ring_mp_enqueue_burst_elem(r, cur_src, 8, MAX_BULK, NULL);
+	cur_src += MAX_BULK - 3;
+	if (ret != MAX_BULK - 3)
+		goto fail;
+
+
+	printf("Test dequeue without enough objects \n");
+	for (i = 0; i < RING_SIZE/MAX_BULK - 1; i++) {
+		ret = rte_ring_mc_dequeue_burst_elem(r, cur_dst, 8, MAX_BULK, NULL);
+		cur_dst += MAX_BULK;
+		if (ret != MAX_BULK)
+			goto fail;
+	}
+
+	/* Available objects - the exact MAX_BULK */
+	ret = rte_ring_mc_dequeue_burst_elem(r, cur_dst, 8, 2, NULL);
+	cur_dst += 2;
+	if (ret != 2)
+		goto fail;
+
+	ret = rte_ring_mc_dequeue_burst_elem(r, cur_dst, 8, MAX_BULK, NULL);
+	cur_dst += MAX_BULK - 3;
+	if (ret != MAX_BULK - 3)
+		goto fail;
+
+	/* check data */
+	if (memcmp(src, dst, (cur_dst - dst) * sizeof(*dst))) {
+		rte_hexdump(stdout, "src", src, (cur_src - src) * sizeof(*src));
+		rte_hexdump(stdout, "dst", dst, (cur_dst - dst) * sizeof(*dst));
+		printf("data after dequeue is not the same\n");
+		goto fail;
+	}
+
+	cur_src = src;
+	cur_dst = dst;
+
+	printf("Covering rte_ring_enqueue_burst functions \n");
+
+	ret = rte_ring_enqueue_burst_elem(r, cur_src, 8, 2, NULL);
+	cur_src += 2;
+	if (ret != 2)
+		goto fail;
+
+	ret = rte_ring_dequeue_burst_elem(r, cur_dst, 8, 2, NULL);
+	cur_dst += 2;
+	if (ret != 2)
+		goto fail;
+
+	/* Free memory before test completed */
+	free(src);
+	free(dst);
+	return 0;
+
+ fail:
+	free(src);
+	free(dst);
+	return -1;
+}
+
+/*
+ * Verify that creating a ring with an invalid size always fails.
+ */
+static int
+test_ring_creation_with_wrong_size(void)
+{
+	struct rte_ring * rp = NULL;
+
+	/* Test if ring size is not power of 2 */
+	rp = rte_ring_create_elem("test_bad_ring_size", RING_SIZE + 1, 8, SOCKET_ID_ANY, 0);
+	if (NULL != rp) {
+		return -1;
+	}
+
+	/* Test if ring size is exceeding the limit */
+	rp = rte_ring_create_elem("test_bad_ring_size", (RTE_RING_SZ_MASK + 1), 8, SOCKET_ID_ANY, 0);
+	if (NULL != rp) {
+		return -1;
+	}
+	return 0;
+}
+
+/*
+ * Verify that creating a ring with an already-used name always fails.
+ */
+static int
+test_ring_creation_with_an_used_name(void)
+{
+	struct rte_ring * rp;
+
+	rp = rte_ring_create_elem("test", RING_SIZE, 8, SOCKET_ID_ANY, 0);
+	if (NULL != rp)
+		return -1;
+
+	return 0;
+}
+
+/*
+ * Test that a non-power-of-2 count causes the create
+ * function to fail correctly
+ */
+static int
+test_create_count_odd(void)
+{
+	struct rte_ring *r = rte_ring_create_elem("test_ring_count",
+			4097, 8, SOCKET_ID_ANY, 0 );
+	if(r != NULL){
+		return -1;
+	}
+	return 0;
+}
+
+/*
+ * it tests some more basic ring operations
+ */
+static int
+test_ring_basic_ex(void)
+{
+	int ret = -1;
+	unsigned i;
+	struct rte_ring *rp = NULL;
+	void **obj = NULL;
+
+	obj = rte_calloc("test_ring_basic_ex_malloc", RING_SIZE, sizeof(void *), 0);
+	if (obj == NULL) {
+		printf("test_ring_basic_ex: rte_calloc failed\n");
+		goto fail_test;
+	}
+
+	rp = rte_ring_create_elem("test_ring_basic_ex", RING_SIZE, 8, SOCKET_ID_ANY,
+			RING_F_SP_ENQ | RING_F_SC_DEQ);
+	if (rp == NULL) {
+		printf("test_ring_basic_ex fail to create ring\n");
+		goto fail_test;
+	}
+
+	if (rte_ring_lookup("test_ring_basic_ex") != rp) {
+		goto fail_test;
+	}
+
+	if (rte_ring_empty(rp) != 1) {
+		printf("test_ring_basic_ex ring is not empty but it should be\n");
+		goto fail_test;
+	}
+
+	printf("%u ring entries are now free\n", rte_ring_free_count(rp));
+
+	for (i = 0; i < RING_SIZE; i ++) {
+		rte_ring_enqueue_elem(rp, &obj[i], 8);
+	}
+
+	if (rte_ring_full(rp) != 1) {
+		printf("test_ring_basic_ex ring is not full but it should be\n");
+		goto fail_test;
+	}
+
+	for (i = 0; i < RING_SIZE; i ++) {
+		rte_ring_dequeue_elem(rp, &obj[i], 8);
+	}
+
+	if (rte_ring_empty(rp) != 1) {
+		printf("test_ring_basic_ex ring is not empty but it should be\n");
+		goto fail_test;
+	}
+
+	/* Covering the ring burst operation */
+	ret = rte_ring_enqueue_burst_elem(rp, obj, 8, 2, NULL);
+	if (ret != 2) {
+		printf("test_ring_basic_ex: rte_ring_enqueue_burst fails \n");
+		goto fail_test;
+	}
+
+	ret = rte_ring_dequeue_burst_elem(rp, obj, 8, 2, NULL);
+	if (ret != 2) {
+		printf("test_ring_basic_ex: rte_ring_dequeue_burst fails \n");
+		goto fail_test;
+	}
+
+	ret = 0;
+fail_test:
+	rte_ring_free(rp);
+	if (obj != NULL)
+		rte_free(obj);
+
+	return ret;
+}
+
+static int
+test_ring_with_exact_size(void)
+{
+	struct rte_ring *std_ring = NULL, *exact_sz_ring = NULL;
+	void *ptr_array[16];
+	static const unsigned int ring_sz = RTE_DIM(ptr_array);
+	unsigned int i;
+	int ret = -1;
+
+	std_ring = rte_ring_create_elem("std", ring_sz, 8, rte_socket_id(),
+			RING_F_SP_ENQ | RING_F_SC_DEQ);
+	if (std_ring == NULL) {
+		printf("%s: error, can't create std ring\n", __func__);
+		goto end;
+	}
+	exact_sz_ring = rte_ring_create_elem("exact sz", ring_sz, 8, rte_socket_id(),
+			RING_F_SP_ENQ | RING_F_SC_DEQ | RING_F_EXACT_SZ);
+	if (exact_sz_ring == NULL) {
+		printf("%s: error, can't create exact size ring\n", __func__);
+		goto end;
+	}
+
+	/*
+	 * Check that the exact size ring is bigger than the standard ring
+	 */
+	if (rte_ring_get_size(std_ring) >= rte_ring_get_size(exact_sz_ring)) {
+		printf("%s: error, std ring (size: %u) is not smaller than exact size one (size %u)\n",
+				__func__,
+				rte_ring_get_size(std_ring),
+				rte_ring_get_size(exact_sz_ring));
+		goto end;
+	}
+	/*
+	 * check that the exact_sz_ring can hold one more element than the
+	 * standard ring. (16 vs 15 elements)
+	 */
+	for (i = 0; i < ring_sz - 1; i++) {
+		rte_ring_enqueue_elem(std_ring, ptr_array, 8);
+		rte_ring_enqueue_elem(exact_sz_ring, ptr_array, 8);
+	}
+	if (rte_ring_enqueue_elem(std_ring, ptr_array, 8) != -ENOBUFS) {
+		printf("%s: error, unexpected successful enqueue\n", __func__);
+		goto end;
+	}
+	if (rte_ring_enqueue_elem(exact_sz_ring, ptr_array, 8) == -ENOBUFS) {
+		printf("%s: error, enqueue failed\n", __func__);
+		goto end;
+	}
+
+	/* check that dequeue returns the expected number of elements */
+	if (rte_ring_dequeue_burst_elem(exact_sz_ring, ptr_array, 8,
+			RTE_DIM(ptr_array), NULL) != ring_sz) {
+		printf("%s: error, failed to dequeue expected nb of elements\n",
+				__func__);
+		goto end;
+	}
+
+	/* check that the capacity function returns expected value */
+	if (rte_ring_get_capacity(exact_sz_ring) != ring_sz) {
+		printf("%s: error, incorrect ring capacity reported\n",
+				__func__);
+		goto end;
+	}
+
+	ret = 0; /* all ok if we get here */
+end:
+	rte_ring_free(std_ring);
+	rte_ring_free(exact_sz_ring);
+	return ret;
+}
+
+static int
+test_ring(void)
+{
+	struct rte_ring *r = NULL;
+
+	/* some more basic operations */
+	if (test_ring_basic_ex() < 0)
+		goto test_fail;
+
+	rte_atomic32_init(&synchro);
+
+	r = rte_ring_create_elem("test", RING_SIZE, 8, SOCKET_ID_ANY, 0);
+	if (r == NULL)
+		goto test_fail;
+
+	/* retrieve the ring from its name */
+	if (rte_ring_lookup("test") != r) {
+		printf("Cannot lookup ring from its name\n");
+		goto test_fail;
+	}
+
+	/* burst operations */
+	if (test_ring_burst_basic(r) < 0)
+		goto test_fail;
+
+	/* basic operations */
+	if (test_ring_basic(r) < 0)
+		goto test_fail;
+
+	/* creation with a non-power-of-2 count must fail */
+	if ( test_create_count_odd() < 0){
+		printf("Test failed to detect odd count\n");
+		goto test_fail;
+	} else
+		printf("Test detected odd count\n");
+
+	/* test of creating ring with wrong size */
+	if (test_ring_creation_with_wrong_size() < 0)
+		goto test_fail;
+
+	/* test of creating a ring with a used name */
+	if (test_ring_creation_with_an_used_name() < 0)
+		goto test_fail;
+
+	if (test_ring_with_exact_size() < 0)
+		goto test_fail;
+
+	/* dump the ring status */
+	rte_ring_list_dump(stdout);
+
+	rte_ring_free(r);
+
+	return 0;
+
+test_fail:
+	rte_ring_free(r);
+
+	return -1;
+}
+
+REGISTER_TEST_COMMAND(ring_elem_autotest, test_ring);
-- 
2.17.1



* [dpdk-dev] [RFC v6 4/6] test/ring: add perf tests for configurable element size ring
  2019-10-21  0:22   ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
                       ` (2 preceding siblings ...)
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 3/6] test/ring: add functional tests for configurable element size ring Honnappa Nagarahalli
@ 2019-10-21  0:22     ` Honnappa Nagarahalli
  2019-10-23 10:02       ` Olivier Matz
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 5/6] lib/ring: copy ring elements using memcpy partially Honnappa Nagarahalli
                       ` (2 subsequent siblings)
  6 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-21  0:22 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Add performance tests for rte_ring_xxx_elem APIs. At this point these
are derived mainly from existing rte_ring_xxx test cases.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
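Note (not part of the commit message): the perf tests report an average
cost per iteration, computed as the elapsed counter delta divided by the
iteration count. The sketch below shows that calculation in isolation. It is
a rough illustration under stated assumptions: now_ns(), op_under_test() and
avg_cost() are hypothetical names, and clock_gettime() stands in for
rte_rdtsc() so the result is in nanoseconds rather than TSC cycles.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Portable stand-in for rte_rdtsc(): monotonic time in nanoseconds. */
static uint64_t
now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* volatile keeps the compiler from optimizing the timed loop away */
static volatile uint32_t sink;

/* Hypothetical operation under test; a real test would dequeue here. */
static void
op_under_test(void)
{
	sink++;
}

/* Average cost of one call to op(), in nanoseconds. */
static double
avg_cost(void (*op)(void), unsigned int iterations)
{
	unsigned int i;

	assert(iterations > 0);

	const uint64_t start = now_ns();
	for (i = 0; i < iterations; i++)
		op();
	const uint64_t end = now_ns();

	/* floating-point division keeps the fractional part of the average */
	return (double)(end - start) / iterations;
}
```

Calling avg_cost(op_under_test, 1u << 20) returns the mean cost of one call.
Dividing in floating point keeps sub-unit averages visible, which matters
for very cheap operations such as a dequeue from an empty ring.
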
 app/test/Makefile              |   1 +
 app/test/meson.build           |   1 +
 app/test/test_ring_perf_elem.c | 419 +++++++++++++++++++++++++++++++++
 3 files changed, 421 insertions(+)
 create mode 100644 app/test/test_ring_perf_elem.c

diff --git a/app/test/Makefile b/app/test/Makefile
index 483865b4a..6f168881c 100644
--- a/app/test/Makefile
+++ b/app/test/Makefile
@@ -79,6 +79,7 @@ SRCS-y += test_rand_perf.c
 SRCS-y += test_ring.c
 SRCS-y += test_ring_elem.c
 SRCS-y += test_ring_perf.c
+SRCS-y += test_ring_perf_elem.c
 SRCS-y += test_pmd_perf.c
 
 ifeq ($(CONFIG_RTE_LIBRTE_TABLE),y)
diff --git a/app/test/meson.build b/app/test/meson.build
index 1ca25c00a..634cbbf26 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -102,6 +102,7 @@ test_sources = files('commands.c',
 	'test_ring.c',
 	'test_ring_elem.c',
 	'test_ring_perf.c',
+	'test_ring_perf_elem.c',
 	'test_rwlock.c',
 	'test_sched.c',
 	'test_service_cores.c',
diff --git a/app/test/test_ring_perf_elem.c b/app/test/test_ring_perf_elem.c
new file mode 100644
index 000000000..402b7877a
--- /dev/null
+++ b/app/test/test_ring_perf_elem.c
@@ -0,0 +1,419 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2010-2014 Intel Corporation
+ */
+
+
+#include <stdio.h>
+#include <inttypes.h>
+#include <rte_ring.h>
+#include <rte_ring_elem.h>
+#include <rte_cycles.h>
+#include <rte_launch.h>
+#include <rte_pause.h>
+
+#include "test.h"
+
+/*
+ * Ring
+ * ====
+ *
+ * Measures performance of various operations using rdtsc
+ *  * Empty ring dequeue
+ *  * Enqueue/dequeue of bursts in 1 thread
+ *  * Enqueue/dequeue of bursts in 2 threads
+ */
+
+#define RING_NAME "RING_PERF"
+#define RING_SIZE 4096
+#define MAX_BURST 64
+
+/*
+ * the sizes to enqueue and dequeue in testing
+ * (marked volatile so they won't be seen as compile-time constants)
+ */
+static const volatile unsigned bulk_sizes[] = { 8, 32 };
+
+struct lcore_pair {
+	unsigned c1, c2;
+};
+
+static volatile unsigned lcore_count;
+
+/**** Functions to analyse our core mask to get cores for different tests ***/
+
+static int
+get_two_hyperthreads(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned c1, c2, s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		/* inner loop just re-reads all id's. We could skip the
+		 * first few elements, but since number of cores is small
+		 * there is little point
+		 */
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+
+			c1 = rte_lcore_to_cpu_id(id1);
+			c2 = rte_lcore_to_cpu_id(id2);
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if ((c1 == c2) && (s1 == s2)) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+static int
+get_two_cores(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned c1, c2, s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+
+			c1 = rte_lcore_to_cpu_id(id1);
+			c2 = rte_lcore_to_cpu_id(id2);
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if ((c1 != c2) && (s1 == s2)) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+static int
+get_two_sockets(struct lcore_pair *lcp)
+{
+	unsigned id1, id2;
+	unsigned s1, s2;
+	RTE_LCORE_FOREACH(id1) {
+		RTE_LCORE_FOREACH(id2) {
+			if (id1 == id2)
+				continue;
+			s1 = rte_lcore_to_socket_id(id1);
+			s2 = rte_lcore_to_socket_id(id2);
+			if (s1 != s2) {
+				lcp->c1 = id1;
+				lcp->c2 = id2;
+				return 0;
+			}
+		}
+	}
+	return 1;
+}
+
+/* Get cycle counts for dequeuing from an empty ring. Should be 2 or 3 cycles */
+static void
+test_empty_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 26;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned i = 0;
+	uint32_t burst[MAX_BURST];
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		rte_ring_sc_dequeue_bulk_elem(r, burst, 8, bulk_sizes[0], NULL);
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		rte_ring_mc_dequeue_bulk_elem(r, burst, 8, bulk_sizes[0], NULL);
+	const uint64_t mc_end = rte_rdtsc();
+
+	printf("SC empty dequeue: %.2F\n",
+			(double)(sc_end-sc_start) / iterations);
+	printf("MC empty dequeue: %.2F\n",
+			(double)(mc_end-mc_start) / iterations);
+}
+
+/*
+ * The separate enqueue and dequeue threads each take one input parameter
+ * and return two. Input = burst size, output = cycle average for sp/sc & mp/mc
+ */
+struct thread_params {
+	struct rte_ring *r;
+	unsigned size;        /* input value, the burst size */
+	double spsc, mpmc;    /* output value, the single or multi timings */
+};
+
+/*
+ * Function that uses rdtsc to measure timing for ring enqueue. Needs pair
+ * thread running dequeue_bulk function
+ */
+static int
+enqueue_bulk(void *p)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	struct thread_params *params = p;
+	struct rte_ring *r = params->r;
+	const unsigned size = params->size;
+	unsigned i;
+	uint32_t burst[MAX_BURST] = {0};
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
+		while (lcore_count != 2)
+			rte_pause();
+
+	const uint64_t sp_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_sp_enqueue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t sp_end = rte_rdtsc();
+
+	const uint64_t mp_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_mp_enqueue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t mp_end = rte_rdtsc();
+
+	params->spsc = ((double)(sp_end - sp_start))/(iterations*size);
+	params->mpmc = ((double)(mp_end - mp_start))/(iterations*size);
+	return 0;
+}
+
+/*
+ * Function that uses rdtsc to measure timing for ring dequeue. Needs pair
+ * thread running enqueue_bulk function
+ */
+static int
+dequeue_bulk(void *p)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	struct thread_params *params = p;
+	struct rte_ring *r = params->r;
+	const unsigned size = params->size;
+	unsigned i;
+	uint32_t burst[MAX_BURST] = {0};
+
+#ifdef RTE_USE_C11_MEM_MODEL
+	if (__atomic_add_fetch(&lcore_count, 1, __ATOMIC_RELAXED) != 2)
+#else
+	if (__sync_add_and_fetch(&lcore_count, 1) != 2)
+#endif
+		while (lcore_count != 2)
+			rte_pause();
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_sc_dequeue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++)
+		while (rte_ring_mc_dequeue_bulk_elem(r, burst, 8, size, NULL)
+				== 0)
+			rte_pause();
+	const uint64_t mc_end = rte_rdtsc();
+
+	params->spsc = ((double)(sc_end - sc_start))/(iterations*size);
+	params->mpmc = ((double)(mc_end - mc_start))/(iterations*size);
+	return 0;
+}
+
+/*
+ * Function that calls the enqueue and dequeue bulk functions on pairs of cores.
+ * Used to measure ring perf between hyperthreads, cores and sockets.
+ */
+static void
+run_on_core_pair(struct lcore_pair *cores, struct rte_ring *r,
+		lcore_function_t f1, lcore_function_t f2)
+{
+	struct thread_params param1 = {0}, param2 = {0};
+	unsigned i;
+	for (i = 0; i < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); i++) {
+		lcore_count = 0;
+		param1.size = param2.size = bulk_sizes[i];
+		param1.r = param2.r = r;
+		if (cores->c1 == rte_get_master_lcore()) {
+			rte_eal_remote_launch(f2, &param2, cores->c2);
+			f1(&param1);
+			rte_eal_wait_lcore(cores->c2);
+		} else {
+			rte_eal_remote_launch(f1, &param1, cores->c1);
+			rte_eal_remote_launch(f2, &param2, cores->c2);
+			rte_eal_wait_lcore(cores->c1);
+			rte_eal_wait_lcore(cores->c2);
+		}
+		printf("SP/SC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[i], param1.spsc + param2.spsc);
+		printf("MP/MC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[i], param1.mpmc + param2.mpmc);
+	}
+}
+
+/*
+ * Test function that determines how long an enqueue + dequeue of a single item
+ * takes on a single lcore. Result is for comparison with the bulk enq+deq.
+ */
+static void
+test_single_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 24;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned i = 0;
+	uint32_t burst[2] = {0};
+
+	const uint64_t sc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++) {
+		rte_ring_sp_enqueue_elem(r, burst, 8);
+		rte_ring_sc_dequeue_elem(r, burst, 8);
+	}
+	const uint64_t sc_end = rte_rdtsc();
+
+	const uint64_t mc_start = rte_rdtsc();
+	for (i = 0; i < iterations; i++) {
+		rte_ring_mp_enqueue_elem(r, burst, 8);
+		rte_ring_mc_dequeue_elem(r, burst, 8);
+	}
+	const uint64_t mc_end = rte_rdtsc();
+
+	printf("SP/SC single enq/dequeue: %.2F\n",
+			((double)(sc_end-sc_start)) / iterations);
+	printf("MP/MC single enq/dequeue: %.2F\n",
+			((double)(mc_end-mc_start)) / iterations);
+}
+
+/*
+ * Test that does both enqueue and dequeue on a core using the burst() API calls
+ * instead of the bulk() calls used in other tests. Results should be the same
+ * as for the bulk function called on a single lcore.
+ */
+static void
+test_burst_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned sz, i = 0;
+	uint32_t burst[MAX_BURST] = {0};
+
+	for (sz = 0; sz < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); sz++) {
+		const uint64_t sc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_sp_enqueue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_sc_dequeue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t sc_end = rte_rdtsc();
+
+		const uint64_t mc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_mp_enqueue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_mc_dequeue_burst_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t mc_end = rte_rdtsc();
+
+		double mc_avg = ((double)(mc_end-mc_start) / iterations) /
+					bulk_sizes[sz];
+		double sc_avg = ((double)(sc_end-sc_start) / iterations) /
+					bulk_sizes[sz];
+
+		printf("SP/SC burst enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC burst enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], mc_avg);
+	}
+}
+
+/* Times enqueue and dequeue on a single lcore */
+static void
+test_bulk_enqueue_dequeue(struct rte_ring *r)
+{
+	const unsigned iter_shift = 23;
+	const unsigned iterations = 1<<iter_shift;
+	unsigned sz, i = 0;
+	uint32_t burst[MAX_BURST] = {0};
+
+	for (sz = 0; sz < sizeof(bulk_sizes)/sizeof(bulk_sizes[0]); sz++) {
+		const uint64_t sc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_sp_enqueue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_sc_dequeue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t sc_end = rte_rdtsc();
+
+		const uint64_t mc_start = rte_rdtsc();
+		for (i = 0; i < iterations; i++) {
+			rte_ring_mp_enqueue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+			rte_ring_mc_dequeue_bulk_elem(r, burst, 8,
+					bulk_sizes[sz], NULL);
+		}
+		const uint64_t mc_end = rte_rdtsc();
+
+		double sc_avg = ((double)(sc_end-sc_start) /
+				(iterations * bulk_sizes[sz]));
+		double mc_avg = ((double)(mc_end-mc_start) /
+				(iterations * bulk_sizes[sz]));
+
+		printf("SP/SC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC bulk enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], mc_avg);
+	}
+}
+
+static int
+test_ring_perf_elem(void)
+{
+	struct lcore_pair cores;
+	struct rte_ring *r = NULL;
+
+	r = rte_ring_create_elem(RING_NAME, RING_SIZE, 8, rte_socket_id(), 0);
+	if (r == NULL)
+		return -1;
+
+	printf("### Testing single element and burst enq/deq ###\n");
+	test_single_enqueue_dequeue(r);
+	test_burst_enqueue_dequeue(r);
+
+	printf("\n### Testing empty dequeue ###\n");
+	test_empty_dequeue(r);
+
+	printf("\n### Testing using a single lcore ###\n");
+	test_bulk_enqueue_dequeue(r);
+
+	if (get_two_hyperthreads(&cores) == 0) {
+		printf("\n### Testing using two hyperthreads ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	if (get_two_cores(&cores) == 0) {
+		printf("\n### Testing using two physical cores ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	if (get_two_sockets(&cores) == 0) {
+		printf("\n### Testing using two NUMA nodes ###\n");
+		run_on_core_pair(&cores, r, enqueue_bulk, dequeue_bulk);
+	}
+	rte_ring_free(r);
+	return 0;
+}
+
+REGISTER_TEST_COMMAND(ring_perf_elem_autotest, test_ring_perf_elem);
-- 
2.17.1



* [dpdk-dev] [RFC v6 5/6] lib/ring: copy ring elements using memcpy partially
  2019-10-21  0:22   ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
                       ` (3 preceding siblings ...)
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 4/6] test/ring: add perf " Honnappa Nagarahalli
@ 2019-10-21  0:22     ` Honnappa Nagarahalli
  2019-10-21  0:23     ` [dpdk-dev] [RFC v6 6/6] lib/ring: improved copy function to copy ring elements Honnappa Nagarahalli
  2019-10-23  9:48     ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Olivier Matz
  6 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-21  0:22 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

Ring elements are copied using memcpy for 32B chunks. The remaining
bytes are copied using assignments. The check that restricted esize
to 4, 8 or 16 is removed.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
---
 lib/librte_ring/rte_ring.c      |  10 --
 lib/librte_ring/rte_ring_elem.h | 229 +++++++-------------------------
 2 files changed, 49 insertions(+), 190 deletions(-)
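
A minimal standalone sketch of the copy scheme described in the commit
message: memcpy for each 8-word (32B) chunk, plain assignments for what is
left. The name copy_words_chunked is hypothetical and not part of this patch,
and the tail is written as a simple loop here rather than the fallthrough
switch used in the macros below.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (illustration only): copy nr_words 32-bit words.
 * Whole 8-word (32-byte) chunks go through memcpy; the remaining
 * 0..7 words are copied with plain assignments.
 */
static void
copy_words_chunked(uint32_t *dst, const uint32_t *src, uint32_t nr_words)
{
	uint32_t i;

	assert(dst != NULL && src != NULL);

	/* whole 32-byte chunks */
	for (i = 0; i < (nr_words & ~(uint32_t)7); i += 8)
		memcpy(dst + i, src + i, 8 * sizeof(uint32_t));

	/* tail: fewer than 8 words left */
	for (; i < nr_words; i++)
		dst[i] = src[i];
}
```

With nr_words = 13, one 32B chunk is copied by memcpy and the last 5 words
by assignment.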

diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index e95285259..0f7f4b598 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -51,16 +51,6 @@ rte_ring_get_memsize_elem(unsigned count, unsigned esize)
 {
 	ssize_t sz;
 
-	/* Supported esize values are 4/8/16.
-	 * Others can be added on need basis.
-	 */
-	if (esize != 4 && esize != 8 && esize != 16) {
-		RTE_LOG(ERR, RING,
-			"Unsupported esize value. Supported values are 4, 8 and 16\n");
-
-		return -EINVAL;
-	}
-
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
 		RTE_LOG(ERR, RING,
diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
index 7e9914567..0ce5f2be7 100644
--- a/lib/librte_ring/rte_ring_elem.h
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -24,6 +24,7 @@ extern "C" {
 #include <stdint.h>
 #include <sys/queue.h>
 #include <errno.h>
+#include <string.h>
 #include <rte_common.h>
 #include <rte_config.h>
 #include <rte_memory.h>
@@ -108,215 +109,83 @@ __rte_experimental
 struct rte_ring *rte_ring_create_elem(const char *name, unsigned int count,
 			unsigned int esize, int socket_id, unsigned int flags);
 
-/* the actual enqueue of pointers on the ring.
- * Placed here since identical code needed in both
- * single and multi producer enqueue functions.
- */
-#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
-	if (esize == 4) \
-		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
-	else if (esize == 8) \
-		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
-	else if (esize == 16) \
-		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
-} while (0)
-
-#define ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n) do { \
-	unsigned int i; \
+#define ENQUEUE_PTRS_GEN(r, ring_start, prod_head, obj_table, esize, n) do { \
+	unsigned int i, j; \
 	const uint32_t size = (r)->size; \
 	uint32_t idx = prod_head & (r)->mask; \
 	uint32_t *ring = (uint32_t *)ring_start; \
 	uint32_t *obj = (uint32_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n & ((~(uint32_t)0x7))); i += 8, idx += 8) { \
-			ring[idx] = obj[i]; \
-			ring[idx + 1] = obj[i + 1]; \
-			ring[idx + 2] = obj[i + 2]; \
-			ring[idx + 3] = obj[i + 3]; \
-			ring[idx + 4] = obj[i + 4]; \
-			ring[idx + 5] = obj[i + 5]; \
-			ring[idx + 6] = obj[i + 6]; \
-			ring[idx + 7] = obj[i + 7]; \
+	uint32_t nr_n = n * (esize / sizeof(uint32_t)); \
+	uint32_t nr_idx = idx * (esize / sizeof(uint32_t)); \
+	uint32_t seg0 = size - idx; \
+	if (likely(n < seg0)) { \
+		for (i = 0; i < (nr_n & ((~(unsigned)0x7))); \
+						i += 8, nr_idx += 8) { \
+			memcpy(ring + nr_idx, obj + i, 8 * sizeof (uint32_t)); \
 		} \
-		switch (n & 0x7) { \
+		switch (nr_n & 0x7) { \
 		case 7: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
+			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
 		case 6: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
+			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
 		case 5: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
+			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
 		case 4: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
+			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
 		case 3: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
+			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
 		case 2: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
-		case 1: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
-		} \
-	} else { \
-		for (i = 0; idx < size; i++, idx++)\
-			ring[idx] = obj[i]; \
-		for (idx = 0; i < n; i++, idx++) \
-			ring[idx] = obj[i]; \
-	} \
-} while (0)
-
-#define ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n) do { \
-	unsigned int i; \
-	const uint32_t size = (r)->size; \
-	uint32_t idx = prod_head & (r)->mask; \
-	uint64_t *ring = (uint64_t *)ring_start; \
-	uint64_t *obj = (uint64_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n & ((~(uint32_t)0x3))); i += 4, idx += 4) { \
-			ring[idx] = obj[i]; \
-			ring[idx + 1] = obj[i + 1]; \
-			ring[idx + 2] = obj[i + 2]; \
-			ring[idx + 3] = obj[i + 3]; \
-		} \
-		switch (n & 0x3) { \
-		case 3: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
-		case 2: \
-			ring[idx++] = obj[i++]; /* fallthrough */ \
-		case 1: \
-			ring[idx++] = obj[i++]; \
-		} \
-	} else { \
-		for (i = 0; idx < size; i++, idx++)\
-			ring[idx] = obj[i]; \
-		for (idx = 0; i < n; i++, idx++) \
-			ring[idx] = obj[i]; \
-	} \
-} while (0)
-
-#define ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n) do { \
-	unsigned int i; \
-	const uint32_t size = (r)->size; \
-	uint32_t idx = prod_head & (r)->mask; \
-	__uint128_t *ring = (__uint128_t *)ring_start; \
-	__uint128_t *obj = (__uint128_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
-			ring[idx] = obj[i]; \
-			ring[idx + 1] = obj[i + 1]; \
-		} \
-		switch (n & 0x1) { \
+			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
 		case 1: \
-			ring[idx++] = obj[i++]; \
+			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
 		} \
 	} else { \
-		for (i = 0; idx < size; i++, idx++)\
-			ring[idx] = obj[i]; \
-		for (idx = 0; i < n; i++, idx++) \
-			ring[idx] = obj[i]; \
+		uint32_t nr_seg0 = seg0 * (esize / sizeof(uint32_t)); \
+		uint32_t nr_seg1 = nr_n - nr_seg0; \
+		for (i = 0; i < nr_seg0; i++, nr_idx++)\
+			ring[nr_idx] = obj[i]; \
+		for (j = 0; j < nr_seg1; i++, j++) \
+			ring[j] = obj[i]; \
 	} \
 } while (0)
 
-/* the actual copy of pointers on the ring to obj_table.
- * Placed here since identical code needed in both
- * single and multi consumer dequeue functions.
- */
-#define DEQUEUE_PTRS_ELEM(r, ring_start, cons_head, obj_table, esize, n) do { \
-	if (esize == 4) \
-		DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n); \
-	else if (esize == 8) \
-		DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n); \
-	else if (esize == 16) \
-		DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n); \
-} while (0)
-
-#define DEQUEUE_PTRS_32(r, ring_start, cons_head, obj_table, n) do { \
-	unsigned int i; \
+#define DEQUEUE_PTRS_GEN(r, ring_start, cons_head, obj_table, esize, n) do { \
+	unsigned int i, j; \
 	uint32_t idx = cons_head & (r)->mask; \
 	const uint32_t size = (r)->size; \
 	uint32_t *ring = (uint32_t *)ring_start; \
 	uint32_t *obj = (uint32_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n & (~(uint32_t)0x7)); i += 8, idx += 8) {\
-			obj[i] = ring[idx]; \
-			obj[i + 1] = ring[idx + 1]; \
-			obj[i + 2] = ring[idx + 2]; \
-			obj[i + 3] = ring[idx + 3]; \
-			obj[i + 4] = ring[idx + 4]; \
-			obj[i + 5] = ring[idx + 5]; \
-			obj[i + 6] = ring[idx + 6]; \
-			obj[i + 7] = ring[idx + 7]; \
+	uint32_t nr_n = n * (esize / sizeof(uint32_t)); \
+	uint32_t nr_idx = idx * (esize / sizeof(uint32_t)); \
+	uint32_t seg0 = size - idx; \
+	if (likely(n < seg0)) { \
+		for (i = 0; i < (nr_n & ((~(unsigned)0x7))); \
+						i += 8, nr_idx += 8) { \
+			memcpy(obj + i, ring + nr_idx, 8 * sizeof (uint32_t)); \
 		} \
-		switch (n & 0x7) { \
+		switch (nr_n & 0x7) { \
 		case 7: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
+			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
 		case 6: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
+			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
 		case 5: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
+			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
 		case 4: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
+			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
 		case 3: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
+			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
 		case 2: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
-		case 1: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
-		} \
-	} else { \
-		for (i = 0; idx < size; i++, idx++) \
-			obj[i] = ring[idx]; \
-		for (idx = 0; i < n; i++, idx++) \
-			obj[i] = ring[idx]; \
-	} \
-} while (0)
-
-#define DEQUEUE_PTRS_64(r, ring_start, cons_head, obj_table, n) do { \
-	unsigned int i; \
-	uint32_t idx = cons_head & (r)->mask; \
-	const uint32_t size = (r)->size; \
-	uint64_t *ring = (uint64_t *)ring_start; \
-	uint64_t *obj = (uint64_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n & (~(uint32_t)0x3)); i += 4, idx += 4) {\
-			obj[i] = ring[idx]; \
-			obj[i + 1] = ring[idx + 1]; \
-			obj[i + 2] = ring[idx + 2]; \
-			obj[i + 3] = ring[idx + 3]; \
-		} \
-		switch (n & 0x3) { \
-		case 3: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
-		case 2: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
-		case 1: \
-			obj[i++] = ring[idx++]; \
-		} \
-	} else { \
-		for (i = 0; idx < size; i++, idx++) \
-			obj[i] = ring[idx]; \
-		for (idx = 0; i < n; i++, idx++) \
-			obj[i] = ring[idx]; \
-	} \
-} while (0)
-
-#define DEQUEUE_PTRS_128(r, ring_start, cons_head, obj_table, n) do { \
-	unsigned int i; \
-	uint32_t idx = cons_head & (r)->mask; \
-	const uint32_t size = (r)->size; \
-	__uint128_t *ring = (__uint128_t *)ring_start; \
-	__uint128_t *obj = (__uint128_t *)obj_table; \
-	if (likely(idx + n < size)) { \
-		for (i = 0; i < (n >> 1); i += 2, idx += 2) { \
-			obj[i] = ring[idx]; \
-			obj[i + 1] = ring[idx + 1]; \
-		} \
-		switch (n & 0x1) { \
+			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
 		case 1: \
-			obj[i++] = ring[idx++]; /* fallthrough */ \
+			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
 		} \
 	} else { \
-		for (i = 0; idx < size; i++, idx++) \
-			obj[i] = ring[idx]; \
-		for (idx = 0; i < n; i++, idx++) \
-			obj[i] = ring[idx]; \
+		uint32_t nr_seg0 = seg0 * (esize / sizeof(uint32_t)); \
+		uint32_t nr_seg1 = nr_n - nr_seg0; \
+		for (i = 0; i < nr_seg0; i++, nr_idx++)\
+			obj[i] = ring[nr_idx];\
+		for (j = 0; j < nr_seg1; i++, j++) \
+			obj[i] = ring[j]; \
 	} \
 } while (0)
 
@@ -373,7 +242,7 @@ __rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
 	if (n == 0)
 		goto end;
 
-	ENQUEUE_PTRS_ELEM(r, &r[1], prod_head, obj_table, esize, n);
+	ENQUEUE_PTRS_GEN(r, &r[1], prod_head, obj_table, esize, n);
 
 	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
 end:
@@ -420,7 +289,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 	if (n == 0)
 		goto end;
 
-	DEQUEUE_PTRS_ELEM(r, &r[1], cons_head, obj_table, esize, n);
+	DEQUEUE_PTRS_GEN(r, &r[1], cons_head, obj_table, esize, n);
 
 	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
 
-- 
2.17.1



* [dpdk-dev] [RFC v6 6/6] lib/ring: improved copy function to copy ring elements
  2019-10-21  0:22   ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
                       ` (4 preceding siblings ...)
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 5/6] lib/ring: copy ring elements using memcpy partially Honnappa Nagarahalli
@ 2019-10-21  0:23     ` Honnappa Nagarahalli
  2019-10-23 10:05       ` Olivier Matz
  2019-10-23  9:48     ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Olivier Matz
  6 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-21  0:23 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, drc, hemant.agrawal,
	honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu

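Replace the ENQUEUE_PTRS_GEN and DEQUEUE_PTRS_GEN macros with inline
functions (copy_elems, enqueue_elems, dequeue_elems) that copy to/from
ring elements.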

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---
 lib/librte_ring/rte_ring_elem.h | 165 ++++++++++++++++----------------
 1 file changed, 84 insertions(+), 81 deletions(-)
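
A rough sketch of the index normalization and wrap-around split that the
enqueue path in this patch relies on, using a toy ring instead of
struct rte_ring. The struct toy_ring type and toy_enqueue function are
hypothetical names for illustration, esize is assumed to be a multiple of 4,
and plain memcpy stands in for the chunked copy helper.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy ring: 'size' slots of 'esize' bytes, stored as words. */
struct toy_ring {
	uint32_t size;      /* number of slots, power of 2 */
	uint32_t mask;      /* size - 1 */
	uint32_t esize;     /* element size in bytes, multiple of 4 */
	uint32_t words[64]; /* backing storage in 32-bit words */
};

static void
toy_enqueue(struct toy_ring *r, uint32_t head, const void *objs, uint32_t num)
{
	const uint32_t *src = objs;
	uint32_t idx = head & r->mask;
	/* words per element: converts slot counts into uint32_t counts */
	uint32_t wpe = r->esize / sizeof(uint32_t);
	uint32_t nr_idx = idx * wpe;
	uint32_t nr_num = num * wpe;
	uint32_t s0 = r->size - idx; /* slots left before the ring wraps */

	assert(r->esize % sizeof(uint32_t) == 0 && num <= r->size);
	assert(r->size * wpe <= 64);

	if (num < s0) {
		/* no wrap: one contiguous copy */
		memcpy(r->words + nr_idx, src, nr_num * sizeof(uint32_t));
	} else {
		/* wrap: fill up to the end, then continue from the start */
		uint32_t nr_s0 = s0 * wpe;

		memcpy(r->words + nr_idx, src, nr_s0 * sizeof(uint32_t));
		memcpy(r->words, src + nr_s0,
				(nr_num - nr_s0) * sizeof(uint32_t));
	}
}
```

For a ring of 8 slots with esize 8, enqueuing 4 elements at head 6 places
the first two elements in slots 6 and 7 and the last two in slots 0 and 1.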

diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
index 0ce5f2be7..80ec3c562 100644
--- a/lib/librte_ring/rte_ring_elem.h
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -109,85 +109,88 @@ __rte_experimental
 struct rte_ring *rte_ring_create_elem(const char *name, unsigned int count,
 			unsigned int esize, int socket_id, unsigned int flags);
 
-#define ENQUEUE_PTRS_GEN(r, ring_start, prod_head, obj_table, esize, n) do { \
-	unsigned int i, j; \
-	const uint32_t size = (r)->size; \
-	uint32_t idx = prod_head & (r)->mask; \
-	uint32_t *ring = (uint32_t *)ring_start; \
-	uint32_t *obj = (uint32_t *)obj_table; \
-	uint32_t nr_n = n * (esize / sizeof(uint32_t)); \
-	uint32_t nr_idx = idx * (esize / sizeof(uint32_t)); \
-	uint32_t seg0 = size - idx; \
-	if (likely(n < seg0)) { \
-		for (i = 0; i < (nr_n & ((~(unsigned)0x7))); \
-						i += 8, nr_idx += 8) { \
-			memcpy(ring + nr_idx, obj + i, 8 * sizeof (uint32_t)); \
-		} \
-		switch (nr_n & 0x7) { \
-		case 7: \
-			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
-		case 6: \
-			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
-		case 5: \
-			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
-		case 4: \
-			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
-		case 3: \
-			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
-		case 2: \
-			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
-		case 1: \
-			ring[nr_idx++] = obj[i++]; /* fallthrough */ \
-		} \
-	} else { \
-		uint32_t nr_seg0 = seg0 * (esize / sizeof(uint32_t)); \
-		uint32_t nr_seg1 = nr_n - nr_seg0; \
-		for (i = 0; i < nr_seg0; i++, nr_idx++)\
-			ring[nr_idx] = obj[i]; \
-		for (j = 0; j < nr_seg1; i++, j++) \
-			ring[j] = obj[i]; \
-	} \
-} while (0)
-
-#define DEQUEUE_PTRS_GEN(r, ring_start, cons_head, obj_table, esize, n) do { \
-	unsigned int i, j; \
-	uint32_t idx = cons_head & (r)->mask; \
-	const uint32_t size = (r)->size; \
-	uint32_t *ring = (uint32_t *)ring_start; \
-	uint32_t *obj = (uint32_t *)obj_table; \
-	uint32_t nr_n = n * (esize / sizeof(uint32_t)); \
-	uint32_t nr_idx = idx * (esize / sizeof(uint32_t)); \
-	uint32_t seg0 = size - idx; \
-	if (likely(n < seg0)) { \
-		for (i = 0; i < (nr_n & ((~(unsigned)0x7))); \
-						i += 8, nr_idx += 8) { \
-			memcpy(obj + i, ring + nr_idx, 8 * sizeof (uint32_t)); \
-		} \
-		switch (nr_n & 0x7) { \
-		case 7: \
-			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
-		case 6: \
-			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
-		case 5: \
-			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
-		case 4: \
-			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
-		case 3: \
-			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
-		case 2: \
-			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
-		case 1: \
-			obj[i++] = ring[nr_idx++]; /* fallthrough */ \
-		} \
-	} else { \
-		uint32_t nr_seg0 = seg0 * (esize / sizeof(uint32_t)); \
-		uint32_t nr_seg1 = nr_n - nr_seg0; \
-		for (i = 0; i < nr_seg0; i++, nr_idx++)\
-			obj[i] = ring[nr_idx];\
-		for (j = 0; j < nr_seg1; i++, j++) \
-			obj[i] = ring[j]; \
-	} \
-} while (0)
+static __rte_always_inline void
+copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t nr_num)
+{
+	uint32_t i;
+
+	for (i = 0; i < (nr_num & ~7); i += 8)
+		memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
+
+	switch (nr_num & 7) {
+	case 7: du32[nr_num - 7] = su32[nr_num - 7]; /* fallthrough */
+	case 6: du32[nr_num - 6] = su32[nr_num - 6]; /* fallthrough */
+	case 5: du32[nr_num - 5] = su32[nr_num - 5]; /* fallthrough */
+	case 4: du32[nr_num - 4] = su32[nr_num - 4]; /* fallthrough */
+	case 3: du32[nr_num - 3] = su32[nr_num - 3]; /* fallthrough */
+	case 2: du32[nr_num - 2] = su32[nr_num - 2]; /* fallthrough */
+	case 1: du32[nr_num - 1] = su32[nr_num - 1];
+	}
+}
+
+static __rte_always_inline void
+enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
+		void *obj_table, uint32_t num, uint32_t esize)
+{
+	uint32_t idx, nr_idx, nr_num;
+	uint32_t *du32;
+	const uint32_t *su32;
+
+	const uint32_t size = r->size;
+	uint32_t s0, nr_s0, nr_s1;
+
+	idx = prod_head & (r)->mask;
+	/* Normalize the idx to uint32_t */
+	nr_idx = (idx * esize) / sizeof(uint32_t);
+
+	du32 = (uint32_t *)ring_start + nr_idx;
+	su32 = obj_table;
+
+	/* Normalize the number of elements to uint32_t */
+	nr_num = (num * esize) / sizeof(uint32_t);
+
+	s0 = size - idx;
+	if (num < s0)
+		copy_elems(du32, su32, nr_num);
+	else {
+		nr_s0 = (s0 * esize) / sizeof(uint32_t);
+		nr_s1 = nr_num - nr_s0;
+		copy_elems(du32, su32, nr_s0);
+		copy_elems(ring_start, su32 + nr_s0, nr_s1);
+	}
+}
+
+static __rte_always_inline void
+dequeue_elems(struct rte_ring *r, void *ring_start, uint32_t cons_head,
+		void *obj_table, uint32_t num, uint32_t esize)
+{
+	uint32_t idx, nr_idx, nr_num;
+	uint32_t *du32;
+	const uint32_t *su32;
+
+	const uint32_t size = r->size;
+	uint32_t s0, nr_s0, nr_s1;
+
+	idx = cons_head & (r)->mask;
+	/* Normalize the idx to uint32_t */
+	nr_idx = (idx * esize) / sizeof(uint32_t);
+
+	su32 = (uint32_t *)ring_start + nr_idx;
+	du32 = obj_table;
+
+	/* Normalize the number of elements to uint32_t */
+	nr_num = (num * esize) / sizeof(uint32_t);
+
+	s0 = size - idx;
+	if (num < s0)
+		copy_elems(du32, su32, nr_num);
+	else {
+		nr_s0 = (s0 * esize) / sizeof(uint32_t);
+		nr_s1 = nr_num - nr_s0;
+		copy_elems(du32, su32, nr_s0);
+		copy_elems(du32 + nr_s0, ring_start, nr_s1);
+	}
+}
 
 /* Between load and load. there might be cpu reorder in weak model
  * (powerpc/arm).
@@ -242,7 +245,7 @@ __rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
 	if (n == 0)
 		goto end;
 
-	ENQUEUE_PTRS_GEN(r, &r[1], prod_head, obj_table, esize, n);
+	enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);
 
 	update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
 end:
@@ -289,7 +292,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
 	if (n == 0)
 		goto end;
 
-	DEQUEUE_PTRS_GEN(r, &r[1], cons_head, obj_table, esize, n);
+	dequeue_elems(r, &r[1], cons_head, obj_table, n, esize);
 
 	update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
 
-- 
2.17.1



* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-18 16:11                           ` Jerin Jacob
@ 2019-10-21  0:27                             ` Honnappa Nagarahalli
  0 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-21  0:27 UTC (permalink / raw)
  To: Jerin Jacob
  Cc: David Christensen, Ananyev, Konstantin, olivier.matz, sthemmin,
	jerinj, Richardson, Bruce, david.marchand, pbhagavatula, dev,
	Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd, Honnappa Nagarahalli, nd

> > >
> > > > Subject: Re: [PATCH v4 1/2] lib/ring: apis to support configurable
> > > > element size
> > > >
> > > > >>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the
> > > > >>> results are as
> > > > >> follows. The numbers in brackets are with the code on master.
> > > > >>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> > > > >>>
> > > > >>> RTE>>ring_perf_elem_autotest
> > > > >>> ### Testing single element and burst enq/deq ###
> > > > >>> SP/SC single enq/dequeue: 5
> > > > >>> MP/MC single enq/dequeue: 40 (35)
> > > > >>> SP/SC burst enq/dequeue (size: 8): 2
> > > > >>> MP/MC burst enq/dequeue (size: 8): 6
> > > > >>> SP/SC burst enq/dequeue (size: 32): 1 (2)
> > > > >>> MP/MC burst enq/dequeue (size: 32): 2
> > > > >>>
> > > > >>> ### Testing empty dequeue ###
> > > > >>> SC empty dequeue: 2.11
> > > > >>> MC empty dequeue: 1.41 (2.11)
> > > > >>>
> > > > >>> ### Testing using a single lcore ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> > > > >>>
> > > > >>> ### Testing using two physical cores ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> > > > >>>
> > > > >>> ### Testing using two NUMA nodes ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> > > > >>>
> > > > >>> On one of the Arm platforms
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the
> > > > >>> rest are
> > > > >>> ok)
> > > >
> > > > Tried this on a Power9 platform (3.6GHz), with two numa nodes and
> > > > 16 cores/node (SMT=4).  Applied all 3 patches in v5, test results
> > > > are as
> > > > follows:
> > > >
> > > > RTE>>ring_perf_elem_autotest
> > > > ### Testing single element and burst enq/deq ###
> > > > SP/SC single enq/dequeue: 42
> > > > MP/MC single enq/dequeue: 59
> > > > SP/SC burst enq/dequeue (size: 8): 5
> > > > MP/MC burst enq/dequeue (size: 8): 7
> > > > SP/SC burst enq/dequeue (size: 32): 2
> > > > MP/MC burst enq/dequeue (size: 32): 2
> > > >
> > > > ### Testing empty dequeue ###
> > > > SC empty dequeue: 7.81
> > > > MC empty dequeue: 7.81
> > > >
> > > > ### Testing using a single lcore ###
> > > > SP/SC bulk enq/dequeue (size: 8): 5.76
> > > > MP/MC bulk enq/dequeue (size: 8): 7.66
> > > > SP/SC bulk enq/dequeue (size: 32): 2.10
> > > > MP/MC bulk enq/dequeue (size: 32): 2.57
> > > >
> > > > ### Testing using two hyperthreads ###
> > > > SP/SC bulk enq/dequeue (size: 8): 13.13
> > > > MP/MC bulk enq/dequeue (size: 8): 13.98
> > > > SP/SC bulk enq/dequeue (size: 32): 3.41
> > > > MP/MC bulk enq/dequeue (size: 32): 4.45
> > > >
> > > > ### Testing using two physical cores ###
> > > > SP/SC bulk enq/dequeue (size: 8): 11.00
> > > > MP/MC bulk enq/dequeue (size: 8): 10.95
> > > > SP/SC bulk enq/dequeue (size: 32): 3.08
> > > > MP/MC bulk enq/dequeue (size: 32): 3.40
> > > >
> > > > ### Testing using two NUMA nodes ###
> > > > SP/SC bulk enq/dequeue (size: 8): 63.41
> > > > MP/MC bulk enq/dequeue (size: 8): 62.70
> > > > SP/SC bulk enq/dequeue (size: 32): 15.39
> > > > MP/MC bulk enq/dequeue (size: 32): 22.96
> > > >
> > > Thanks for running this. There is another test 'ring_perf_autotest' which
> > > provides the numbers with the original implementation. The goal is to make
> > > sure the numbers with the original implementation are the same as these.
> > > Can you please run that as well?
> >
> > Honnappa,
> >
> > Your earlier perf report shows cycle counts of less than 1. That
> > is because it is using a 50 or 100MHz clock in EL0.
> > Please check with the PMU counter. See "ARM64 profiling" in
> >
> > http://doc.dpdk.org/guides/prog_guide/profile_app.html
I am aware of this. Unfortunately, it does not work on all platforms. The kernel team discourages using the cycle counter for this purpose.
I have replaced the modulo operation with division (in v6), which adds a couple of decimal places to the results.
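
To illustrate why a division matters when the timer runs much slower than the core: the sketch below is not the code from v6, only an assumed minimal example. The function names `avg_ticks_int` and `avg_ticks_fp` are hypothetical. It contrasts a truncating integer average with a floating-point one for a case where the total tick count is smaller than the number of operations.

```c
#include <assert.h>
#include <stdint.h>

/* Truncating average: any result below one tick per operation becomes 0. */
static inline uint64_t
avg_ticks_int(uint64_t total_ticks, uint64_t ops)
{
	assert(ops != 0);
	return total_ticks / ops;
}

/* Floating-point average: keeps the fractional part of the result. */
static inline double
avg_ticks_fp(uint64_t total_ticks, uint64_t ops)
{
	assert(ops != 0);
	return (double)total_ticks / (double)ops;
}
```

With 37 ticks measured over 100 operations, the integer version reports 0 while the floating-point version reports 0.37, which is the kind of sub-1 value seen in the Arm results quoted in this thread.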

> >
> >
> > Here are the octeontx2 values. There is a regression in the two-core cases,
> > as you reported earlier on x86.
> >
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 288
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 61
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 21
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.35
> > MP/MC bulk enq/dequeue (size: 8): 67.36
> > SP/SC bulk enq/dequeue (size: 32): 13.10
> > MP/MC bulk enq/dequeue (size: 32): 21.64
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 75.94
> > MP/MC bulk enq/dequeue (size: 8): 107.66
> > SP/SC bulk enq/dequeue (size: 32): 24.51
> > MP/MC bulk enq/dequeue (size: 32): 33.23
> > Test OK
> > RTE>>
> >
> > ---- after applying v5 of the patch ------
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 289
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 40
> > MP/MC burst enq/dequeue (size: 8): 64
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 22
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 39.73
> > MP/MC bulk enq/dequeue (size: 8): 69.13
> > SP/SC bulk enq/dequeue (size: 32): 13.44
> > MP/MC bulk enq/dequeue (size: 32): 22.00
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 76.02
> > MP/MC bulk enq/dequeue (size: 8): 112.50
> > SP/SC bulk enq/dequeue (size: 32): 24.71
> > MP/MC bulk enq/dequeue (size: 32): 33.34
> > Test OK
> > RTE>>
> >
> > RTE>>ring_perf_elem_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 290
> > MP/MC single enq/dequeue: 503
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 63
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 19
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.92
> > MP/MC bulk enq/dequeue (size: 8): 62.54
> > SP/SC bulk enq/dequeue (size: 32): 11.46
> > MP/MC bulk enq/dequeue (size: 32): 19.89
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 87.55
> > MP/MC bulk enq/dequeue (size: 8): 99.10
> > SP/SC bulk enq/dequeue (size: 32): 26.63
> > MP/MC bulk enq/dequeue (size: 32): 29.91
> > Test OK
> > RTE>>
> 
> It looks like removing 3/3 and keeping only 1/3 and 2/3 gives better
> results in some cases.
> 
> 
> RTE>>ring_perf_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 288
> MP/MC single enq/dequeue: 439
> SP/SC burst enq/dequeue (size: 8): 39
> MP/MC burst enq/dequeue (size: 8): 61
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 22
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.67
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 38.35
> MP/MC bulk enq/dequeue (size: 8): 67.48
> SP/SC bulk enq/dequeue (size: 32): 13.40
> MP/MC bulk enq/dequeue (size: 32): 22.03
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 75.94
> MP/MC bulk enq/dequeue (size: 8): 105.84
> SP/SC bulk enq/dequeue (size: 32): 25.11
> MP/MC bulk enq/dequeue (size: 32): 33.48
> Test OK
> RTE>>
> 
> 
> RTE>>ring_perf_elem_autotest
> ### Testing single element and burst enq/deq ###
> SP/SC single enq/dequeue: 288
> MP/MC single enq/dequeue: 452
> SP/SC burst enq/dequeue (size: 8): 39
> MP/MC burst enq/dequeue (size: 8): 61
> SP/SC burst enq/dequeue (size: 32): 13
> MP/MC burst enq/dequeue (size: 32): 22
> 
> ### Testing empty dequeue ###
> SC empty dequeue: 6.33
> MC empty dequeue: 6.00
> 
> ### Testing using a single lcore ###
> SP/SC bulk enq/dequeue (size: 8): 38.35
> MP/MC bulk enq/dequeue (size: 8): 67.46
> SP/SC bulk enq/dequeue (size: 32): 13.42
> MP/MC bulk enq/dequeue (size: 32): 22.01
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 76.04
> MP/MC bulk enq/dequeue (size: 8): 104.88
> SP/SC bulk enq/dequeue (size: 32): 24.75
> MP/MC bulk enq/dequeue (size: 32): 34.66
> Test OK
> RTE>>
> 
> 
> >
> >
> >
> > > > Dave

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-18 16:44                           ` Ananyev, Konstantin
  2019-10-18 19:03                             ` Honnappa Nagarahalli
@ 2019-10-21  0:36                             ` Honnappa Nagarahalli
  2019-10-21  9:04                               ` Ananyev, Konstantin
  1 sibling, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-21  0:36 UTC (permalink / raw)
  To: Ananyev, Konstantin, Jerin Jacob
  Cc: David Christensen, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula, dev, Dharmik Thakkar,
	Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd, Honnappa Nagarahalli, nd

> 
> Hi everyone,
> 
> 
> > > > >>> I tried this. On x86 (Xeon(R) Gold 6132 CPU @ 2.60GHz), the results
> > > > >>> are as follows. The numbers in brackets are with the code on master.
> > > > >>> gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0
> > > > >>>
> > > > >>> RTE>>ring_perf_elem_autotest
> > > > >>> ### Testing single element and burst enq/deq ###
> > > > >>> SP/SC single enq/dequeue: 5
> > > > >>> MP/MC single enq/dequeue: 40 (35)
> > > > >>> SP/SC burst enq/dequeue (size: 8): 2
> > > > >>> MP/MC burst enq/dequeue (size: 8): 6
> > > > >>> SP/SC burst enq/dequeue (size: 32): 1 (2)
> > > > >>> MP/MC burst enq/dequeue (size: 32): 2
> > > > >>>
> > > > >>> ### Testing empty dequeue ###
> > > > >>> SC empty dequeue: 2.11
> > > > >>> MC empty dequeue: 1.41 (2.11)
> > > > >>>
> > > > >>> ### Testing using a single lcore ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 2.15 (2.86)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 6.35 (6.91)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 1.35 (2.06)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 2.38 (2.95)
> > > > >>>
> > > > >>> ### Testing using two physical cores ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 73.81 (15.33)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 75.10 (71.27)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 21.14 (9.58)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 25.74 (20.91)
> > > > >>>
> > > > >>> ### Testing using two NUMA nodes ###
> > > > >>> SP/SC bulk enq/dequeue (size: 8): 164.32 (50.66)
> > > > >>> MP/MC bulk enq/dequeue (size: 8): 176.02 (173.43)
> > > > >>> SP/SC bulk enq/dequeue (size: 32): 50.78 (23)
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 63.17 (46.74)
> > > > >>>
> > > > >>> On one of the Arm platforms:
> > > > >>> MP/MC bulk enq/dequeue (size: 32): 0.37 (0.33) (~12% hit, the rest are ok)
> > > >
> > > > Tried this on a Power9 platform (3.6GHz), with two numa nodes and
> > > > 16 cores/node (SMT=4).  Applied all 3 patches in v5, test results
> > > > are as
> > > > follows:
> > > >
> > > > RTE>>ring_perf_elem_autotest
> > > > ### Testing single element and burst enq/deq ###
> > > > SP/SC single enq/dequeue: 42
> > > > MP/MC single enq/dequeue: 59
> > > > SP/SC burst enq/dequeue (size: 8): 5
> > > > MP/MC burst enq/dequeue (size: 8): 7
> > > > SP/SC burst enq/dequeue (size: 32): 2
> > > > MP/MC burst enq/dequeue (size: 32): 2
> > > >
> > > > ### Testing empty dequeue ###
> > > > SC empty dequeue: 7.81
> > > > MC empty dequeue: 7.81
> > > >
> > > > ### Testing using a single lcore ###
> > > > SP/SC bulk enq/dequeue (size: 8): 5.76
> > > > MP/MC bulk enq/dequeue (size: 8): 7.66
> > > > SP/SC bulk enq/dequeue (size: 32): 2.10
> > > > MP/MC bulk enq/dequeue (size: 32): 2.57
> > > >
> > > > ### Testing using two hyperthreads ###
> > > > SP/SC bulk enq/dequeue (size: 8): 13.13
> > > > MP/MC bulk enq/dequeue (size: 8): 13.98
> > > > SP/SC bulk enq/dequeue (size: 32): 3.41
> > > > MP/MC bulk enq/dequeue (size: 32): 4.45
> > > >
> > > > ### Testing using two physical cores ###
> > > > SP/SC bulk enq/dequeue (size: 8): 11.00
> > > > MP/MC bulk enq/dequeue (size: 8): 10.95
> > > > SP/SC bulk enq/dequeue (size: 32): 3.08
> > > > MP/MC bulk enq/dequeue (size: 32): 3.40
> > > >
> > > > ### Testing using two NUMA nodes ###
> > > > SP/SC bulk enq/dequeue (size: 8): 63.41
> > > > MP/MC bulk enq/dequeue (size: 8): 62.70
> > > > SP/SC bulk enq/dequeue (size: 32): 15.39
> > > > MP/MC bulk enq/dequeue (size: 32): 22.96
> > > >
> > > Thanks for running this. There is another test 'ring_perf_autotest'
> > > which provides the numbers with the original implementation. The
> > > goal is to make sure the numbers with the original implementation are
> > > the same as these. Can you please run that as well?
> >
> > Honnappa,
> >
> > Your earlier perf report shows cycle counts of less than 1. That
> > is because it is using a 50 or 100MHz clock in EL0.
> > Please check with the PMU counter. See "ARM64 profiling" in
> >
> > http://doc.dpdk.org/guides/prog_guide/profile_app.html
> >
> >
> > Here are the octeontx2 values. There is a regression in the two-core cases,
> > as you reported earlier on x86.
> >
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 288
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 61
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 21
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.35
> > MP/MC bulk enq/dequeue (size: 8): 67.36
> > SP/SC bulk enq/dequeue (size: 32): 13.10
> > MP/MC bulk enq/dequeue (size: 32): 21.64
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 75.94
> > MP/MC bulk enq/dequeue (size: 8): 107.66
> > SP/SC bulk enq/dequeue (size: 32): 24.51
> > MP/MC bulk enq/dequeue (size: 32): 33.23
> > Test OK
> > RTE>>
> >
> > ---- after applying v5 of the patch ------
> >
> > RTE>>ring_perf_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 289
> > MP/MC single enq/dequeue: 452
> > SP/SC burst enq/dequeue (size: 8): 40
> > MP/MC burst enq/dequeue (size: 8): 64
> > SP/SC burst enq/dequeue (size: 32): 13
> > MP/MC burst enq/dequeue (size: 32): 22
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 39.73
> > MP/MC bulk enq/dequeue (size: 8): 69.13
> > SP/SC bulk enq/dequeue (size: 32): 13.44
> > MP/MC bulk enq/dequeue (size: 32): 22.00
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 76.02
> > MP/MC bulk enq/dequeue (size: 8): 112.50
> > SP/SC bulk enq/dequeue (size: 32): 24.71
> > MP/MC bulk enq/dequeue (size: 32): 33.34
> > Test OK
> > RTE>>
> >
> > RTE>>ring_perf_elem_autotest
> > ### Testing single element and burst enq/deq ###
> > SP/SC single enq/dequeue: 290
> > MP/MC single enq/dequeue: 503
> > SP/SC burst enq/dequeue (size: 8): 39
> > MP/MC burst enq/dequeue (size: 8): 63
> > SP/SC burst enq/dequeue (size: 32): 11
> > MP/MC burst enq/dequeue (size: 32): 19
> >
> > ### Testing empty dequeue ###
> > SC empty dequeue: 6.33
> > MC empty dequeue: 6.67
> >
> > ### Testing using a single lcore ###
> > SP/SC bulk enq/dequeue (size: 8): 38.92
> > MP/MC bulk enq/dequeue (size: 8): 62.54
> > SP/SC bulk enq/dequeue (size: 32): 11.46
> > MP/MC bulk enq/dequeue (size: 32): 19.89
> >
> > ### Testing using two physical cores ###
> > SP/SC bulk enq/dequeue (size: 8): 87.55
> > MP/MC bulk enq/dequeue (size: 8): 99.10
> > SP/SC bulk enq/dequeue (size: 32): 26.63
> > MP/MC bulk enq/dequeue (size: 32): 29.91
> > Test OK
> > RTE>>
> >
> 
> As far as I can see, there is a copy&paste bug in patch #3 (that is probably
> why it produced some weird numbers for me at first).
> With the fix applied (see patch below), things look pretty good on my box.
> As far as I can see, only 3 results are noticeably lower:
>    SP/SC (size=8) over 2 physical cores on the same numa socket
>    MP/MC (size=8) over 2 physical cores on different numa sockets.
> All others seem about the same or better.
> Anyway, I went ahead and reworked the code a bit (as I suggested before) to get
> rid of these huge ENQUEUE/DEQUEUE macros.
> Results are very close to the fixed patch #3 version (patch is also attached).
> Though I suggest people hold off on re-running perf tests until we make the ring
> functional test run for the _elem_ functions too.
> I started to work on that, but I am not sure I'll finish today (most likely Monday).
I have sent v6. It adds test cases for the 'rte_ring_xxx_elem' APIs. All issues are fixed in both copy methods; more info below. I will post the performance numbers soon.

> Perf results from my box, plus patches below.
> Konstantin
> 
> perf results
> ==========
> 
> Intel(R) Xeon(R) Platinum 8160 CPU @ 2.10GHz
> 
> A - ring_perf_autotest
> B - ring_perf_elem_autotest + patch #3 + fix
> C - B + update
> 
> ### Testing using a single lcore ###	A	B	C
> SP/SC bulk enq/dequeue (size: 8): 	4.06	3.06	3.22
> MP/MC bulk enq/dequeue (size: 8): 	10.05	9.04	9.38
> SP/SC bulk enq/dequeue (size: 32): 	2.93	1.91	1.84
> MP/MC bulk enq/dequeue (size: 32): 	4.12	3.39	3.35
> 
> ### Testing using two hyperthreads ###
> SP/SC bulk enq/dequeue (size: 8): 	9.24	8.92	8.89
> MP/MC bulk enq/dequeue (size: 8): 	15.47	15.39	16.02
> SP/SC bulk enq/dequeue (size: 32): 	5.78	3.87	3.86
> MP/MC bulk enq/dequeue (size: 32): 	6.41	4.57	4.45
> 
> ### Testing using two physical cores ###
> SP/SC bulk enq/dequeue (size: 8): 	24.14	29.89	27.05
> MP/MC bulk enq/dequeue (size: 8): 	68.61	70.55	69.85
> SP/SC bulk enq/dequeue (size: 32): 	12.11	12.99	13.04
> MP/MC bulk enq/dequeue (size: 32): 	22.14	17.86	18.25
> 
> ### Testing using two NUMA nodes ###
> SP/SC bulk enq/dequeue (size: 8): 	48.78	31.98	33.57
> MP/MC bulk enq/dequeue (size: 8): 	167.53	197.29	192.13
> SP/SC bulk enq/dequeue (size: 32): 	31.28	21.68	21.61
> MP/MC bulk enq/dequeue (size: 32): 	53.45	49.94	48.81
> 
> fix patch
> =======
> 
> From a2be5a9b136333a56d466ef042c655e522ca7012 Mon Sep 17 00:00:00
> 2001
> From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Date: Fri, 18 Oct 2019 15:50:43 +0100
> Subject: [PATCH] fix1
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_ring/rte_ring_elem.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> index 92e92f150..5e1819069 100644
> --- a/lib/librte_ring/rte_ring_elem.h
> +++ b/lib/librte_ring/rte_ring_elem.h
> @@ -118,7 +118,7 @@ struct rte_ring *rte_ring_create_elem(const char
> *name, unsigned count,
>         uint32_t sz = n * (esize / sizeof(uint32_t)); \
>         if (likely(idx + n < size)) { \
>                 for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -                       memcpy (ring + i, obj + i, 8 * sizeof (uint32_t)); \
> +                       memcpy (ring + idx, obj + i, 8 * sizeof
> + (uint32_t)); \
>                 } \
>                 switch (n & 0x7) { \
>                 case 7: \
> @@ -153,7 +153,7 @@ struct rte_ring *rte_ring_create_elem(const char
> *name, unsigned count,
>         uint32_t sz = n * (esize / sizeof(uint32_t)); \
>         if (likely(idx + n < size)) { \
>                 for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -                       memcpy (obj + i, ring + i, 8 * sizeof (uint32_t)); \
> +                       memcpy (obj + i, ring + idx, 8 * sizeof
Actually, this fix alone is not enough. 'idx' needs to be normalized to elements of type 'uint32_t'.

> + (uint32_t)); \
>                 } \
>                 switch (n & 0x7) { \
>                 case 7: \
> --
> 2.17.1
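
To make the normalization comment on the fix patch above concrete: 'idx' counts ring slots, while 'ring' and 'obj' are addressed as uint32_t words, so the slot index has to be scaled by esize / sizeof(uint32_t) before it is used as an offset. The sketch below is an assumption-laden illustration, not the code from v6; the helpers `slot_to_word` and `put_elem` are hypothetical names, and esize is assumed to be a multiple of 4 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical helper (not from the patch): convert a ring slot index
 * into an offset counted in uint32_t words. esize is the element size
 * in bytes and is assumed to be a multiple of sizeof(uint32_t).
 */
static inline uint32_t
slot_to_word(uint32_t slot, uint32_t esize)
{
	assert(esize % sizeof(uint32_t) == 0);
	return slot * (esize / (uint32_t)sizeof(uint32_t));
}

/* Copy one element of esize bytes into the given slot of a word-addressed ring. */
static inline void
put_elem(uint32_t *ring, uint32_t slot, const void *elem, uint32_t esize)
{
	memcpy(ring + slot_to_word(slot, esize), elem, esize);
}
```

For 4-byte elements the slot index and the word offset coincide, which is why an unnormalized index can appear to work in some tests; for 16-byte elements slot 3 starts at word 12.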
> 
> update patch (remove macros)
> =========================
> 
> From 18b388e877b97e243f807f27a323e876b30869dd Mon Sep 17 00:00:00
> 2001
> From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> Date: Fri, 18 Oct 2019 17:35:43 +0100
> Subject: [PATCH] update1
> 
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_ring/rte_ring_elem.h | 141 ++++++++++++++++----------------
>  1 file changed, 70 insertions(+), 71 deletions(-)
> 
> diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> index 5e1819069..eb706b12f 100644
> --- a/lib/librte_ring/rte_ring_elem.h
> +++ b/lib/librte_ring/rte_ring_elem.h
> @@ -109,75 +109,74 @@ __rte_experimental  struct rte_ring
> *rte_ring_create_elem(const char *name, unsigned count,
>                                 unsigned esize, int socket_id, unsigned flags);
> 
> -#define ENQUEUE_PTRS_GEN(r, ring_start, prod_head, obj_table, esize, n)
> do { \
> -       unsigned int i; \
> -       const uint32_t size = (r)->size; \
> -       uint32_t idx = prod_head & (r)->mask; \
> -       uint32_t *ring = (uint32_t *)ring_start; \
> -       uint32_t *obj = (uint32_t *)obj_table; \
> -       uint32_t sz = n * (esize / sizeof(uint32_t)); \
> -       if (likely(idx + n < size)) { \
> -               for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -                       memcpy (ring + idx, obj + i, 8 * sizeof (uint32_t)); \
> -               } \
> -               switch (n & 0x7) { \
> -               case 7: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 6: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 5: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 4: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 3: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 2: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               case 1: \
> -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> -               } \
> -       } else { \
> -               for (i = 0; idx < size; i++, idx++)\
> -                       ring[idx] = obj[i]; \
> -               for (idx = 0; i < n; i++, idx++) \
> -                       ring[idx] = obj[i]; \
> -       } \
> -} while (0)
> -
> -#define DEQUEUE_PTRS_GEN(r, ring_start, cons_head, obj_table, esize, n)
> do { \
> -       unsigned int i; \
> -       uint32_t idx = cons_head & (r)->mask; \
> -       const uint32_t size = (r)->size; \
> -       uint32_t *ring = (uint32_t *)ring_start; \
> -       uint32_t *obj = (uint32_t *)obj_table; \
> -       uint32_t sz = n * (esize / sizeof(uint32_t)); \
> -       if (likely(idx + n < size)) { \
> -               for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> -                       memcpy (obj + i, ring + idx, 8 * sizeof (uint32_t)); \
> -               } \
> -               switch (n & 0x7) { \
> -               case 7: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 6: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 5: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 4: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 3: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 2: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               case 1: \
> -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> -               } \
> -       } else { \
> -               for (i = 0; idx < size; i++, idx++) \
> -                       obj[i] = ring[idx]; \
> -               for (idx = 0; i < n; i++, idx++) \
> -                       obj[i] = ring[idx]; \
> -       } \
> -} while (0)
> +static __rte_always_inline void
> +copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
> +uint32_t esize) {
> +       uint32_t i, sz;
> +
> +       sz = (num * esize) / sizeof(uint32_t);
> +
> +       for (i = 0; i < (sz & ~7); i += 8)
> +               memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> +
> +       switch (sz & 7) {
> +       case 7: du32[sz - 7] = su32[sz - 7]; /* fallthrough */
> +       case 6: du32[sz - 6] = su32[sz - 6]; /* fallthrough */
> +       case 5: du32[sz - 5] = su32[sz - 5]; /* fallthrough */
> +       case 4: du32[sz - 4] = su32[sz - 4]; /* fallthrough */
> +       case 3: du32[sz - 3] = su32[sz - 3]; /* fallthrough */
> +       case 2: du32[sz - 2] = su32[sz - 2]; /* fallthrough */
> +       case 1: du32[sz - 1] = su32[sz - 1]; /* fallthrough */
> +       }
> +}
> +
> +static __rte_always_inline void
> +enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
> +               void *obj_table, uint32_t num, uint32_t esize) {
> +       uint32_t idx, n;
> +       uint32_t *du32;
> +       const uint32_t *su32;
> +
> +       const uint32_t size = r->size;
> +
> +       idx = prod_head & (r)->mask;
Same here: 'idx' needs to be normalized to elements of type 'uint32_t', and other variables need similar fixes. I have applied your suggestion in 6/6 in v6 along with my corrections. The rte_ring_elem test cases are added in 3/6. I have verified that they run fine (they cover 64b elements only for now; I will add more). Hopefully, there are no more errors.

> +
> +       du32 = (uint32_t *)ring_start + idx;
> +       su32 = obj_table;
> +
> +       if (idx + num < size)
> +               copy_elems(du32, su32, num, esize);
> +       else {
> +               n = size - idx;
> +               copy_elems(du32, su32, n, esize);
> +               copy_elems(ring_start, su32 + n, num - n, esize);
> +       }
> +}
> +
> +static __rte_always_inline void
> +dequeue_elems(struct rte_ring *r, void *ring_start, uint32_t cons_head,
> +               void *obj_table, uint32_t num, uint32_t esize) {
> +       uint32_t idx, n;
> +       uint32_t *du32;
> +       const uint32_t *su32;
> +
> +       const uint32_t size = r->size;
> +
> +       idx = cons_head & (r)->mask;
> +
> +       su32 = (uint32_t *)ring_start + idx;
> +       du32 = obj_table;
> +
> +       if (idx + num < size)
> +               copy_elems(du32, su32, num, esize);
> +       else {
> +               n = size - idx;
> +               copy_elems(du32, su32, n, esize);
> +               copy_elems(du32 + n, ring_start, num - n, esize);
> +       }
> +}
> 
>  /* Between load and load. there might be cpu reorder in weak model
>   * (powerpc/arm).
> @@ -232,7 +231,7 @@ __rte_ring_do_enqueue_elem(struct rte_ring *r, void
> * const obj_table,
>         if (n == 0)
>                 goto end;
> 
> -       ENQUEUE_PTRS_GEN(r, &r[1], prod_head, obj_table, esize, n);
> +       enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);
> 
>         update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
>  end:
> @@ -279,7 +278,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void
> *obj_table,
>         if (n == 0)
>                 goto end;
> 
> -       DEQUEUE_PTRS_GEN(r, &r[1], cons_head, obj_table, esize, n);
> +       dequeue_elems(r, &r[1], cons_head, obj_table, n, esize);
> 
>         update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
> 
> --
> 2.17.1
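
For reference, the wrap-around copy in enqueue_elems above, with the index normalization applied, might look roughly like the following. This is a sketch and not the code in v6 6/6: `enqueue_elems_sketch` is a hypothetical name, the ring storage, size and mask are passed directly instead of through a `struct rte_ring` so that it compiles standalone, and plain memcpy stands in for copy_elems. It assumes esize is a multiple of 4 bytes.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Enqueue num elements of esize bytes each, starting at slot
 * (prod_head & mask). ring points to size slots of esize bytes,
 * addressed as uint32_t words.
 */
static inline void
enqueue_elems_sketch(uint32_t *ring, uint32_t size, uint32_t mask,
		uint32_t prod_head, const void *obj_table,
		uint32_t num, uint32_t esize)
{
	/* Number of uint32_t words per element. */
	const uint32_t scale = esize / (uint32_t)sizeof(uint32_t);
	const uint32_t *src = obj_table;
	uint32_t idx = prod_head & mask;	/* slot index, not word index */
	uint32_t n;

	assert(esize % sizeof(uint32_t) == 0 && num <= size);

	if (idx + num < size) {
		/* No wrap: one contiguous copy at the normalized offset. */
		memcpy(ring + idx * scale, src, (size_t)num * esize);
	} else {
		/* Wrap: fill up to the end of the ring, then restart at slot 0. */
		n = size - idx;
		memcpy(ring + idx * scale, src, (size_t)n * esize);
		memcpy(ring, src + n * scale, (size_t)(num - n) * esize);
	}
}
```

Note that the source pointer for the second copy is also advanced in words (n * scale), not in elements; that is one of the "other variables" that needs the same normalization.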
> 


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-21  0:36                             ` Honnappa Nagarahalli
@ 2019-10-21  9:04                               ` Ananyev, Konstantin
  2019-10-22 15:59                                 ` Ananyev, Konstantin
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-21  9:04 UTC (permalink / raw)
  To: Honnappa Nagarahalli, Jerin Jacob
  Cc: David Christensen, olivier.matz, sthemmin, jerinj, Richardson,
	Bruce, david.marchand, pbhagavatula, dev, Dharmik Thakkar,
	Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	stephen, nd, nd



> >
> > fix patch
> > =======
> >
> > From a2be5a9b136333a56d466ef042c655e522ca7012 Mon Sep 17 00:00:00
> > 2001
> > From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > Date: Fri, 18 Oct 2019 15:50:43 +0100
> > Subject: [PATCH] fix1
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > ---
> >  lib/librte_ring/rte_ring_elem.h | 4 ++--
> >  1 file changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> > index 92e92f150..5e1819069 100644
> > --- a/lib/librte_ring/rte_ring_elem.h
> > +++ b/lib/librte_ring/rte_ring_elem.h
> > @@ -118,7 +118,7 @@ struct rte_ring *rte_ring_create_elem(const char
> > *name, unsigned count,
> >         uint32_t sz = n * (esize / sizeof(uint32_t)); \
> >         if (likely(idx + n < size)) { \
> >                 for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> > -                       memcpy (ring + i, obj + i, 8 * sizeof (uint32_t)); \
> > +                       memcpy (ring + idx, obj + i, 8 * sizeof
> > + (uint32_t)); \
> >                 } \
> >                 switch (n & 0x7) { \
> >                 case 7: \
> > @@ -153,7 +153,7 @@ struct rte_ring *rte_ring_create_elem(const char
> > *name, unsigned count,
> >         uint32_t sz = n * (esize / sizeof(uint32_t)); \
> >         if (likely(idx + n < size)) { \
> >                 for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> > -                       memcpy (obj + i, ring + i, 8 * sizeof (uint32_t)); \
> > +                       memcpy (obj + i, ring + idx, 8 * sizeof
> Actually, this fix alone is not enough. 'idx' needs to be normalized to elements of type 'uint32_t'.
> 
> > + (uint32_t)); \
> >                 } \
> >                 switch (n & 0x7) { \
> >                 case 7: \
> > --
> > 2.17.1
> >
> > update patch (remove macros)
> > =========================
> >
> > From 18b388e877b97e243f807f27a323e876b30869dd Mon Sep 17 00:00:00
> > 2001
> > From: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > Date: Fri, 18 Oct 2019 17:35:43 +0100
> > Subject: [PATCH] update1
> >
> > Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> > ---
> >  lib/librte_ring/rte_ring_elem.h | 141 ++++++++++++++++----------------
> >  1 file changed, 70 insertions(+), 71 deletions(-)
> >
> > diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
> > index 5e1819069..eb706b12f 100644
> > --- a/lib/librte_ring/rte_ring_elem.h
> > +++ b/lib/librte_ring/rte_ring_elem.h
> > @@ -109,75 +109,74 @@ __rte_experimental  struct rte_ring
> > *rte_ring_create_elem(const char *name, unsigned count,
> >                                 unsigned esize, int socket_id, unsigned flags);
> >
> > -#define ENQUEUE_PTRS_GEN(r, ring_start, prod_head, obj_table, esize, n)
> > do { \
> > -       unsigned int i; \
> > -       const uint32_t size = (r)->size; \
> > -       uint32_t idx = prod_head & (r)->mask; \
> > -       uint32_t *ring = (uint32_t *)ring_start; \
> > -       uint32_t *obj = (uint32_t *)obj_table; \
> > -       uint32_t sz = n * (esize / sizeof(uint32_t)); \
> > -       if (likely(idx + n < size)) { \
> > -               for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> > -                       memcpy (ring + idx, obj + i, 8 * sizeof (uint32_t)); \
> > -               } \
> > -               switch (n & 0x7) { \
> > -               case 7: \
> > -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> > -               case 6: \
> > -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> > -               case 5: \
> > -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> > -               case 4: \
> > -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> > -               case 3: \
> > -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> > -               case 2: \
> > -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> > -               case 1: \
> > -                       ring[idx++] = obj[i++]; /* fallthrough */ \
> > -               } \
> > -       } else { \
> > -               for (i = 0; idx < size; i++, idx++)\
> > -                       ring[idx] = obj[i]; \
> > -               for (idx = 0; i < n; i++, idx++) \
> > -                       ring[idx] = obj[i]; \
> > -       } \
> > -} while (0)
> > -
> > -#define DEQUEUE_PTRS_GEN(r, ring_start, cons_head, obj_table, esize, n)
> > do { \
> > -       unsigned int i; \
> > -       uint32_t idx = cons_head & (r)->mask; \
> > -       const uint32_t size = (r)->size; \
> > -       uint32_t *ring = (uint32_t *)ring_start; \
> > -       uint32_t *obj = (uint32_t *)obj_table; \
> > -       uint32_t sz = n * (esize / sizeof(uint32_t)); \
> > -       if (likely(idx + n < size)) { \
> > -               for (i = 0; i < (sz & ((~(unsigned)0x7))); i += 8, idx += 8) { \
> > -                       memcpy (obj + i, ring + idx, 8 * sizeof (uint32_t)); \
> > -               } \
> > -               switch (n & 0x7) { \
> > -               case 7: \
> > -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> > -               case 6: \
> > -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> > -               case 5: \
> > -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> > -               case 4: \
> > -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> > -               case 3: \
> > -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> > -               case 2: \
> > -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> > -               case 1: \
> > -                       obj[i++] = ring[idx++]; /* fallthrough */ \
> > -               } \
> > -       } else { \
> > -               for (i = 0; idx < size; i++, idx++) \
> > -                       obj[i] = ring[idx]; \
> > -               for (idx = 0; i < n; i++, idx++) \
> > -                       obj[i] = ring[idx]; \
> > -       } \
> > -} while (0)
> > +static __rte_always_inline void
> > +copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t num,
> > +uint32_t esize) {
> > +       uint32_t i, sz;
> > +
> > +       sz = (num * esize) / sizeof(uint32_t);
> > +
> > +       for (i = 0; i < (sz & ~7); i += 8)
> > +               memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> > +
> > +       switch (sz & 7) {
> > +       case 7: du32[sz - 7] = su32[sz - 7]; /* fallthrough */
> > +       case 6: du32[sz - 6] = su32[sz - 6]; /* fallthrough */
> > +       case 5: du32[sz - 5] = su32[sz - 5]; /* fallthrough */
> > +       case 4: du32[sz - 4] = su32[sz - 4]; /* fallthrough */
> > +       case 3: du32[sz - 3] = su32[sz - 3]; /* fallthrough */
> > +       case 2: du32[sz - 2] = su32[sz - 2]; /* fallthrough */
> > +       case 1: du32[sz - 1] = su32[sz - 1]; /* fallthrough */
> > +       }
> > +}
> > +
> > +static __rte_always_inline void
> > +enqueue_elems(struct rte_ring *r, void *ring_start, uint32_t prod_head,
> > +               void *obj_table, uint32_t num, uint32_t esize) {
> > +       uint32_t idx, n;
> > +       uint32_t *du32;
> > +       const uint32_t *su32;
> > +
> > +       const uint32_t size = r->size;
> > +
> > +       idx = prod_head & (r)->mask;
> Same here, 'idx' needs to be normalized to elements of type 'uint32_t' and similar fixes on other variables.

Oops, true, my bad.

> I have applied your
> suggestion in 6/6 in v6 along with my corrections. The rte_ring_elem test cases are added in 3/6. I have verified that they are running
> fine (they are done for 64b alone, will add more). Hopefully, there are no more errors.

Cool, we'll re-run the perf tests on my box.
Thanks
Konstantin

> 
> > +
> > +       du32 = (uint32_t *)ring_start + idx;
> > +       su32 = obj_table;
> > +
> > +       if (idx + num < size)
> > +               copy_elems(du32, su32, num, esize);
> > +       else {
> > +               n = size - idx;
> > +               copy_elems(du32, su32, n, esize);
> > +               copy_elems(ring_start, su32 + n, num - n, esize);
> > +       }
> > +}
> > +
> > +static __rte_always_inline void
> > +dequeue_elems(struct rte_ring *r, void *ring_start, uint32_t cons_head,
> > +               void *obj_table, uint32_t num, uint32_t esize) {
> > +       uint32_t idx, n;
> > +       uint32_t *du32;
> > +       const uint32_t *su32;
> > +
> > +       const uint32_t size = r->size;
> > +
> > +       idx = cons_head & (r)->mask;
> > +
> > +       su32 = (uint32_t *)ring_start + idx;
> > +       du32 = obj_table;
> > +
> > +       if (idx + num < size)
> > +               copy_elems(du32, su32, num, esize);
> > +       else {
> > +               n = size - idx;
> > +               copy_elems(du32, su32, n, esize);
> > +               copy_elems(du32 + n, ring_start, num - n, esize);
> > +       }
> > +}
> >
> >  /* Between load and load. there might be cpu reorder in weak model
> >   * (powerpc/arm).
> > @@ -232,7 +231,7 @@ __rte_ring_do_enqueue_elem(struct rte_ring *r, void * const obj_table,
> >         if (n == 0)
> >                 goto end;
> >
> > -       ENQUEUE_PTRS_GEN(r, &r[1], prod_head, obj_table, esize, n);
> > +       enqueue_elems(r, &r[1], prod_head, obj_table, n, esize);
> >
> >         update_tail(&r->prod, prod_head, prod_next, is_sp, 1);
> >  end:
> > @@ -279,7 +278,7 @@ __rte_ring_do_dequeue_elem(struct rte_ring *r, void *obj_table,
> >         if (n == 0)
> >                 goto end;
> >
> > -       DEQUEUE_PTRS_GEN(r, &r[1], cons_head, obj_table, esize, n);
> > +       dequeue_elems(r, &r[1], cons_head, obj_table, n, esize);
> >
> >         update_tail(&r->cons, cons_head, cons_next, is_sc, 0);
> >
> > --
> > 2.17.1
> >
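
A minimal sketch of the index normalization discussed in this message, written as a stand-alone model rather than the DPDK code. The function name sketch_enqueue and its parameters are hypothetical. The point it illustrates: the slot index is computed in elements, so it has to be scaled by the number of 32-bit words per element before it is used as an offset into a uint32_t view of the ring, and the same scaling applies to the source offset after a wrap.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-alone model of the enqueue copy (not the DPDK code).
 *  ring:  backing storage viewed as 32-bit words
 *  size:  ring capacity in elements (power of two)
 *  head:  producer head, counted in elements
 *  esize: element size in bytes, a multiple of 4
 */
static void
sketch_enqueue(uint32_t *ring, uint32_t size, uint32_t head,
		const void *obj_table, uint32_t num, uint32_t esize)
{
	const uint32_t scale = esize / sizeof(uint32_t); /* words per element */
	const uint32_t *src = obj_table;
	uint32_t idx = head & (size - 1);	/* slot index, in elements */

	assert(esize % sizeof(uint32_t) == 0);

	if (idx + num < size) {
		/* No wrap: scale the element index to a word offset. */
		memcpy(ring + idx * scale, src, (size_t)num * esize);
	} else {
		uint32_t n = size - idx;	/* elements before the wrap */

		memcpy(ring + idx * scale, src, (size_t)n * esize);
		/* The source offset is scaled to words as well. */
		memcpy(ring, src + n * scale, (size_t)(num - n) * esize);
	}
}
```

Without the two multiplications by scale, an 8-byte element ring would be indexed as if its elements were 4 bytes wide, which is the kind of error pointed out above.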


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-21  9:04                               ` Ananyev, Konstantin
@ 2019-10-22 15:59                                 ` Ananyev, Konstantin
  2019-10-22 17:57                                   ` Ananyev, Konstantin
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-22 15:59 UTC (permalink / raw)
  To: 'Honnappa Nagarahalli', 'Jerin Jacob'
  Cc: 'David Christensen', 'olivier.matz@6wind.com',
	'sthemmin@microsoft.com', 'jerinj@marvell.com',
	Richardson, Bruce, 'david.marchand@redhat.com',
	'pbhagavatula@marvell.com', 'dev@dpdk.org',
	'Dharmik Thakkar',
	'Ruifeng Wang (Arm Technology China)',
	'Gavin Hu (Arm Technology China)',
	'stephen@networkplumber.org', 'nd', 'nd'



> > I have applied your
> > suggestion in 6/6 in v6 along with my corrections. The rte_ring_elem test cases are added in 3/6. I have verified that they are running
> > fine (they are done for 64b alone, will add more). Hopefully, there are no more errors.

Applied v6 and re-ran the tests.
Functional tests pass OK on my boxes.
Perf-test numbers below.
As far as I can see, pretty much the same pattern as in v5 remains:
MP/MC on 2 different cores and SP/SC single enq/deq
show lower numbers for _elem_.
For others _elem_ numbers are about the same or higher.
Personally, I am ok to go ahead with these changes. 
Konstantin

A - ring_perf_autotest
B - ring_perf_elem_autotest

 ### Testing single element and burst enq/deq ###	A	B
SP/SC single enq/dequeue: 				8.27	10.94	
MP/MC single enq/dequeue: 				56.11	47.43
SP/SC burst enq/dequeue (size: 8): 			4.20	3.50
MP/MC burst enq/dequeue (size: 8): 			9.93	9.29
SP/SC burst enq/dequeue (size: 32): 			2.93	1.94
MP/MC burst enq/dequeue (size: 32): 			4.10	3.35

### Testing empty dequeue ###
SC empty dequeue: 					2.00	3.00
MC empty dequeue: 					3.00	2.00

### Testing using a single lcore ###
SP/SC bulk enq/dequeue (size: 8): 			4.06	3.30	
MP/MC bulk enq/dequeue (size: 8): 			9.84	9.28
SP/SC bulk enq/dequeue (size: 32): 			2.93	1.88
MP/MC bulk enq/dequeue (size: 32): 			4.10	3.32

### Testing using two hyperthreads ###
SP/SC bulk enq/dequeue (size: 8): 			9.22	8.83
MP/MC bulk enq/dequeue (size: 8): 			15.73	15.86
SP/SC bulk enq/dequeue (size: 32): 			5.78	3.83
MP/MC bulk enq/dequeue (size: 32): 			6.33	4.53

### Testing using two physical cores ###
SP/SC bulk enq/dequeue (size: 8): 			23.78	19.32
MP/MC bulk enq/dequeue (size: 8): 			68.54	71.97
SP/SC bulk enq/dequeue (size: 32): 			11.99	10.77
MP/MC bulk enq/dequeue (size: 32): 			21.96	18.66

### Testing using two NUMA nodes ###
SP/SC bulk enq/dequeue (size: 8): 			50.13	33.92
MP/MC bulk enq/dequeue (size: 8): 			177.98	195.87
SP/SC bulk enq/dequeue (size: 32): 			32.98	23.12
MP/MC bulk enq/dequeue (size: 32): 			55.86	48.76


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-22 15:59                                 ` Ananyev, Konstantin
@ 2019-10-22 17:57                                   ` Ananyev, Konstantin
  2019-10-23 18:58                                     ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-22 17:57 UTC (permalink / raw)
  To: Ananyev, Konstantin, 'Honnappa Nagarahalli',
	'Jerin Jacob'
  Cc: 'David Christensen', 'olivier.matz@6wind.com',
	'sthemmin@microsoft.com', 'jerinj@marvell.com',
	Richardson, Bruce, 'david.marchand@redhat.com',
	'pbhagavatula@marvell.com', 'dev@dpdk.org',
	'Dharmik Thakkar',
	'Ruifeng Wang (Arm Technology China)',
	'Gavin Hu (Arm Technology China)',
	'stephen@networkplumber.org', 'nd', 'nd'



> -----Original Message-----
> From: dev <dev-bounces@dpdk.org> On Behalf Of Ananyev, Konstantin
> Sent: Tuesday, October 22, 2019 5:00 PM
> To: 'Honnappa Nagarahalli' <Honnappa.Nagarahalli@arm.com>; 'Jerin Jacob' <jerinjacobk@gmail.com>
> Cc: 'David Christensen' <drc@linux.vnet.ibm.com>; 'olivier.matz@6wind.com' <olivier.matz@6wind.com>; 'sthemmin@microsoft.com'
> <sthemmin@microsoft.com>; 'jerinj@marvell.com' <jerinj@marvell.com>; Richardson, Bruce <bruce.richardson@intel.com>;
> 'david.marchand@redhat.com' <david.marchand@redhat.com>; 'pbhagavatula@marvell.com' <pbhagavatula@marvell.com>;
> 'dev@dpdk.org' <dev@dpdk.org>; 'Dharmik Thakkar' <Dharmik.Thakkar@arm.com>; 'Ruifeng Wang (Arm Technology China)'
> <Ruifeng.Wang@arm.com>; 'Gavin Hu (Arm Technology China)' <Gavin.Hu@arm.com>; 'stephen@networkplumber.org'
> <stephen@networkplumber.org>; 'nd' <nd@arm.com>; 'nd' <nd@arm.com>
> Subject: Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
> 
> 
> 
> > > I have applied your
> > > suggestion in 6/6 in v6 along with my corrections. The rte_ring_elem test cases are added in 3/6. I have verified that they are running
> > > fine (they are done for 64b alone, will add more). Hopefully, there are no more errors.
> 
> Applied v6 and re-ran the tests.
> Functional tests pass OK on my boxes.
> Perf-test numbers below.
> As far as I can see, pretty much the same pattern as in v5 remains:
> MP/MC on 2 different cores

Forgot to add: this applies to 8 elems; for 32, the new ones are always better.

> and SP/SC single enq/deq
> show lower numbers for _elem_.
> For others _elem_ numbers are about the same or higher.
> Personally, I am ok to go ahead with these changes.
> Konstantin
> 


^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size
  2019-10-21  0:22   ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
                       ` (5 preceding siblings ...)
  2019-10-21  0:23     ` [dpdk-dev] [RFC v6 6/6] lib/ring: improved copy function to copy ring elements Honnappa Nagarahalli
@ 2019-10-23  9:48     ` Olivier Matz
  6 siblings, 0 replies; 173+ messages in thread
From: Olivier Matz @ 2019-10-23  9:48 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: sthemmin, jerinj, bruce.richardson, david.marchand, pbhagavatula,
	konstantin.ananyev, drc, hemant.agrawal, dev, dharmik.thakkar,
	ruifeng.wang, gavin.hu

Hi Honnappa,

On Sun, Oct 20, 2019 at 07:22:54PM -0500, Honnappa Nagarahalli wrote:
> The current rte_ring hard-codes the type of the ring element to 'void *',
> hence the size of the element is hard-coded to 32b/64b. Since the ring
> element type is not an input to rte_ring APIs, it results in couple
> of issues:
> 
> 1) If an application requires to store an element which is not 64b, it
>    needs to write its own ring APIs similar to rte_event_ring APIs. This
>    creates additional burden on the programmers, who end up making
>    work-arounds and often waste memory.
> 2) If there are multiple libraries that store elements of the same
>    type, currently they would have to write their own rte_ring APIs. This
>    results in code duplication.
> 
> This patch adds new APIs to support configurable ring element size.
> The APIs support custom element sizes by allowing to define the ring
> element to be a multiple of 32b.
> 
> The aim is to achieve same performance as the existing ring
> implementation. The patch adds same performance tests that are run
> for existing APIs. This allows for performance comparison.
> 
> I also tested with memcpy. x86 shows significant improvements on bulk
> and burst tests. On the Arm platform I used, there is a drop of
> 4% to 6% in a few tests. Maybe this is something that we can explore
> later.
> 
> Note that this version skips changes to other libraries as I would
> like to get an agreement on the implementation from the community.
> They will be added once there is agreement on the rte_ring changes.
> 
> v6
>  - Labelled as RFC to indicate the better status
>  - Added unit tests to test the rte_ring_xxx_elem APIs
>  - Corrected 'macro based partial memcpy' (5/6) patch
>  - Added Konstantin's method after correction (6/6)
>  - Check Patch shows significant warnings and errors, mainly due to
>    copying code from existing test cases. None of them are harmful.
>    I will fix them once we have an agreement.
> 
> v5
>  - Use memcpy for chunks of 32B (Konstantin).
>  - Both 'ring_perf_autotest' and 'ring_perf_elem_autotest' are available
>    to compare the results easily.
>  - Copying without memcpy is also available in 1/3, if anyone wants to
>    experiment on their platform.
>  - Added other platform owners to test on their respective platforms.
> 
> v4
>  - Few fixes after more performance testing
> 
> v3
>  - Removed macro-fest and used inline functions
>    (Stephen, Bruce)
> 
> v2
>  - Change Event Ring implementation to use ring templates
>    (Jerin, Pavan)
> 
> Honnappa Nagarahalli (6):
>   test/ring: use division for cycle count calculation
>   lib/ring: apis to support configurable element size
>   test/ring: add functional tests for configurable element size ring
>   test/ring: add perf tests for configurable element size ring
>   lib/ring: copy ring elements using memcpy partially
>   lib/ring: improved copy function to copy ring elements
> 
>  app/test/Makefile                    |   2 +
>  app/test/meson.build                 |   2 +
>  app/test/test_ring_elem.c            | 859 +++++++++++++++++++++++++++
>  app/test/test_ring_perf.c            |  22 +-
>  app/test/test_ring_perf_elem.c       | 419 +++++++++++++
>  lib/librte_ring/Makefile             |   3 +-
>  lib/librte_ring/meson.build          |   4 +
>  lib/librte_ring/rte_ring.c           |  34 +-
>  lib/librte_ring/rte_ring.h           |   1 +
>  lib/librte_ring/rte_ring_elem.h      | 818 +++++++++++++++++++++++++
>  lib/librte_ring/rte_ring_version.map |   2 +
>  11 files changed, 2147 insertions(+), 19 deletions(-)
>  create mode 100644 app/test/test_ring_elem.c
>  create mode 100644 app/test/test_ring_perf_elem.c
>  create mode 100644 lib/librte_ring/rte_ring_elem.h

Sorry, I come a day after the fair.

I have only a few comments on the form (I'll reply to individual
patches). On the substance, it looks good to me. I also feel this
version is much better than the template-based versions.

Thanks
Olivier

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [RFC v6 1/6] test/ring: use division for cycle count calculation
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 1/6] test/ring: use division for cycle count calculation Honnappa Nagarahalli
@ 2019-10-23  9:49       ` Olivier Matz
  0 siblings, 0 replies; 173+ messages in thread
From: Olivier Matz @ 2019-10-23  9:49 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: sthemmin, jerinj, bruce.richardson, david.marchand, pbhagavatula,
	konstantin.ananyev, drc, hemant.agrawal, dev, dharmik.thakkar,
	ruifeng.wang, gavin.hu

On Sun, Oct 20, 2019 at 07:22:55PM -0500, Honnappa Nagarahalli wrote:
> Use division instead of modulo operation to calculate more
> accurate cycle count.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>

Acked-by: Olivier Matz <olivier.matz@6wind.com>

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [RFC v6 2/6] lib/ring: apis to support configurable element size
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 2/6] lib/ring: apis to support configurable element size Honnappa Nagarahalli
@ 2019-10-23  9:59       ` Olivier Matz
  2019-10-23 19:12         ` Honnappa Nagarahalli
  0 siblings, 1 reply; 173+ messages in thread
From: Olivier Matz @ 2019-10-23  9:59 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: sthemmin, jerinj, bruce.richardson, david.marchand, pbhagavatula,
	konstantin.ananyev, drc, hemant.agrawal, dev, dharmik.thakkar,
	ruifeng.wang, gavin.hu

On Sun, Oct 20, 2019 at 07:22:56PM -0500, Honnappa Nagarahalli wrote:
> Current APIs assume ring elements to be pointers. However, in many
> use cases, the size can be different. Add new APIs to support
> configurable ring element sizes.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> ---
>  lib/librte_ring/Makefile             |   3 +-
>  lib/librte_ring/meson.build          |   4 +
>  lib/librte_ring/rte_ring.c           |  44 +-
>  lib/librte_ring/rte_ring.h           |   1 +
>  lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
>  lib/librte_ring/rte_ring_version.map |   2 +
>  6 files changed, 991 insertions(+), 9 deletions(-)
>  create mode 100644 lib/librte_ring/rte_ring_elem.h

(...)

> +/* the actual enqueue of pointers on the ring.
> + * Placed here since identical code needed in both
> + * single and multi producer enqueue functions.
> + */
> +#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
> +	if (esize == 4) \
> +		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
> +	else if (esize == 8) \
> +		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
> +	else if (esize == 16) \
> +		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
> +} while (0)

My initial thinking was that these could be static inline functions
instead of macros. I see that patches 5 and 6 are changing it. I wonder
however if patches 5 and 6 shouldn't be merged and moved before this
one: it would avoid introducing new macros that are removed afterwards.
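
A minimal sketch of what the static inline alternative to ENQUEUE_PTRS_ELEM could look like, under stated assumptions. The helpers put_u32 and put_u64 are hypothetical stand-ins for ENQUEUE_PTRS_32 and ENQUEUE_PTRS_64, the 16-byte case is left out for brevity, and returning -1 for an unsupported size is an illustrative choice (the quoted macro simply does nothing in that case).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for ENQUEUE_PTRS_32: copy n 4-byte elements. */
static inline void
put_u32(uint32_t *ring, uint32_t size, uint32_t idx,
	const uint32_t *obj, uint32_t n)
{
	uint32_t i;

	for (i = 0; i < n; i++)
		ring[(idx + i) & (size - 1)] = obj[i];
}

/* Hypothetical stand-in for ENQUEUE_PTRS_64: copy n 8-byte elements. */
static inline void
put_u64(uint64_t *ring, uint32_t size, uint32_t idx,
	const uint64_t *obj, uint32_t n)
{
	uint32_t i;

	for (i = 0; i < n; i++)
		ring[(idx + i) & (size - 1)] = obj[i];
}

/*
 * Inline function replacing the dispatch macro. Arguments are type
 * checked and evaluated once, unlike macro parameters.
 */
static inline int
enqueue_elems_sketch(void *ring_start, uint32_t size, uint32_t prod_head,
	const void *obj_table, uint32_t esize, uint32_t n)
{
	const uint32_t idx = prod_head & (size - 1);

	assert(size != 0 && (size & (size - 1)) == 0);

	switch (esize) {
	case 4:
		put_u32(ring_start, size, idx, obj_table, n);
		return 0;
	case 8:
		put_u64(ring_start, size, idx, obj_table, n);
		return 0;
	default:
		return -1;	/* unsupported element size */
	}
}
```

When esize is a compile-time constant at the call site, a compiler that inlines this function can usually drop the unused branches, so the dispatch need not cost anything at run time.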

(...)

> +/**
> + * @internal Enqueue several objects on the ring
> + *
> + * @param r
> + *   A pointer to the ring structure.
> + * @param obj_table
> + *   A pointer to a table of void * pointers (objects).
> + * @param esize
> + *   The size of ring element, in bytes. It must be a multiple of 4.
> + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> + *   as passed while creating the ring, otherwise the results are undefined.

The comments "It must be a multiple of 4" and "Currently, sizes 4, 8 and 16 are
supported" are redundant (they appear several times in the file). The second one
should be removed by patch 5 (I think that is missing?).

But if patches 5 and 6 are moved before this one, only "It must be a multiple of
4" would be needed, I think, and there would be no transition with only 3
supported sizes.

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [RFC v6 3/6] test/ring: add functional tests for configurable element size ring
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 3/6] test/ring: add functional tests for configurable element size ring Honnappa Nagarahalli
@ 2019-10-23 10:01       ` Olivier Matz
  2019-10-23 11:12         ` Ananyev, Konstantin
  0 siblings, 1 reply; 173+ messages in thread
From: Olivier Matz @ 2019-10-23 10:01 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: sthemmin, jerinj, bruce.richardson, david.marchand, pbhagavatula,
	konstantin.ananyev, drc, hemant.agrawal, dev, dharmik.thakkar,
	ruifeng.wang, gavin.hu

On Sun, Oct 20, 2019 at 07:22:57PM -0500, Honnappa Nagarahalli wrote:
> Add functional tests for rte_ring_xxx_elem APIs. At this point these
> are derived mainly from existing rte_ring_xxx test cases.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  app/test/Makefile         |   1 +
>  app/test/meson.build      |   1 +
>  app/test/test_ring_elem.c | 859 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 861 insertions(+)
>  create mode 100644 app/test/test_ring_elem.c

Given the few differences between test_ring_elem.c and test_ring.c, wouldn't
it be possible to have both tests in the same file?

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [RFC v6 4/6] test/ring: add perf tests for configurable element size ring
  2019-10-21  0:22     ` [dpdk-dev] [RFC v6 4/6] test/ring: add perf " Honnappa Nagarahalli
@ 2019-10-23 10:02       ` Olivier Matz
  0 siblings, 0 replies; 173+ messages in thread
From: Olivier Matz @ 2019-10-23 10:02 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: sthemmin, jerinj, bruce.richardson, david.marchand, pbhagavatula,
	konstantin.ananyev, drc, hemant.agrawal, dev, dharmik.thakkar,
	ruifeng.wang, gavin.hu

On Sun, Oct 20, 2019 at 07:22:58PM -0500, Honnappa Nagarahalli wrote:
> Add performance tests for rte_ring_xxx_elem APIs. At this point these
> are derived mainly from existing rte_ring_xxx test cases.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> ---
>  app/test/Makefile              |   1 +
>  app/test/meson.build           |   1 +
>  app/test/test_ring_perf_elem.c | 419 +++++++++++++++++++++++++++++++++
>  3 files changed, 421 insertions(+)
>  create mode 100644 app/test/test_ring_perf_elem.c

Same question as for the previous patch: can it be merged with test_ring_perf.c?

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [RFC v6 6/6] lib/ring: improved copy function to copy ring elements
  2019-10-21  0:23     ` [dpdk-dev] [RFC v6 6/6] lib/ring: improved copy function to copy ring elements Honnappa Nagarahalli
@ 2019-10-23 10:05       ` Olivier Matz
  0 siblings, 0 replies; 173+ messages in thread
From: Olivier Matz @ 2019-10-23 10:05 UTC (permalink / raw)
  To: Honnappa Nagarahalli
  Cc: sthemmin, jerinj, bruce.richardson, david.marchand, pbhagavatula,
	konstantin.ananyev, drc, hemant.agrawal, dev, dharmik.thakkar,
	ruifeng.wang, gavin.hu

On Sun, Oct 20, 2019 at 07:23:00PM -0500, Honnappa Nagarahalli wrote:
> Improved copy function to copy to/from ring elements.
> 
> Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> Signed-off-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
> ---
>  lib/librte_ring/rte_ring_elem.h | 165 ++++++++++++++++----------------
>  1 file changed, 84 insertions(+), 81 deletions(-)

(...)

> +static __rte_always_inline void
> +copy_elems(uint32_t du32[], const uint32_t su32[], uint32_t nr_num)
> +{
> +	uint32_t i;
> +
> +	for (i = 0; i < (nr_num & ~7); i += 8)
> +		memcpy(du32 + i, su32 + i, 8 * sizeof(uint32_t));
> +
> +	switch (nr_num & 7) {
> +	case 7: du32[nr_num - 7] = su32[nr_num - 7]; /* fallthrough */
> +	case 6: du32[nr_num - 6] = su32[nr_num - 6]; /* fallthrough */
> +	case 5: du32[nr_num - 5] = su32[nr_num - 5]; /* fallthrough */
> +	case 4: du32[nr_num - 4] = su32[nr_num - 4]; /* fallthrough */
> +	case 3: du32[nr_num - 3] = su32[nr_num - 3]; /* fallthrough */
> +	case 2: du32[nr_num - 2] = su32[nr_num - 2]; /* fallthrough */
> +	case 1: du32[nr_num - 1] = su32[nr_num - 1]; /* fallthrough */
> +	}
> +}

minor comment: I suggest src32 and dst32 instead of su32 and du32.
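
For reference, a stand-alone sketch of the quoted copy function with the suggested dst32/src32 names. It follows the quoted patch but drops __rte_always_inline so that it compiles without DPDK headers; the name copy_elems_sketch and the NULL check are additions for illustration only.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy nr_num 32-bit words from src32 to dst32 (non-overlapping). */
static inline void
copy_elems_sketch(uint32_t dst32[], const uint32_t src32[], uint32_t nr_num)
{
	uint32_t i;

	assert(dst32 != NULL && src32 != NULL);

	/* Bulk part: fixed-size 32-byte chunks. */
	for (i = 0; i < (nr_num & ~7u); i += 8)
		memcpy(dst32 + i, src32 + i, 8 * sizeof(uint32_t));

	/* Tail: the remaining 0..7 words, indexed back from the end. */
	switch (nr_num & 7) {
	case 7: dst32[nr_num - 7] = src32[nr_num - 7]; /* fallthrough */
	case 6: dst32[nr_num - 6] = src32[nr_num - 6]; /* fallthrough */
	case 5: dst32[nr_num - 5] = src32[nr_num - 5]; /* fallthrough */
	case 4: dst32[nr_num - 4] = src32[nr_num - 4]; /* fallthrough */
	case 3: dst32[nr_num - 3] = src32[nr_num - 3]; /* fallthrough */
	case 2: dst32[nr_num - 2] = src32[nr_num - 2]; /* fallthrough */
	case 1: dst32[nr_num - 1] = src32[nr_num - 1]; /* fallthrough */
	}
}
```

The fixed 32-byte memcpy in the loop has a constant size, which compilers can typically expand inline instead of emitting a library call; the switch then handles whatever is left without a second loop.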

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [RFC v6 3/6] test/ring: add functional tests for configurable element size ring
  2019-10-23 10:01       ` Olivier Matz
@ 2019-10-23 11:12         ` Ananyev, Konstantin
  0 siblings, 0 replies; 173+ messages in thread
From: Ananyev, Konstantin @ 2019-10-23 11:12 UTC (permalink / raw)
  To: Olivier Matz, Honnappa Nagarahalli
  Cc: sthemmin, jerinj, Richardson, Bruce, david.marchand,
	pbhagavatula, drc, hemant.agrawal, dev, dharmik.thakkar,
	ruifeng.wang, gavin.hu


> 
> On Sun, Oct 20, 2019 at 07:22:57PM -0500, Honnappa Nagarahalli wrote:
> > Add functional tests for rte_ring_xxx_elem APIs. At this point these
> > are derived mainly from existing rte_ring_xxx test cases.
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > ---
> >  app/test/Makefile         |   1 +
> >  app/test/meson.build      |   1 +
> >  app/test/test_ring_elem.c | 859 ++++++++++++++++++++++++++++++++++++++
> >  3 files changed, 861 insertions(+)
> >  create mode 100644 app/test/test_ring_elem.c
> 
> Given the few differences between test_ring_elem.c and test_ring.c, wouldn't
> it be possible to have both tests in the same file?

+1 to reduce duplication...
Maybe move the common code into a .h file and have the actual
enqueue/dequeue calls as defines.
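
One way to read this suggestion, sketched with a toy ring instead of rte_ring: the test body is written once as a macro (as it might live in a shared header) and instantiated twice with different enqueue/dequeue defines. Every name below (toy_ring, toy_enqueue, DEFINE_FIFO_TEST and so on) is hypothetical and only illustrates the pattern, not the actual test code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TOY_SLOTS 8u	/* power of two */

struct toy_ring {
	uint64_t slot[TOY_SLOTS];
	uint32_t head, tail;
};

/* Legacy-style API: the value is passed directly. */
static int toy_enqueue(struct toy_ring *r, uint64_t v)
{
	if (r->head - r->tail == TOY_SLOTS)
		return -1;	/* full */
	r->slot[r->head++ & (TOY_SLOTS - 1)] = v;
	return 0;
}

static int toy_dequeue(struct toy_ring *r, uint64_t *v)
{
	if (r->head == r->tail)
		return -1;	/* empty */
	*v = r->slot[r->tail++ & (TOY_SLOTS - 1)];
	return 0;
}

/* _elem-style API: the element is passed by address with its size. */
static int toy_enqueue_elem(struct toy_ring *r, const void *obj, uint32_t esize)
{
	uint64_t v = 0;

	assert(esize <= sizeof(v));
	memcpy(&v, obj, esize);
	return toy_enqueue(r, v);
}

static int toy_dequeue_elem(struct toy_ring *r, void *obj, uint32_t esize)
{
	uint64_t v;

	assert(esize <= sizeof(v));
	if (toy_dequeue(r, &v) != 0)
		return -1;
	memcpy(obj, &v, esize);
	return 0;
}

/* Common test body; ENQ and DEQ are the per-API defines. */
#define DEFINE_FIFO_TEST(name, ENQ, DEQ) \
static int name(void) \
{ \
	struct toy_ring r = { .head = 0, .tail = 0 }; \
	uint64_t in, out = 0; \
	for (in = 1; in <= 5; in++) \
		if (ENQ(&r, in) != 0) \
			return -1; \
	for (in = 1; in <= 5; in++) \
		if (DEQ(&r, out) != 0 || out != in) \
			return -1; \
	return 0; \
}

#define ENQ_PLAIN(r, v)	toy_enqueue((r), (v))
#define DEQ_PLAIN(r, v)	toy_dequeue((r), &(v))
#define ENQ_ELEM(r, v)	toy_enqueue_elem((r), &(v), sizeof(v))
#define DEQ_ELEM(r, v)	toy_dequeue_elem((r), &(v), sizeof(v))

DEFINE_FIFO_TEST(test_fifo_plain, ENQ_PLAIN, DEQ_PLAIN)
DEFINE_FIFO_TEST(test_fifo_elem, ENQ_ELEM, DEQ_ELEM)
```

Both generated functions run the same checks, so a fix to the shared body covers both API families at once.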

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [PATCH v4 1/2] lib/ring: apis to support configurable element size
  2019-10-22 17:57                                   ` Ananyev, Konstantin
@ 2019-10-23 18:58                                     ` Honnappa Nagarahalli
  0 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-23 18:58 UTC (permalink / raw)
  To: Ananyev, Konstantin, 'Jerin Jacob'
  Cc: 'David Christensen', 'olivier.matz@6wind.com',
	'sthemmin@microsoft.com',
	jerinj, Richardson, Bruce, 'david.marchand@redhat.com',
	'pbhagavatula@marvell.com', 'dev@dpdk.org',
	Dharmik Thakkar, Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	'stephen@networkplumber.org',
	nd, Honnappa Nagarahalli, nd

<snip>
> >
> > > > I have applied your
> > > > suggestion in 6/6 in v6 along with my corrections. The
> > > > rte_ring_elem test cases are added in 3/6. I have verified that they are
> > > > running fine (they are done for 64b alone, will add more). Hopefully, there are
> > > > no more errors.
> >
> > Applied v6 and re-ran the tests.
> > Functional tests pass OK on my boxes.
> > Perf-test numbers below.
> > As far as I can see, pretty much the same pattern as in v5 remains:
> > MP/MC on 2 different cores
> 
> Forgot to add: this applies to 8 elems; for 32, the new ones are always better.
> 
> > and SP/SC single enq/deq
> > show lower numbers for _elem_.
> > For others _elem_ numbers are about the same or higher.
> > Personally, I am ok to go ahead with these changes.
> > Konstantin
> >

Thanks Konstantin. The performance of 5/6 is mostly worse than that of 6/6, so we should not consider 5/6 (it will not be included in future versions).
A - ring_perf_autotest (existing code)
B - ring_perf_elem_autotest (6/6)

Numbers from my side:
On one Arm platform:
### Testing single element and burst enq/deq ###	A	B
SP/SC single enq/dequeue:				1.04	1.06 (1.92)
MP/MC single enq/dequeue: 				1.46	1.51 (3.42)
SP/SC burst enq/dequeue (size: 8): 			0.18	0.17 (-5.55)
MP/MC burst enq/dequeue (size: 8): 			0.23	0.22 (-4.34)
SP/SC burst enq/dequeue (size: 32): 			0.05	0.05 (0)
MP/MC burst enq/dequeue (size: 32): 			0.07	0.06 (-14.28)
	
### Testing empty dequeue ###	
SC empty dequeue: 					0.27	0.27 (0)
MC empty dequeue: 					0.27	0.27 (0)
	
### Testing using a single lcore ###	
SP/SC bulk enq/dequeue (size: 8): 			0.18	0.17 (-5.55)
MP/MC bulk enq/dequeue (size: 8): 			0.23	0.23 (0)
SP/SC bulk enq/dequeue (size: 32): 			0.05	0.05 (0)
MP/MC bulk enq/dequeue (size: 32): 			0.07	0.06 (0)
	
### Testing using two physical cores ###	
SP/SC bulk enq/dequeue (size: 8): 			0.79	0.79 (0)
MP/MC bulk enq/dequeue (size: 8): 			1.42	1.37 (-3.52)
SP/SC bulk enq/dequeue (size: 32): 			0.20	0.20 (0)
MP/MC bulk enq/dequeue (size: 32): 			0.33	0.35 (6.06)

On another Arm platform:

### Testing single element and burst enq/deq ###	A	B	
SP/SC single enq/dequeue:				11.54	11.79 (2.16)
MP/MC single enq/dequeue: 				11.84	12.54 (5.91)
SP/SC burst enq/dequeue (size: 8): 			1.51	1.33   (-11.92)
MP/MC burst enq/dequeue (size: 8): 			1.91	1.73   (-9.42)
SP/SC burst enq/dequeue (size: 32): 			0.62	0.42   (-32.25)
MP/MC burst enq/dequeue (size: 32): 			0.72	0.52   (-27.77)
	
### Testing empty dequeue ###	
SC empty dequeue: 					2.48	2.48 (0)
MC empty dequeue: 					2.48	2.48 (0)
	
### Testing using a single lcore ###	
SP/SC bulk enq/dequeue (size: 8): 			1.52	1.33 (-12.5)
MP/MC bulk enq/dequeue (size: 8): 			1.92	1.73 (-9.89)
SP/SC bulk enq/dequeue (size: 32): 			0.62	0.42 (-32.25)
MP/MC bulk enq/dequeue (size: 32): 			0.72	0.52 (-27.77)
	
### Testing using two physical cores ###	
SP/SC bulk enq/dequeue (size: 8): 			6.30	6.57   (4.28)
MP/MC bulk enq/dequeue (size: 8): 			10.59	10.45 (-1.32)
SP/SC bulk enq/dequeue (size: 32): 			1.92	1.58   (-17.70)
MP/MC bulk enq/dequeue (size: 32): 			2.51	2.47   (-1.59)

From my side, I would say let us just go with patch 2/6.

Jerin/David, any opinion on your side?
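
The bracketed figures in the tables above appear to be the percentage change of B relative to A, truncated to two decimals. Assuming the perf tests report cycle counts (so a negative change would mean B needs fewer cycles), the calculation can be sketched as follows; pct_change is a hypothetical helper name.

```c
#include <assert.h>

/* Percentage change of b relative to a; a must be non-zero. */
static double
pct_change(double a, double b)
{
	assert(a != 0.0);
	return (b - a) / a * 100.0;
}
```

For example, A = 1.04 and B = 1.06 give roughly +1.92, and A = 0.62 and B = 0.42 give roughly -32.25, matching the corresponding rows.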

^ permalink raw reply	[flat|nested] 173+ messages in thread

* Re: [dpdk-dev] [RFC v6 2/6] lib/ring: apis to support configurable element size
  2019-10-23  9:59       ` Olivier Matz
@ 2019-10-23 19:12         ` Honnappa Nagarahalli
  0 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-10-23 19:12 UTC (permalink / raw)
  To: Olivier Matz
  Cc: sthemmin, jerinj, bruce.richardson, david.marchand, pbhagavatula,
	konstantin.ananyev, drc, hemant.agrawal, dev, Dharmik Thakkar,
	Ruifeng Wang (Arm Technology China),
	Gavin Hu (Arm Technology China),
	Honnappa Nagarahalli, nd, nd

> 
> On Sun, Oct 20, 2019 at 07:22:56PM -0500, Honnappa Nagarahalli wrote:
> > Current APIs assume ring elements to be pointers. However, in many use
> > cases, the size can be different. Add new APIs to support configurable
> > ring element sizes.
> >
> > Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
> > Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
> > Reviewed-by: Gavin Hu <gavin.hu@arm.com>
> > Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
> > ---
> >  lib/librte_ring/Makefile             |   3 +-
> >  lib/librte_ring/meson.build          |   4 +
> >  lib/librte_ring/rte_ring.c           |  44 +-
> >  lib/librte_ring/rte_ring.h           |   1 +
> >  lib/librte_ring/rte_ring_elem.h      | 946 +++++++++++++++++++++++++++
> >  lib/librte_ring/rte_ring_version.map |   2 +
> >  6 files changed, 991 insertions(+), 9 deletions(-)
> >  create mode 100644 lib/librte_ring/rte_ring_elem.h
> 
> (...)
> 
> > +/* the actual enqueue of pointers on the ring.
> > + * Placed here since identical code needed in both
> > + * single and multi producer enqueue functions.
> > + */
> > +#define ENQUEUE_PTRS_ELEM(r, ring_start, prod_head, obj_table, esize, n) do { \
> > +	if (esize == 4) \
> > +		ENQUEUE_PTRS_32(r, ring_start, prod_head, obj_table, n); \
> > +	else if (esize == 8) \
> > +		ENQUEUE_PTRS_64(r, ring_start, prod_head, obj_table, n); \
> > +	else if (esize == 16) \
> > +		ENQUEUE_PTRS_128(r, ring_start, prod_head, obj_table, n); \
> > +} while (0)
> 
> My initial thinking was that these could be static inline functions instead of
> macros. I see that patches 5 and 6 change this. I wonder, however, whether
> patches 5 and 6 should be merged and moved before this one: that would avoid
> introducing new macros that are removed later.
Patches 2, 5 and 6 implement different methods of copying the elements. We can drop 5, as 6 proved to be better than 5 in my tests. The choice between 2 and 6 is still open. If we go with 2, I will convert the macros into inline functions.
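
As an illustration of the inline-function direction mentioned above, here is a minimal sketch. It is not code from any patch in this series: the names enqueue_elems, enqueue_elems_32/64/128 and struct elem128 are hypothetical, and the ring is modelled as a plain array whose size is a power of two (mask = size - 1).

```c
#include <assert.h>
#include <stdint.h>

/* 16-byte element; a hypothetical stand-in for a 128-bit slot. */
struct elem128 {
	uint64_t lo;
	uint64_t hi;
};

/* Copy n 4-byte elements into the ring, wrapping with the index mask. */
static inline void
enqueue_elems_32(uint32_t *ring, uint32_t mask, uint32_t head,
		 const uint32_t *obj, uint32_t n)
{
	uint32_t i;

	for (i = 0; i < n; i++)
		ring[(head + i) & mask] = obj[i];
}

/* Same copy for 8-byte elements. */
static inline void
enqueue_elems_64(uint64_t *ring, uint32_t mask, uint32_t head,
		 const uint64_t *obj, uint32_t n)
{
	uint32_t i;

	for (i = 0; i < n; i++)
		ring[(head + i) & mask] = obj[i];
}

/* Same copy for 16-byte elements. */
static inline void
enqueue_elems_128(struct elem128 *ring, uint32_t mask, uint32_t head,
		  const struct elem128 *obj, uint32_t n)
{
	uint32_t i;

	for (i = 0; i < n; i++)
		ring[(head + i) & mask] = obj[i];
}

/*
 * Inline function in place of the dispatch macro. When esize is a
 * compile-time constant at the call site, the compiler can typically
 * discard the branches that do not apply.
 */
static inline void
enqueue_elems(void *ring, uint32_t mask, uint32_t head,
	      const void *obj_table, uint32_t esize, uint32_t n)
{
	/* Only the three sizes handled by the quoted macro are accepted. */
	assert(esize == 4 || esize == 8 || esize == 16);

	if (esize == 4)
		enqueue_elems_32(ring, mask, head, obj_table, n);
	else if (esize == 8)
		enqueue_elems_64(ring, mask, head, obj_table, n);
	else
		enqueue_elems_128(ring, mask, head, obj_table, n);
}
```

Compared with the macro, the function form type-checks its arguments and evaluates each of them exactly once.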

> 
> (...)
> 
> > +/**
> > + * @internal Enqueue several objects on the ring
> > + *
> > + * @param r
> > + *   A pointer to the ring structure.
> > + * @param obj_table
> > + *   A pointer to a table of void * pointers (objects).
> > + * @param esize
> > + *   The size of ring element, in bytes. It must be a multiple of 4.
> > + *   Currently, sizes 4, 8 and 16 are supported. This should be the same
> > + *   as passed while creating the ring, otherwise the results are undefined.
> 
> The comments "It must be a multiple of 4" and "Currently, sizes 4, 8 and 16 are
> supported" are redundant (they appear several times in the file). The second one
> should be removed by patch 5 (I think that removal is missing?).
> 
> But if patches 5 and 6 are moved before this one, only "It must be a multiple of
> 4" would be needed, I think, and there would be no transition with only 3
> supported sizes.
(Refer to my comment above.) If 2 is chosen, I would like to remove the restriction to a limited set of sizes by adding a for loop around the 32b copy. The 64b and 128b copies will remain the same to match the existing performance.
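
To make the "for loop around the 32b copy" idea concrete, here is a minimal sketch under stated assumptions. It is not taken from the series: the name enqueue_elems_words is hypothetical, the ring is a flat array of 32-bit words, and the ring size (in elements) is a power of two with mask = size - 1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy n elements of esize bytes each into the ring, one 32-bit word
 * at a time. esize must be a non-zero multiple of 4. head is the
 * free-running producer index; the slot index is head & mask.
 */
static inline void
enqueue_elems_words(uint32_t *ring, uint32_t mask, uint32_t head,
		    const uint32_t *obj, uint32_t esize, uint32_t n)
{
	const uint32_t words = esize / sizeof(uint32_t);
	uint32_t i, w;

	assert(esize != 0 && esize % 4 == 0);

	for (i = 0; i < n; i++) {
		/* start of the destination slot, in 32-bit words */
		uint32_t *slot = ring + (size_t)((head + i) & mask) * words;

		/* inner loop: the 32b copy repeated for each word */
		for (w = 0; w < words; w++)
			slot[w] = obj[(size_t)i * words + w];
	}
}
```

With esize = 12 and a 4-element ring, enqueuing two elements at head = 3 fills words 9 to 11 and then wraps to words 0 to 2.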

* [dpdk-dev] [PATCH v7 00/17] lib/ring: APIs to support custom element size
  2019-09-06 19:05 ` [dpdk-dev] [PATCH v2 0/6] " Honnappa Nagarahalli
                     ` (11 preceding siblings ...)
  2019-10-21  0:22   ` [dpdk-dev] [RFC v6 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
@ 2019-12-20  4:45   ` Honnappa Nagarahalli
  2019-12-20  4:45     ` [dpdk-dev] [PATCH v7 01/17] test/ring: use division for cycle count calculation Honnappa Nagarahalli
                       ` (16 more replies)
  2020-01-13 17:25   ` [dpdk-dev] [PATCH v8 0/6] lib/ring: APIs to support custom element size Honnappa Nagarahalli
                     ` (2 subsequent siblings)
  15 siblings, 17 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-12-20  4:45 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu, nd

The current rte_ring hard-codes the type of the ring element to 'void *',
hence the size of the element is hard-coded to 32b/64b. Since the ring
element type is not an input to the rte_ring APIs, this results in a couple
of issues:

1) If an application needs to store an element which is not 64b, it
   needs to write its own ring APIs similar to the rte_event_ring APIs. This
   creates an additional burden on programmers, who end up making
   work-arounds and often waste memory.
2) If there are multiple libraries that store elements of the same
   type, currently they would have to write their own rte_ring APIs. This
   results in code duplication.

This patch set adds new APIs to support a configurable ring element size.
The APIs support custom element sizes by allowing the ring element size
to be defined as a multiple of 32b.
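
The memory saving can be seen from the size calculation. The following is a simplified, self-contained sketch modelled on the rte_ring_get_memsize_elem() change in patch 02/17; it is not the library code. The header size, cache line size and count limit are assumed placeholder values, the function name ring_memsize_elem is hypothetical, and this sketch rejects a count of 0.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define SKETCH_HDR_SIZE   128u        /* assumed; not sizeof(struct rte_ring) */
#define SKETCH_CACHE_LINE 64u         /* assumed cache line size */
#define SKETCH_SZ_MASK    0x7fffffffu /* assumed upper bound on count */

/* Bytes needed for a ring of count elements of esize bytes each. */
static int64_t
ring_memsize_elem(uint32_t esize, uint32_t count)
{
	uint64_t sz;

	/* the rounding below only works for a power-of-two line size */
	assert((SKETCH_CACHE_LINE & (SKETCH_CACHE_LINE - 1)) == 0);

	/* element size must be a multiple of 4 bytes (32b) */
	if (esize % 4 != 0)
		return -EINVAL;

	/* count must be a non-zero power of two within the limit */
	if (count == 0 || (count & (count - 1)) != 0 ||
	    count > SKETCH_SZ_MASK)
		return -EINVAL;

	sz = SKETCH_HDR_SIZE + (uint64_t)count * esize;
	/* round up to a whole number of cache lines */
	sz = (sz + SKETCH_CACHE_LINE - 1) & ~(uint64_t)(SKETCH_CACHE_LINE - 1);
	return (int64_t)sz;
}
```

Under these assumptions, a 1024-entry ring of 4-byte elements needs 4224 bytes, while the same ring with 8-byte pointer-sized elements needs 8320 bytes.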

The aim is to achieve the same performance as the existing ring
implementation.

The changes to the test cases are significant. Patches 3/17 to 15/17
are split out to help with the review; otherwise, they can be squashed
into a single commit.

v7
 - Merged the test cases to test both legacy APIs and rte_ring_xxx_elem APIs
   without code duplication (Konstantin, Olivier)
 - Performance test cases are merged as well (Konstantin, Olivier)
 - Macros to copy elements are converted into inline functions (Olivier)
 - Added back the changes to hash and event libraries

v6
 - Labelled as RFC to better indicate the status
 - Added unit tests to test the rte_ring_xxx_elem APIs
 - Corrected 'macro based partial memcpy' (5/6) patch
 - Added Konstantin's method after correction (6/6)
 - Check Patch shows significant warnings and errors, mainly due to
   copying code from existing test cases. None of them are harmful.
   I will fix them once we have an agreement.

v5
 - Use memcpy for chunks of 32B (Konstantin).
 - Both 'ring_perf_autotest' and 'ring_perf_elem_autotest' are available
   to compare the results easily.
 - Copying without memcpy is also available in 1/3, if anyone wants to
   experiment on their platform.
 - Added other platform owners to test on their respective platforms.

v4
 - Few fixes after more performance testing

v3
 - Removed macro-fest and used inline functions
   (Stephen, Bruce)

v2
 - Change Event Ring implementation to use ring templates
   (Jerin, Pavan)

Honnappa Nagarahalli (17):
  test/ring: use division for cycle count calculation
  lib/ring: apis to support configurable element size
  test/ring: add functional tests for rte_ring_xxx_elem APIs
  test/ring: test burst APIs with random empty-full test case
  test/ring: add default, single element test cases
  test/ring: rte_ring_xxx_elem test cases for exact size ring
  test/ring: negative test cases for rte_ring_xxx_elem APIs
  test/ring: remove duplicate test cases
  test/ring: removed unused variable synchro
  test/ring: modify single element enq/deq perf test cases
  test/ring: modify burst enq/deq perf test cases
  test/ring: modify bulk enq/deq perf test cases
  test/ring: modify bulk empty deq perf test cases
  test/ring: modify multi-lcore perf test cases
  test/ring: adjust run-on-all-cores perf test cases
  lib/hash: use ring with 32b element size to save memory
  lib/eventdev: use custom element size ring for event rings

 app/test/test_ring.c                 | 1227 +++++++++++---------------
 app/test/test_ring.h                 |  203 +++++
 app/test/test_ring_perf.c            |  434 +++++----
 lib/librte_eventdev/rte_event_ring.c |  147 +--
 lib/librte_eventdev/rte_event_ring.h |   45 +-
 lib/librte_hash/rte_cuckoo_hash.c    |   97 +-
 lib/librte_hash/rte_cuckoo_hash.h    |    2 +-
 lib/librte_ring/Makefile             |    3 +-
 lib/librte_ring/meson.build          |    4 +
 lib/librte_ring/rte_ring.c           |   41 +-
 lib/librte_ring/rte_ring.h           |    1 +
 lib/librte_ring/rte_ring_elem.h      | 1002 +++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |    2 +
 13 files changed, 2102 insertions(+), 1106 deletions(-)
 create mode 100644 app/test/test_ring.h
 create mode 100644 lib/librte_ring/rte_ring_elem.h

-- 
2.17.1


* [dpdk-dev] [PATCH v7 01/17] test/ring: use division for cycle count calculation
  2019-12-20  4:45   ` [dpdk-dev] [PATCH v7 00/17] " Honnappa Nagarahalli
@ 2019-12-20  4:45     ` Honnappa Nagarahalli
  2019-12-20  4:45     ` [dpdk-dev] [PATCH v7 02/17] lib/ring: apis to support configurable element size Honnappa Nagarahalli
                       ` (15 subsequent siblings)
  16 siblings, 0 replies; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-12-20  4:45 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu, nd

Use floating-point division instead of a right shift to calculate a
more accurate cycle count.
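
The effect can be illustrated with a small standalone sketch contrasting the removed shift-based average with the division-based one. The helper names avg_cycles_shift and avg_cycles_div are hypothetical, and the sketch assumes the iteration count equals 1 << iter_shift, as the replaced expressions in the diff below suggest.

```c
#include <assert.h>
#include <stdint.h>

/* Old style: the integer shift drops the fractional part. */
static inline uint64_t
avg_cycles_shift(uint64_t start, uint64_t end, unsigned int iter_shift)
{
	return (end - start) >> iter_shift;
}

/* New style: floating-point division keeps the fraction. */
static inline double
avg_cycles_div(uint64_t start, uint64_t end, uint64_t iterations)
{
	return (double)(end - start) / iterations;
}
```

For 1536 elapsed cycles over 1024 iterations, the shift reports 1 cycle per iteration while the division reports 1.5, which matters when the per-operation cost is only a few cycles.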

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Acked-by: Olivier Matz <olivier.matz@6wind.com>
---
 app/test/test_ring_perf.c | 22 ++++++++++++----------
 1 file changed, 12 insertions(+), 10 deletions(-)

diff --git a/app/test/test_ring_perf.c b/app/test/test_ring_perf.c
index 70ee46ffe..6c2aca483 100644
--- a/app/test/test_ring_perf.c
+++ b/app/test/test_ring_perf.c
@@ -357,10 +357,10 @@ test_single_enqueue_dequeue(struct rte_ring *r)
 	}
 	const uint64_t mc_end = rte_rdtsc();
 
-	printf("SP/SC single enq/dequeue: %"PRIu64"\n",
-			(sc_end-sc_start) >> iter_shift);
-	printf("MP/MC single enq/dequeue: %"PRIu64"\n",
-			(mc_end-mc_start) >> iter_shift);
+	printf("SP/SC single enq/dequeue: %.2F\n",
+			((double)(sc_end-sc_start)) / iterations);
+	printf("MP/MC single enq/dequeue: %.2F\n",
+			((double)(mc_end-mc_start)) / iterations);
 }
 
 /*
@@ -395,13 +395,15 @@ test_burst_enqueue_dequeue(struct rte_ring *r)
 		}
 		const uint64_t mc_end = rte_rdtsc();
 
-		uint64_t mc_avg = ((mc_end-mc_start) >> iter_shift) / bulk_sizes[sz];
-		uint64_t sc_avg = ((sc_end-sc_start) >> iter_shift) / bulk_sizes[sz];
+		double mc_avg = ((double)(mc_end-mc_start) / iterations) /
+					bulk_sizes[sz];
+		double sc_avg = ((double)(sc_end-sc_start) / iterations) /
+					bulk_sizes[sz];
 
-		printf("SP/SC burst enq/dequeue (size: %u): %"PRIu64"\n", bulk_sizes[sz],
-				sc_avg);
-		printf("MP/MC burst enq/dequeue (size: %u): %"PRIu64"\n", bulk_sizes[sz],
-				mc_avg);
+		printf("SP/SC burst enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], sc_avg);
+		printf("MP/MC burst enq/dequeue (size: %u): %.2F\n",
+				bulk_sizes[sz], mc_avg);
 	}
 }
 
-- 
2.17.1


* [dpdk-dev] [PATCH v7 02/17] lib/ring: apis to support configurable element size
  2019-12-20  4:45   ` [dpdk-dev] [PATCH v7 00/17] " Honnappa Nagarahalli
  2019-12-20  4:45     ` [dpdk-dev] [PATCH v7 01/17] test/ring: use division for cycle count calculation Honnappa Nagarahalli
@ 2019-12-20  4:45     ` Honnappa Nagarahalli
  2020-01-02 16:42       ` Ananyev, Konstantin
  2019-12-20  4:45     ` [dpdk-dev] [PATCH v7 03/17] test/ring: add functional tests for rte_ring_xxx_elem APIs Honnappa Nagarahalli
                       ` (14 subsequent siblings)
  16 siblings, 1 reply; 173+ messages in thread
From: Honnappa Nagarahalli @ 2019-12-20  4:45 UTC (permalink / raw)
  To: olivier.matz, sthemmin, jerinj, bruce.richardson, david.marchand,
	pbhagavatula, konstantin.ananyev, honnappa.nagarahalli
  Cc: dev, dharmik.thakkar, ruifeng.wang, gavin.hu, nd

Current APIs assume ring elements to be pointers. However, in many
use cases, the size can be different. Add new APIs to support
configurable ring element sizes.

Signed-off-by: Honnappa Nagarahalli <honnappa.nagarahalli@arm.com>
Reviewed-by: Dharmik Thakkar <dharmik.thakkar@arm.com>
Reviewed-by: Gavin Hu <gavin.hu@arm.com>
Reviewed-by: Ruifeng Wang <ruifeng.wang@arm.com>
---
 lib/librte_ring/Makefile             |    3 +-
 lib/librte_ring/meson.build          |    4 +
 lib/librte_ring/rte_ring.c           |   41 +-
 lib/librte_ring/rte_ring.h           |    1 +
 lib/librte_ring/rte_ring_elem.h      | 1002 ++++++++++++++++++++++++++
 lib/librte_ring/rte_ring_version.map |    2 +
 6 files changed, 1044 insertions(+), 9 deletions(-)
 create mode 100644 lib/librte_ring/rte_ring_elem.h

diff --git a/lib/librte_ring/Makefile b/lib/librte_ring/Makefile
index 22454b084..917c560ad 100644
--- a/lib/librte_ring/Makefile
+++ b/lib/librte_ring/Makefile
@@ -6,7 +6,7 @@ include $(RTE_SDK)/mk/rte.vars.mk
 # library name
 LIB = librte_ring.a
 
-CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3
+CFLAGS += $(WERROR_FLAGS) -I$(SRCDIR) -O3 -DALLOW_EXPERIMENTAL_API
 LDLIBS += -lrte_eal
 
 EXPORT_MAP := rte_ring_version.map
@@ -16,6 +16,7 @@ SRCS-$(CONFIG_RTE_LIBRTE_RING) := rte_ring.c
 
 # install includes
 SYMLINK-$(CONFIG_RTE_LIBRTE_RING)-include := rte_ring.h \
+					rte_ring_elem.h \
 					rte_ring_generic.h \
 					rte_ring_c11_mem.h
 
diff --git a/lib/librte_ring/meson.build b/lib/librte_ring/meson.build
index ca8a435e9..f2f3ccc88 100644
--- a/lib/librte_ring/meson.build
+++ b/lib/librte_ring/meson.build
@@ -3,5 +3,9 @@
 
 sources = files('rte_ring.c')
 headers = files('rte_ring.h',
+		'rte_ring_elem.h',
 		'rte_ring_c11_mem.h',
 		'rte_ring_generic.h')
+
+# rte_ring_create_elem and rte_ring_get_memsize_elem are experimental
+allow_experimental_apis = true
diff --git a/lib/librte_ring/rte_ring.c b/lib/librte_ring/rte_ring.c
index d9b308036..3e15dc398 100644
--- a/lib/librte_ring/rte_ring.c
+++ b/lib/librte_ring/rte_ring.c
@@ -33,6 +33,7 @@
 #include <rte_tailq.h>
 
 #include "rte_ring.h"
+#include "rte_ring_elem.h"
 
 TAILQ_HEAD(rte_ring_list, rte_tailq_entry);
 
@@ -46,23 +47,38 @@ EAL_REGISTER_TAILQ(rte_ring_tailq)
 
 /* return the size of memory occupied by a ring */
 ssize_t
-rte_ring_get_memsize(unsigned count)
+rte_ring_get_memsize_elem(unsigned int esize, unsigned int count)
 {
 	ssize_t sz;
 
+	/* Check if element size is a multiple of 4B */
+	if (esize % 4 != 0) {
+		RTE_LOG(ERR, RING, "element size is not a multiple of 4\n");
+
+		return -EINVAL;
+	}
+
 	/* count must be a power of 2 */
 	if ((!POWEROF2(count)) || (count > RTE_RING_SZ_MASK )) {
 		RTE_LOG(ERR, RING,
-			"Requested size is invalid, must be power of 2, and "
-			"do not exceed the size limit %u\n", RTE_RING_SZ_MASK);
+			"Requested number of elements is invalid, must be power of 2, and not exceed %u\n",
+			RTE_RING_SZ_MASK);
+
 		return -EINVAL;
 	}
 
-	sz = sizeof(struct rte_ring) + count * sizeof(void *);
+	sz = sizeof(struct rte_ring) + count * esize;
 	sz = RTE_ALIGN(sz, RTE_CACHE_LINE_SIZE);
 	return sz;
 }
 
+/* return the size of memory occupied by a ring */
+ssize_t
+rte_ring_get_memsize(unsigned count)
+{
+	return rte_ring_get_memsize_elem(sizeof(void *), count);
+}
+
 void
 rte_ring_reset(struct rte_ring *r)
 {
@@ -114,10 +130,10 @@ rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
 	return 0;
 }
 
-/* create the ring */
+/* create the ring for a given element size */
 struct rte_ring *
-rte_ring_create(const char *name, unsigned count, int socket_id,
-		unsigned flags)
+rte_ring_create_elem(const char *name, unsigned int esize, unsigned int count,
+		int socket_id, unsigned int flags)
 {
 	char mz_name[RTE_MEMZONE_NAMESIZE];
 	struct rte_ring *r;
@@ -135,7 +151,7 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	if (flags & RING_F_EXACT_SZ)
 		count = rte_align32pow2(count + 1);
 
-	ring_size = rte_ring_get_memsize(count);
+	ring_size = rte_ring_get_memsize_elem(esize, count);
 	if (ring_size < 0) {
 		rte_errno = ring_size;
 		return NULL;
@@ -182,6 +198,15 @@ rte_ring_create(const char *name, unsigned count, int socket_id,
 	return r;
 }
 
+/* create the ring */
+struct rte_ring *
+rte_ring_create(const char *name, unsigned count, int socket_id,
+		unsigned flags)
+{
+	return rte_ring_create_elem(name, sizeof(void *), count, socket_id,
+		flags);
+}
+
 /* free the ring */
 void
 rte_ring_free(struct rte_ring *r)
diff --git a/lib/librte_ring/rte_ring.h b/lib/librte_ring/rte_ring.h
index 2a9f768a1..18fc5d845 100644
--- a/lib/librte_ring/rte_ring.h
+++ b/lib/librte_ring/rte_ring.h
@@ -216,6 +216,7 @@ int rte_ring_init(struct rte_ring *r, const char *name, unsigned count,
  */
 struct rte_ring *rte_ring_create(const char *name, unsigned count,
 				 int socket_id, unsigned flags);
+
 /**
  * De-allocate all memory used by the ring.
  *
diff --git a/lib/librte_ring/rte_ring_elem.h b/lib/librte_ring/rte_ring_elem.h
new file mode 100644
index 000000000..fc7fe127c
--- /dev/null
+++ b/lib/librte_ring/rte_ring_elem.h
@@ -0,0 +1,1002 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ *
+ * Copyright (c) 2019 Arm Limited
+ * Copyright (c) 2010-2017 Intel Corporation
+ * Copyright (c) 2007-2009 Kip Macy kmacy@freebsd.org
+ * All rights reserved.
+ * Derived from FreeBSD's bufring.h
+ * Used as BSD-3 Licensed with permission from Kip Macy.
+ */
+
+#ifndef _RTE_RING_ELEM_H_
+#define _RTE_RING_ELEM_H_
+
+/**
+ * @file
+ * RTE Ring with user defined element size
+ */
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+#include <stdio.h>
+#include <stdint.h>
+#include <sys/queue.h>
+#include <errno.h>
+#include <rte_common.h>
+#include <rte_config.h>
+#include <rte_memory.h>
+#include <rte_lcore.h>
+#include <rte_atomic.h>
+#include <rte_branch_prediction.h>
+#include <rte_memzone.h>
+#include <rte_pause.h>
+
+#include "rte_ring.h"
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Calculate the memory size needed for a ring with given element size
+ *
+ * This function returns the number of bytes needed for a ring, given
+ * the number of elements in it and the size of the element. This value
+ * is the sum of the size of the structure rte_ring and the size of the
+ * memory needed for storing the elements. The value is aligned to a cache
+ * line size.
+ *
+ * @param esize
+ *   The size of ring element, in bytes. It must be a multiple of 4.
+ * @param count
+ *   The number of elements in the ring (must be a power of 2).
+ * @return
+ *   - The memory size needed for the ring on success.
+ *   - -EINVAL - esize is not a multiple of 4 or count provided is not a
+ *		 power of 2.
+ */
+__rte_experimental
+ssize_t rte_ring_get_memsize_elem(unsigned in