* [RFC 0/2] Add high-performance timer facility
@ 2023-02-28 9:39 Mattias Rönnblom
From: Mattias Rönnblom @ 2023-02-28 9:39 UTC (permalink / raw)
To: dev
Cc: Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Mattias Rönnblom
This patchset is an attempt to introduce a high-performance, highly
scalable timer facility into DPDK.
More specifically, the goals for the htimer library are:
* Efficient handling of a handful up to hundreds of thousands of
concurrent timers.
* Reduced overhead of adding and canceling timers.
* Provide a service functionally equivalent to that of
<rte_timer.h>. API/ABI backward compatibility is secondary.
In the author's opinion, there are two main shortcomings with the
current DPDK timer library (i.e., rte_timer.[ch]).
One is the synchronization overhead, where heavy-weight, full-barrier
type synchronization is used. rte_timer.c keeps per-EAL/lcore skip
lists, but any thread may add, cancel, or otherwise access timers
managed by another lcore (and thus residing in that lcore's skip list).
The other is algorithmic: rte_timer.c relies on a skip list, which
appears to be less efficient than certain alternatives.
This patchset implements a hierarchical timer wheel (HWT, in
rte_htw.c), as per the Varghese and Lauck paper "Hashed and
Hierarchical Timing Wheels: Data Structures for the Efficient
Implementation of a Timer Facility". An HWT is a data structure
purpose-built for this task, and is used by many operating system
kernel timer facilities.
To further improve on the solution described by Varghese and Lauck, a
bitset is placed in front of each of the timer wheels in the HWT,
reducing the overhead of rte_htimer_mgr_manage() (i.e., progressing
time and performing expiry processing).
Cycle-efficient scanning and manipulation of these bitsets are crucial
for the HWT's performance.
The htimer module keeps a per-lcore (or per-registered EAL thread) HWT
instance, much like rte_timer.c keeps a per-lcore skip list.
To avoid expensive synchronization overhead for thread-local timer
management, the HWTs are accessed only from the "owning" thread. Any
interaction any other thread has with a particular lcore's timer
wheel goes over a set of DPDK rings. A side-effect of this design is
that all operations working toward a "remote" HWT must be
asynchronous.
The <rte_htimer.h> API is available only to EAL threads and registered
non-EAL threads.
The htimer API allows the application to supply the current time,
useful in case it already has retrieved this for other purposes,
saving the cost of a rdtsc instruction (or its equivalent).
Adding a timer with a relative timeout does not retrieve a new time;
instead, the current time (as known at the time of the last manage
call) is reused, again shaving off some cycles of overhead.
A semantic improvement over the <rte_timer.h> API is that the htimer
library can give a definite answer to the question of whether the
timer expiry callback was called, after a timer has been canceled.
Below is performance data from DPDK's 'app/test' micro benchmarks,
using 10k concurrent timers. The benchmarks (test_timer_perf.c and
test_htimer_mgr_perf.c) aren't identical in their structure, but the
numbers give some indication of the difference.

Use case                  htimer  timer
---------------------------------------
Add timer                     28    253
Cancel timer                  10    412
Async add (source lcore)      64      -
Async add (target lcore)      13      -

(AMD 5900X CPU. Time in TSC cycles. '-' = no corresponding benchmark.)
Prototype integration of the htimer library into real, timer-heavy,
applications indicates that htimer may result in significant
application-level performance gains.
The bitset implementation on which the HWT implementation depends
seemed generic enough, and potentially useful outside the world of
HWTs, to justify being located in the EAL.
This patchset is very much an RFC, and the author is yet to form an
opinion on many important issues.
* If deemed a suitable replacement, should the htimer replace the
current DPDK timer library in some particular (ABI-breaking)
release, or should it live side-by-side with the then-legacy
<rte_timer.h> API? A lot of things in and outside DPDK depend on
<rte_timer.h>, so coexistence may be required to facilitate a smooth
transition.
* Should the htimer and htw-related files be colocated with rte_timer.c
in the timer library?
* Would it be useful for applications using asynchronous cancel to
have the option of having the timer callback run not only in case of
timer expiration, but also cancellation (on the target lcore)? The
timer cb signature would need to include an additional parameter in
that case.
* Should rte_htimer be a nested struct, so that the htw parts are
separated from the htimer parts?
* <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
<rte_htw.h> may avoid a dependency on <rte_htimer_mgr.h>. Should it
remain so?
* The rte_htimer struct is only supposed to be used by the application
to give an indication of how much memory it needs to allocate, and
its members are not supposed to be accessed directly (with the
possible exception of the owner_lcore_id field). Should there be a
dummy struct, a #define RTE_HTIMER_MEMSIZE, or a
rte_htimer_get_memsize() function instead, serving the same purpose?
That would give better encapsulation, but be more inconvenient for
applications. Run-time dynamic sizing would force application-level
dynamic allocations.
* Asynchronous cancellation is a little tricky to use for the
application (primarily due to timer memory reclamation/race
issues). Should this functionality be removed?
* Should rte_htimer_mgr_init() also retrieve the current time? If so,
there needs to be a variant which allows the user to specify the
time (to match rte_htimer_mgr_manage_time()). One pitfall with the
current proposed API is an application calling rte_htimer_mgr_init()
and then immediately adding a timer with a relative timeout, in
which case the current absolute time used is 0, which might be a
surprise.
* Should libdivide (optionally) be used to avoid the division in the
TSC -> tick conversion? (It doesn't improve performance on Zen 3,
but may do so on other CPUs.) Consider <rte_reciprocal.h> as well.
* Should the TSC-per-tick be rounded up to a power of 2, so shifts can be
used for conversion? Very minor performance gains to be found there,
at least on Zen 3 cores.
* Should it be possible to supply the time in rte_htimer_mgr_add()
and/or rte_htimer_mgr_manage_time() functions as ticks, rather than
as TSC? Should it be possible to also use nanoseconds?
rte_htimer_mgr_manage_time() would need a flags parameter in that
case.
* Would the event timer adapter be best off using <rte_htw.h>
directly, or <rte_htimer.h>? In the latter case, there needs to be a
way to instantiate more HWTs (similar to the "alt" functions of
<rte_timer.h>).
* Should the PERIODICAL flag (and the complexity it brings) be
removed, leaving the application with only single-shot timers and
the option to re-add them in the timer callback?
* Should the async result codes and the sync cancel error codes be merged
into one set of result codes?
* Should rte_htimer_mgr_async_add() have a flag which allows
buffering add request messages until rte_htimer_mgr_process() (or
any manage function) is called? That would reduce ring signaling
overhead (i.e., allow burst enqueue operations instead of
single-element enqueues). The same "problem" could also be solved a
different way, with a rte_htimer_mgr_async_add_burst() function.
(The signature of such a function would not be pretty.)
* Does the functionality provided by the rte_htimer_mgr_process()
function match its use cases? Should there be a clearer separation
between expiry processing and asynchronous operation processing?
* Should the patchset be split into more commits? If so, how?
Thanks to Erik Carrillo for his assistance.
Mattias Rönnblom (2):
eal: add bitset type
eal: add high-performance timer facility
app/test/meson.build | 10 +-
app/test/test_bitset.c | 646 +++++++++++++++++++++++
app/test/test_htimer_mgr.c | 674 ++++++++++++++++++++++++
app/test/test_htimer_mgr_perf.c | 324 ++++++++++++
app/test/test_htw.c | 478 +++++++++++++++++
app/test/test_htw_perf.c | 181 +++++++
doc/api/doxy-api-index.md | 5 +-
doc/api/doxy-api.conf.in | 1 +
lib/eal/common/meson.build | 1 +
lib/eal/common/rte_bitset.c | 29 +
lib/eal/include/meson.build | 1 +
lib/eal/include/rte_bitset.h | 878 +++++++++++++++++++++++++++++++
lib/eal/version.map | 3 +
lib/htimer/meson.build | 7 +
lib/htimer/rte_htimer.h | 65 +++
lib/htimer/rte_htimer_mgr.c | 488 +++++++++++++++++
lib/htimer/rte_htimer_mgr.h | 497 +++++++++++++++++
lib/htimer/rte_htimer_msg.h | 44 ++
lib/htimer/rte_htimer_msg_ring.c | 18 +
lib/htimer/rte_htimer_msg_ring.h | 49 ++
lib/htimer/rte_htw.c | 437 +++++++++++++++
lib/htimer/rte_htw.h | 49 ++
lib/htimer/version.map | 17 +
lib/meson.build | 1 +
24 files changed, 4901 insertions(+), 2 deletions(-)
create mode 100644 app/test/test_bitset.c
create mode 100644 app/test/test_htimer_mgr.c
create mode 100644 app/test/test_htimer_mgr_perf.c
create mode 100644 app/test/test_htw.c
create mode 100644 app/test/test_htw_perf.c
create mode 100644 lib/eal/common/rte_bitset.c
create mode 100644 lib/eal/include/rte_bitset.h
create mode 100644 lib/htimer/meson.build
create mode 100644 lib/htimer/rte_htimer.h
create mode 100644 lib/htimer/rte_htimer_mgr.c
create mode 100644 lib/htimer/rte_htimer_mgr.h
create mode 100644 lib/htimer/rte_htimer_msg.h
create mode 100644 lib/htimer/rte_htimer_msg_ring.c
create mode 100644 lib/htimer/rte_htimer_msg_ring.h
create mode 100644 lib/htimer/rte_htw.c
create mode 100644 lib/htimer/rte_htw.h
create mode 100644 lib/htimer/version.map
--
2.34.1
* [RFC 1/2] eal: add bitset type
From: Mattias Rönnblom @ 2023-02-28 9:39 UTC (permalink / raw)
To: dev
Cc: Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Mattias Rönnblom
Introduce a set of functions and macros that operate on sets of bits,
kept in arrays of 64-bit elements.
RTE bitset is designed for bitsets which are larger than what fits in
a single machine word (i.e., 64 bits). For very large bitsets, the
<rte_bitmap.h> API may be a more appropriate choice.
Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
app/test/meson.build | 4 +-
app/test/test_bitset.c | 646 ++++++++++++++++++++++++++
lib/eal/common/meson.build | 1 +
lib/eal/common/rte_bitset.c | 29 ++
lib/eal/include/meson.build | 1 +
lib/eal/include/rte_bitset.h | 878 +++++++++++++++++++++++++++++++++++
lib/eal/version.map | 3 +
7 files changed, 1561 insertions(+), 1 deletion(-)
create mode 100644 app/test/test_bitset.c
create mode 100644 lib/eal/common/rte_bitset.c
create mode 100644 lib/eal/include/rte_bitset.h
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..03811ff692 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -13,8 +13,9 @@ test_sources = files(
'test_alarm.c',
'test_atomic.c',
'test_barrier.c',
- 'test_bitops.c',
'test_bitmap.c',
+ 'test_bitset.c',
+ 'test_bitops.c',
'test_bpf.c',
'test_byteorder.c',
'test_cksum.c',
@@ -164,6 +165,7 @@ fast_tests = [
['bpf_autotest', true, true],
['bpf_convert_autotest', true, true],
['bitops_autotest', true, true],
+ ['bitset_autotest', true, true],
['byteorder_autotest', true, true],
['cksum_autotest', true, true],
['cmdline_autotest', true, true],
diff --git a/app/test/test_bitset.c b/app/test/test_bitset.c
new file mode 100644
index 0000000000..504363e86e
--- /dev/null
+++ b/app/test/test_bitset.c
@@ -0,0 +1,646 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_random.h>
+
+#include <rte_bitset.h>
+
+#include "test.h"
+
+#define MAGIC UINT64_C(0xdeadbeefdeadbeef)
+
+static void
+rand_buf(void *buf, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++)
+ ((char *)buf)[i] = (char)rte_rand();
+}
+
+static uint64_t *
+alloc_bitset(size_t size)
+{
+ uint64_t *p;
+
+ p = malloc(RTE_BITSET_SIZE(size) + 2 * sizeof(uint64_t));
+
+ if (p == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ rand_buf(&p[0], RTE_BITSET_SIZE(size));
+
+ p[0] = MAGIC;
+ p[RTE_BITSET_NUM_WORDS(size) + 1] = MAGIC;
+
+ return p + 1;
+}
+
+
+static int
+free_bitset(uint64_t *bitset, size_t size)
+{
+ uint64_t *p;
+
+ p = bitset - 1;
+
+ if (p[0] != MAGIC)
+ return TEST_FAILED;
+
+ if (p[RTE_BITSET_NUM_WORDS(size) + 1] != MAGIC)
+ return TEST_FAILED;
+
+ free(p);
+
+ return TEST_SUCCESS;
+}
+
+static bool
+rand_bool(void)
+{
+ return rte_rand_max(2);
+}
+
+static void
+rand_bool_ary(bool *ary, size_t len)
+{
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ ary[i] = rand_bool();
+}
+
+static int
+test_test_set_size(size_t size)
+{
+ size_t i;
+ bool reference[size];
+ uint64_t *bitset;
+
+ rand_bool_ary(reference, size);
+
+ bitset = alloc_bitset(size);
+
+ if (bitset == NULL)
+ return TEST_FAILED;
+
+ rte_bitset_init(bitset, size);
+
+ for (i = 0; i < size; i++) {
+ if (reference[i])
+ rte_bitset_set(bitset, i);
+ else
+ rte_bitset_clear(bitset, i);
+ }
+
+ for (i = 0; i < size; i++)
+ if (reference[i] != rte_bitset_test(bitset, i))
+ return TEST_FAILED;
+
+ if (free_bitset(bitset, size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+#define RAND_ITERATIONS (10000)
+#define RAND_SET_MAX_SIZE (1000)
+
+static int
+test_test_set(void)
+{
+ size_t i;
+
+ for (i = 0; i < RAND_ITERATIONS; i++) {
+ size_t size = 1 + rte_rand_max(RAND_SET_MAX_SIZE - 1);
+
+ if (test_test_set_size(size) != TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+static ssize_t
+find(const bool *ary, size_t num_bools, size_t start, size_t len, bool set)
+{
+ size_t i;
+
+ for (i = 0; i < len; i++) {
+ ssize_t idx = (start + i) % num_bools;
+
+ if (ary[idx] == set)
+ return idx;
+ }
+
+ return -1;
+}
+
+static ssize_t
+find_set(const bool *ary, size_t num_bools, size_t start, size_t len)
+{
+ return find(ary, num_bools, start, len, true);
+}
+
+static ssize_t
+find_clear(const bool *ary, size_t num_bools, size_t start, size_t len)
+{
+ return find(ary, num_bools, start, len, false);
+}
+
+#define FFS_ITERATIONS (100)
+
+static int
+test_find_size(size_t size, bool set)
+{
+ uint64_t *bitset;
+ bool reference[size];
+ size_t i;
+
+ bitset = alloc_bitset(size);
+
+ if (bitset == NULL)
+ return TEST_FAILED;
+
+ rte_bitset_init(bitset, size);
+
+ for (i = 0; i < size; i++) {
+ bool bit = rand_bool();
+ reference[i] = bit;
+
+ if (bit)
+ rte_bitset_set(bitset, i);
+ else /* redundant, still useful for testing */
+ rte_bitset_clear(bitset, i);
+ }
+
+ for (i = 0; i < FFS_ITERATIONS; i++) {
+ size_t start_bit = rte_rand_max(size);
+ size_t len = rte_rand_max(size + 1);
+ bool full_range = len == size && start_bit == 0;
+ bool wraps = start_bit + len > size;
+ ssize_t rc;
+
+ if (set) {
+ if (full_range && rand_bool())
+ rc = rte_bitset_find_first_set(bitset,
+ size);
+ else if (wraps || rand_bool()) {
+ rc = rte_bitset_find_set_wrap(bitset, size,
+ start_bit, len);
+
+ } else
+ rc = rte_bitset_find_set(bitset, size,
+ start_bit, len);
+
+ if (rc != find_set(reference, size, start_bit,
+ len))
+ return TEST_FAILED;
+ } else {
+ if (full_range && rand_bool())
+ rc = rte_bitset_find_first_clear(bitset,
+ size);
+ else if (wraps || rand_bool())
+ rc = rte_bitset_find_clear_wrap(bitset,
+ size,
+ start_bit, len);
+ else
+ rc = rte_bitset_find_clear(bitset, size,
+ start_bit, len);
+
+ if (rc != find_clear(reference, size, start_bit,
+ len))
+ return TEST_FAILED;
+ }
+
+ }
+
+ if (free_bitset(bitset, size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_find_set_size(size_t size)
+{
+ return test_find_size(size, true);
+}
+
+static int
+test_find_clear_size(size_t size)
+{
+ return test_find_size(size, false);
+}
+
+static int
+test_find(void)
+{
+ size_t i;
+
+ for (i = 0; i < RAND_ITERATIONS; i++) {
+ size_t size = 2 + rte_rand_max(RAND_SET_MAX_SIZE - 2);
+
+ if (test_find_set_size(size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_find_clear_size(size) != TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+static int
+record_match(ssize_t match_idx, size_t size, int *calls)
+{
+ if (match_idx < 0 || (size_t)match_idx >= size)
+ return TEST_FAILED;
+
+ calls[match_idx]++;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_foreach_size(ssize_t size, bool may_wrap, bool set)
+{
+ bool reference[size];
+ int calls[size];
+ uint64_t *bitset;
+ ssize_t i;
+ ssize_t start_bit;
+ ssize_t len;
+ bool full_range;
+ size_t total_calls = 0;
+
+ rand_bool_ary(reference, size);
+
+ bitset = alloc_bitset(size);
+
+ if (bitset == NULL)
+ return TEST_FAILED;
+
+ memset(calls, 0, sizeof(calls));
+
+ start_bit = rte_rand_max(size);
+ len = may_wrap ? rte_rand_max(size + 1) :
+ rte_rand_max(size - start_bit + 1);
+
+ rte_bitset_init(bitset, size);
+
+ /* random data in the unused bits should not matter */
+ rand_buf(bitset, RTE_BITSET_SIZE(size));
+
+ for (i = start_bit; i < start_bit + len; i++) {
+ size_t idx = i % size;
+
+ if (reference[idx])
+ rte_bitset_set(bitset, idx);
+ else
+ rte_bitset_clear(bitset, idx);
+
+ if (rte_bitset_test(bitset, idx) != reference[idx])
+ return TEST_FAILED;
+ }
+
+ full_range = (len == size && start_bit == 0);
+
+ /* XXX: verify iteration order as well */
+ if (set) {
+ if (full_range && rand_bool()) {
+ RTE_BITSET_FOREACH_SET(i, bitset, size) {
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+ } else if (may_wrap) {
+ RTE_BITSET_FOREACH_SET_WRAP(i, bitset, size,
+ start_bit, len) {
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+ } else {
+ RTE_BITSET_FOREACH_SET_RANGE(i, bitset, size,
+ start_bit, len) {
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+ }
+ } else {
+ if (full_range && rand_bool()) {
+ RTE_BITSET_FOREACH_CLEAR(i, bitset, size)
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ } else if (may_wrap) {
+ RTE_BITSET_FOREACH_CLEAR_WRAP(i, bitset, size,
+ start_bit, len) {
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+ } else {
+ RTE_BITSET_FOREACH_CLEAR_RANGE(i, bitset, size,
+ start_bit, len)
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+ }
+
+ for (i = 0; i < len; i++) {
+ size_t idx = (start_bit + i) % size;
+
+ if (reference[idx] == set && calls[idx] != 1) {
+ printf("bit %zu should have been found exactly "
+ "once, not %d times\n", idx, calls[idx]);
+ return TEST_FAILED;
+ }
+
+ if (reference[idx] != set && calls[idx] != 0) {
+ printf("bit %zu shouldn't have been found\n", idx);
+ return TEST_FAILED;
+ }
+
+ total_calls += calls[idx];
+ }
+
+ if (full_range) {
+ size_t count;
+
+ count = set ? rte_bitset_count_set(bitset, size) :
+ rte_bitset_count_clear(bitset, size);
+
+ if (count != total_calls)
+ return TEST_FAILED;
+ }
+
+ if (free_bitset(bitset, size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_foreach(void)
+{
+ size_t i;
+
+ for (i = 0; i < RAND_ITERATIONS; i++) {
+ size_t size = 1 + rte_rand_max(RAND_SET_MAX_SIZE - 1);
+
+ if (test_foreach_size(size, false, true) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_foreach_size(size, false, false) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_foreach_size(size, true, true) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_foreach_size(size, true, false) != TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_count_size(size_t size)
+{
+ uint64_t *bitset;
+
+ bitset = alloc_bitset(size);
+
+ if (bitset == NULL)
+ return TEST_FAILED;
+
+ rte_bitset_init(bitset, size);
+
+ if (rte_bitset_count_set(bitset, size) != 0)
+ return TEST_FAILED;
+
+ if (rte_bitset_count_clear(bitset, size) != size)
+ return TEST_FAILED;
+
+ rte_bitset_set_all(bitset, size);
+
+ if (rte_bitset_count_set(bitset, size) != size)
+ return TEST_FAILED;
+
+ if (rte_bitset_count_clear(bitset, size) != 0)
+ return TEST_FAILED;
+
+ rte_bitset_clear_all(bitset, size);
+
+ if (rte_bitset_count_set(bitset, size) != 0)
+ return TEST_FAILED;
+
+ if (rte_bitset_count_clear(bitset, size) != size)
+ return TEST_FAILED;
+
+ rte_bitset_set(bitset, rte_rand_max(size));
+
+ if (rte_bitset_count_set(bitset, size) != 1)
+ return TEST_FAILED;
+
+ if (rte_bitset_count_clear(bitset, size) != (size - 1))
+ return TEST_FAILED;
+
+ rte_bitset_clear_all(bitset, size);
+ if (rte_bitset_count_set(bitset, size) != 0)
+ return TEST_FAILED;
+ if (rte_bitset_count_clear(bitset, size) != size)
+ return TEST_FAILED;
+
+ rte_bitset_set_all(bitset, size);
+ if (rte_bitset_count_set(bitset, size) != size)
+ return TEST_FAILED;
+ if (rte_bitset_count_clear(bitset, size) != 0)
+ return TEST_FAILED;
+
+ if (free_bitset(bitset, size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_count(void)
+{
+ size_t i;
+
+ if (test_count_size(128) != TEST_SUCCESS)
+ return TEST_FAILED;
+ if (test_count_size(1) != TEST_SUCCESS)
+ return TEST_FAILED;
+ if (test_count_size(63) != TEST_SUCCESS)
+ return TEST_FAILED;
+ if (test_count_size(64) != TEST_SUCCESS)
+ return TEST_FAILED;
+ if (test_count_size(65) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ for (i = 0; i < RAND_ITERATIONS; i++) {
+ size_t size = 1 + rte_rand_max(RAND_SET_MAX_SIZE - 1);
+
+ if (test_count_size(size) != TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+#define GEN_DECLARE(size) \
+ do { \
+ RTE_BITSET_DECLARE(bitset, size); \
+ size_t idx; \
+ \
+ idx = rte_rand_max(size); \
+ rte_bitset_init(bitset, size); \
+ \
+ rte_bitset_set(bitset, idx); \
+ if (!rte_bitset_test(bitset, idx)) \
+ return TEST_FAILED; \
+ if (rte_bitset_count_set(bitset, size) != 1) \
+ return TEST_FAILED; \
+ } while (0)
+
+static int
+test_define(void)
+{
+ GEN_DECLARE(1);
+ GEN_DECLARE(64);
+ GEN_DECLARE(65);
+ GEN_DECLARE(4097);
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_equal(void)
+{
+ const size_t size = 100;
+ RTE_BITSET_DECLARE(bitset_a, size);
+ RTE_BITSET_DECLARE(bitset_b, size);
+
+ rand_buf(bitset_a, RTE_BITSET_SIZE(size));
+ rand_buf(bitset_b, RTE_BITSET_SIZE(size));
+
+ rte_bitset_init(bitset_a, size);
+ rte_bitset_init(bitset_b, size);
+
+ rte_bitset_set(bitset_a, 9);
+ rte_bitset_set(bitset_b, 9);
+ rte_bitset_set(bitset_a, 90);
+ rte_bitset_set(bitset_b, 90);
+
+ if (!rte_bitset_equal(bitset_a, bitset_b, size))
+ return TEST_FAILED;
+
+ /* set unused bit, which should be ignored */
+ rte_bitset_set(&bitset_a[1], 60);
+
+ if (!rte_bitset_equal(bitset_a, bitset_b, size))
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_copy(void)
+{
+ const size_t size = 100;
+ RTE_BITSET_DECLARE(bitset_a, size);
+ RTE_BITSET_DECLARE(bitset_b, size);
+
+ rand_buf(bitset_a, RTE_BITSET_SIZE(size));
+ rand_buf(bitset_b, RTE_BITSET_SIZE(size));
+
+ if (rte_bitset_equal(bitset_a, bitset_b, size))
+ return TEST_FAILED;
+
+ rte_bitset_copy(bitset_a, bitset_b, size);
+
+ if (!rte_bitset_equal(bitset_a, bitset_b, size))
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_to_str(void)
+{
+ char buf[1024];
+ RTE_BITSET_DECLARE(bitset, 128);
+
+ rte_bitset_init(bitset, 128);
+ rte_bitset_set(bitset, 1);
+
+ if (rte_bitset_to_str(bitset, 2, buf, 3) != 3)
+ return TEST_FAILED;
+ if (strcmp(buf, "10") != 0)
+ return TEST_FAILED;
+
+ rte_bitset_set(bitset, 0);
+
+ if (rte_bitset_to_str(bitset, 1, buf, sizeof(buf)) != 2)
+ return TEST_FAILED;
+ if (strcmp(buf, "1") != 0)
+ return TEST_FAILED;
+
+ rte_bitset_init(bitset, 99);
+ rte_bitset_set(bitset, 98);
+
+ if (rte_bitset_to_str(bitset, 99, buf, sizeof(buf)) != 100)
+ return TEST_FAILED;
+
+ if (buf[0] != '1' || strchr(&buf[1], '1') != NULL)
+ return TEST_FAILED;
+
+ if (rte_bitset_to_str(bitset, 128, buf, 64) != -EINVAL)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_bitset(void)
+{
+ if (test_test_set() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_find() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_foreach() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_count() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_define() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_equal() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_copy() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_to_str() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(bitset_autotest, test_bitset);
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..687ae51d87 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -32,6 +32,7 @@ sources += files(
'eal_common_uuid.c',
'malloc_elem.c',
'malloc_heap.c',
+ 'rte_bitset.c',
'rte_malloc.c',
'rte_random.c',
'rte_reciprocal.c',
diff --git a/lib/eal/common/rte_bitset.c b/lib/eal/common/rte_bitset.c
new file mode 100644
index 0000000000..35e55a64db
--- /dev/null
+++ b/lib/eal/common/rte_bitset.c
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include <errno.h>
+
+#include "rte_bitset.h"
+
+ssize_t
+rte_bitset_to_str(const uint64_t *bitset, size_t num_bits, char *buf,
+ size_t capacity)
+{
+ size_t i;
+
+ if (capacity < (num_bits + 1))
+ return -EINVAL;
+
+ for (i = 0; i < num_bits; i++) {
+ bool value;
+
+ value = rte_bitset_test(bitset, num_bits - 1 - i);
+
+ buf[i] = value ? '1' : '0';
+ }
+
+ buf[num_bits] = '\0';
+
+ return num_bits + 1;
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index b0db9b3b3a..fa3cb884e9 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -5,6 +5,7 @@ includes += include_directories('.')
headers += files(
'rte_alarm.h',
+ 'rte_bitset.h',
'rte_bitmap.h',
'rte_bitops.h',
'rte_branch_prediction.h',
diff --git a/lib/eal/include/rte_bitset.h b/lib/eal/include/rte_bitset.h
new file mode 100644
index 0000000000..e333e527e5
--- /dev/null
+++ b/lib/eal/include/rte_bitset.h
@@ -0,0 +1,878 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_BITSET_H_
+#define _RTE_BITSET_H_
+
+/**
+ * @file
+ * RTE Bitset
+ *
+ * This file provides functions and macros for querying and
+ * manipulating sets of bits kept in arrays of @c uint64_t-sized
+ * elements.
+ *
+ * The bits in a bitset are numbered from 0 to @c size - 1, with the
+ * lowest index being the least significant bit.
+ *
+ * The bitset array must be properly aligned.
+ *
+ * For optimal performance, the @c size parameter, required by
+ * many of the API's functions, should be a compile-time constant.
+ *
+ * For large bitsets, the rte_bitmap.h API may be more appropriate.
+ *
+ * @warning
+ * All functions modifying a bitset may overwrite any unused bits of
+ * the last word. Such unused bits are ignored by all functions reading
+ * bits.
+ *
+ */
+
+#include <limits.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <sys/types.h>
+
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_debug.h>
+#include <rte_memcpy.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * The size (in bytes) of each element in the array used to represent
+ * a bitset.
+ */
+#define RTE_BITSET_WORD_SIZE (sizeof(uint64_t))
+
+/**
+ * The size (in bits) of each element in the array used to represent
+ * a bitset.
+ */
+#define RTE_BITSET_WORD_BITS (RTE_BITSET_WORD_SIZE * CHAR_BIT)
+
+/**
+ * Computes the number of words required to store @c size bits.
+ */
+#define RTE_BITSET_NUM_WORDS(size) \
+ (((size) + RTE_BITSET_WORD_BITS - 1) / RTE_BITSET_WORD_BITS)
+
+/**
+ * Computes the amount of memory (in bytes) required to fit a bitset
+ * holding @c size bits.
+ */
+#define RTE_BITSET_SIZE(size) \
+ ((size_t)(RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_SIZE))
+
+#define __RTE_BITSET_WORD_IDX(bit_num) ((bit_num) / RTE_BITSET_WORD_BITS)
+#define __RTE_BITSET_BIT_OFFSET(bit_num) ((bit_num) % RTE_BITSET_WORD_BITS)
+#define __RTE_BITSET_UNUSED(size) \
+ ((RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_BITS) \
+ - (size))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Declare a bitset.
+ *
+ * Declare (e.g., as a struct field) or define (e.g., as a stack
+ * variable) a bitset of the specified size.
+ *
+ * @param name
+ * The field or variable name of the resulting definition.
+ * @param size
+ * The number of bits the bitset must be able to represent. Must be
+ * a compile-time constant.
+#define RTE_BITSET_DECLARE(name, size) \
+ uint64_t name[RTE_BITSET_NUM_WORDS(size)]
+
+/* XXX: should one include flags here and use to avoid a comparison? */
+/* XXX: would this be better off as a function? */
+
+#define __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, len) \
+ ((len) - 1 - ((var) >= (start_bit) ? (var) - (start_bit) : \
+ (size) - (start_bit) + (var)))
+
+#define __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, flags) \
+ for ((var) = __rte_bitset_find(bitset, size, start_bit, len, \
+ flags); \
+ (var) != -1; \
+ (var) = __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, \
+ len) > 0 ? \
+ __rte_bitset_find(bitset, size, \
+ ((var) + 1) % (size), \
+ __RTE_BITSET_FOREACH_LEFT(var, \
+ size, \
+ start_bit, \
+ len), \
+ flags) : -1)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Iterate over all bits set.
+ *
+ * This macro iterates over all bits set (i.e., all ones) in the
+ * bitset, in the forward direction (i.e., starting with the least
+ * significant '1').
+ *
+ * @param var
+ * An iterator variable of type @c ssize_t. For each successive iteration,
+ * this variable will hold the bit index of a set bit.
+ * @param bitset
+ * A <tt>const uint64_t *</tt> pointer to the bitset array.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+#define RTE_BITSET_FOREACH_SET(var, bitset, size) \
+ __RTE_BITSET_FOREACH(var, bitset, size, 0, size, 0)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Iterate over all bits cleared.
+ *
+ * This macro iterates over all bits cleared in the bitset, in the
+ * forward direction (i.e., starting with the lowest-indexed cleared bit).
+ *
+ * @param var
+ * An iterator variable of type @c ssize_t. For each successive iteration,
+ * this variable will hold the bit index of a cleared bit.
+ * @param bitset
+ * A <tt>const uint64_t *</tt> pointer to the bitset array.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+#define RTE_BITSET_FOREACH_CLEAR(var, bitset, size) \
+ __RTE_BITSET_FOREACH(var, bitset, size, 0, size, \
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Iterate over all bits set within a range.
+ *
+ * This macro iterates over all bits set (i.e., all ones) in the
+ * specified range, in the forward direction (i.e., starting with the
+ * least significant '1').
+ *
+ * @param var
+ * An iterator variable of type @c ssize_t. For each successive iteration,
+ * this variable will hold the bit index of a set bit.
+ * @param bitset
+ * A <tt>const uint64_t *</tt> pointer to the bitset array.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The length (in bits) of the range. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ */
+
+#define RTE_BITSET_FOREACH_SET_RANGE(var, bitset, size, start_bit, \
+ len) \
+ __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, 0)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Iterate over all cleared bits within a range.
+ *
+ * This macro iterates over all bits cleared (i.e., all zeroes) in the
+ * specified range, in the forward direction (i.e., starting with the
+ * least significant '0').
+ *
+ * @param var
+ * An iterator variable of type @c ssize_t. For each successive iteration,
+ * this variable will hold the bit index of a cleared bit.
+ * @param bitset
+ * A <tt>const uint64_t *</tt> pointer to the bitset array.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The length (in bits) of the range. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ */
+
+#define RTE_BITSET_FOREACH_CLEAR_RANGE(var, bitset, size, start_bit, \
+ len) \
+ __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
+
+#define RTE_BITSET_FOREACH_SET_WRAP(var, bitset, size, start_bit, \
+ len) \
+ __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
+ __RTE_BITSET_FIND_FLAG_WRAP)
+
+#define RTE_BITSET_FOREACH_CLEAR_WRAP(var, bitset, size, start_bit, \
+ len) \
+ __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
+ __RTE_BITSET_FIND_FLAG_WRAP | \
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Initializes a bitset.
+ *
+ * All bits are cleared.
+ *
+ * @param bitset
+ * A pointer to the array of bitset 64-bit words.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_init(uint64_t *bitset, size_t size)
+{
+ memset(bitset, 0, RTE_BITSET_SIZE(size));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set a bit in the bitset.
+ *
+ * Bits are numbered from 0 to (size - 1) (inclusive).
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param bit_num
+ * The index of the bit to be set.
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_set(uint64_t *bitset, size_t bit_num)
+{
+ size_t word;
+ size_t offset;
+ uint64_t mask;
+
+ word = __RTE_BITSET_WORD_IDX(bit_num);
+ offset = __RTE_BITSET_BIT_OFFSET(bit_num);
+ mask = UINT64_C(1) << offset;
+
+ bitset[word] |= mask;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Clear a bit in the bitset.
+ *
+ * Bits are numbered 0 to (size - 1) (inclusive).
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param bit_num
+ * The index of the bit to be cleared.
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_clear(uint64_t *bitset, size_t bit_num)
+{
+ size_t word;
+ size_t offset;
+ uint64_t mask;
+
+ word = __RTE_BITSET_WORD_IDX(bit_num);
+ offset = __RTE_BITSET_BIT_OFFSET(bit_num);
+ mask = ~(UINT64_C(1) << offset);
+
+ bitset[word] &= mask;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set all bits in the bitset.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_set_all(uint64_t *bitset, size_t size)
+{
+ memset(bitset, 0xFF, RTE_BITSET_SIZE(size));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Clear all bits in the bitset.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_clear_all(uint64_t *bitset, size_t size)
+{
+ rte_bitset_init(bitset, size);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Count all set bits.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @return
+ * Returns the number of '1' bits in the bitset.
+ */
+
+__rte_experimental
+static inline size_t
+rte_bitset_count_set(const uint64_t *bitset, size_t size)
+{
+ size_t i;
+ size_t total = 0;
+ uint64_t unused_mask;
+
+ /*
+ * Unused bits in a rte_bitset are always '0', and thus are
+ * not included in this count.
+ */
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size) - 1; i++)
+ total += __builtin_popcountll(bitset[i]);
+
+ unused_mask = UINT64_MAX >> __RTE_BITSET_UNUSED(size);
+ total += __builtin_popcountll(bitset[i] & unused_mask);
+
+ return total;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Count all cleared bits.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @return
+ * Returns the number of '0' bits in the bitset.
+ */
+
+__rte_experimental
+static inline size_t
+rte_bitset_count_clear(const uint64_t *bitset, size_t size)
+{
+ return size - rte_bitset_count_set(bitset, size);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Test if a bit is set.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param bit_num
+ * Index of the bit to test. Index 0 is the least significant bit.
+ * @return
+ * Returns true if the bit is '1', and false if the bit is '0'.
+ */
+
+__rte_experimental
+static inline bool
+rte_bitset_test(const uint64_t *bitset, size_t bit_num)
+{
+ size_t word;
+ size_t offset;
+
+ word = __RTE_BITSET_WORD_IDX(bit_num);
+ offset = __RTE_BITSET_BIT_OFFSET(bit_num);
+
+ return (bitset[word] >> offset) & 1;
+}
+
+#define __RTE_BITSET_FIND_FLAG_FIND_CLEAR (1U << 0)
+#define __RTE_BITSET_FIND_FLAG_WRAP (1U << 1)
+
+__rte_experimental
+static inline ssize_t
+__rte_bitset_find_nowrap(const uint64_t *bitset, size_t __rte_unused size,
+ size_t start_bit, size_t len, bool find_clear)
+{
+ size_t word_idx;
+ size_t offset;
+ size_t end_bit = start_bit + len;
+
+ RTE_ASSERT(end_bit <= size);
+
+ if (unlikely(len == 0))
+ return -1;
+
+ word_idx = __RTE_BITSET_WORD_IDX(start_bit);
+ offset = __RTE_BITSET_BIT_OFFSET(start_bit);
+
+ while (word_idx <= __RTE_BITSET_WORD_IDX(end_bit - 1)) {
+ uint64_t word;
+ int word_ffs;
+
+ word = bitset[word_idx];
+ if (find_clear)
+ word = ~word;
+
+ word >>= offset;
+
+ word_ffs = __builtin_ffsll(word);
+
+ if (word_ffs != 0) {
+ ssize_t ffs = start_bit + word_ffs - 1;
+
+ /*
+ * Check if the bit found is beyond the end of
+ * the range (e.g., among the unused bits of
+ * the last word).
+ */
+ if (unlikely(ffs >= (ssize_t)end_bit))
+ return -1;
+
+ return ffs;
+ }
+
+ start_bit += (RTE_BITSET_WORD_BITS - offset);
+ word_idx++;
+ offset = 0;
+ }
+
+ return -1;
+}
+
+__rte_experimental
+static inline ssize_t
+__rte_bitset_find(const uint64_t *bitset, size_t size, size_t start_bit,
+ size_t len, unsigned int flags)
+{
+ bool find_clear = flags & __RTE_BITSET_FIND_FLAG_FIND_CLEAR;
+ bool may_wrap = flags & __RTE_BITSET_FIND_FLAG_WRAP;
+ bool does_wrap = (start_bit + len) > size;
+ ssize_t rc;
+
+ RTE_ASSERT(len <= size);
+ if (!may_wrap)
+ RTE_ASSERT(!does_wrap);
+
+ if (may_wrap && does_wrap) {
+ size_t len0 = size - start_bit;
+ size_t len1 = len - len0;
+
+ rc = __rte_bitset_find_nowrap(bitset, size, start_bit, len0,
+ find_clear);
+ if (rc < 0)
+ rc = __rte_bitset_find_nowrap(bitset, size,
+ 0, len1, find_clear);
+ } else
+ rc = __rte_bitset_find_nowrap(bitset, size, start_bit,
+ len, find_clear);
+
+ return rc;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first bit set.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), and returns the index of the first '1'.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @return
+ * Returns the index of the least significant '1', or -1 if all
+ * bits are '0'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_first_set(const uint64_t *bitset, size_t size)
+{
+ return __rte_bitset_find(bitset, size, 0, size, 0);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first bit set at offset.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), starting at an offset @c start_bit into the
+ * bitset, and returns the index of the first '1' encountered.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The number of bits to scan. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ * @return
+ * Returns the index of the least significant '1', or -1 if all
+ * bits are '0'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_set(const uint64_t *bitset, size_t size,
+ size_t start_bit, size_t len)
+{
+ return __rte_bitset_find(bitset, size, start_bit, len, 0);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first bit set at offset, with wrap-around.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), starting at an offset @c start_bit into the
+ * bitset. If no '1' is encountered before the end of the bitset, the search
+ * will continue at index 0.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The number of bits to scan. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ * @return
+ * Returns the index of the least significant '1', or -1 if all
+ * bits are '0'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_set_wrap(const uint64_t *bitset, size_t size,
+ size_t start_bit, size_t len)
+{
+ return __rte_bitset_find(bitset, size, start_bit, len,
+ __RTE_BITSET_FIND_FLAG_WRAP);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first cleared bit.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), and returns the index of the first '0'.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @return
+ * Returns the index of the least significant '0', or -1 if all
+ * bits are '1'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_first_clear(const uint64_t *bitset, size_t size)
+{
+ return __rte_bitset_find(bitset, size, 0, size,
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first cleared bit at offset.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), starting at an offset @c start_bit into the
+ * bitset, and returns the index of the first '0' encountered.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The number of bits to scan. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ * @return
+ * Returns the index of the least significant '0', or -1 if all
+ * bits are '1'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_clear(const uint64_t *bitset, size_t size,
+ size_t start_bit, size_t len)
+{
+ return __rte_bitset_find(bitset, size, start_bit, len,
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first cleared bit at offset, with wrap-around.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), starting at an offset @c start_bit into the
+ * bitset. If no '0' is encountered before the end of the bitset, the
+ * search will continue at index 0.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The number of bits to scan. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ * @return
+ * Returns the index of the least significant '0', or -1 if all
+ * bits are '1'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_clear_wrap(const uint64_t *bitset, size_t size,
+ size_t start_bit, size_t len)
+{
+ return __rte_bitset_find(bitset, size, start_bit, len,
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR |
+ __RTE_BITSET_FIND_FLAG_WRAP);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Copy bitset.
+ *
+ * Copy the bits of the @c src_bitset to the @c dst_bitset.
+ *
+ * The bitsets must not overlap and must be of equal size.
+ *
+ * @param dst_bitset
+ * A pointer to the array of words making up the bitset.
+ * @param src_bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_copy(uint64_t *__rte_restrict dst_bitset,
+ const uint64_t *__rte_restrict src_bitset,
+ size_t size)
+{
+ rte_memcpy(dst_bitset, src_bitset, RTE_BITSET_SIZE(size));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Bitwise or two bitsets.
+ *
+ * Perform a bitwise OR operation on all bits in the two equal-size
+ * bitsets @c dst_bitset and @c src_bitset, and store the results in
+ * @c dst_bitset.
+ *
+ * @param dst_bitset
+ * A pointer to the destination bitset.
+ * @param src_bitset
+ * A pointer to the source bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_or(uint64_t *dst_bitset, const uint64_t *src_bitset, size_t size)
+{
+ size_t i;
+
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size); i++)
+ dst_bitset[i] |= src_bitset[i];
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Bitwise and two bitsets.
+ *
+ * Perform a bitwise AND operation on all bits in the two equal-size
+ * bitsets @c dst_bitset and @c src_bitset, and store the results in
+ * @c dst_bitset.
+ *
+ * @param dst_bitset
+ * A pointer to the destination bitset.
+ * @param src_bitset
+ * A pointer to the source bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_and(uint64_t *dst_bitset, const uint64_t *src_bitset, size_t size)
+{
+ size_t i;
+
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size); i++)
+ dst_bitset[i] &= src_bitset[i];
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Bitwise xor two bitsets.
+ *
+ * Perform a bitwise XOR operation on all bits in the two equal-size
+ * bitsets @c dst_bitset and @c src_bitset, and store the results in
+ * @c dst_bitset.
+ *
+ * @param dst_bitset
+ * A pointer to the destination bitset.
+ * @param src_bitset
+ * A pointer to the source bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_xor(uint64_t *__rte_restrict dst_bitset,
+ const uint64_t *__rte_restrict src_bitset, size_t size)
+{
+ size_t i;
+
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size); i++)
+ dst_bitset[i] ^= src_bitset[i];
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Compare two bitsets.
+ *
+ * Compare two bitsets for equality.
+ *
+ * @param bitset_a
+ * A pointer to the first bitset.
+ * @param bitset_b
+ * A pointer to the second bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ * @return
+ * Returns true if the bitsets are equal, and false otherwise.
+ */
+
+__rte_experimental
+static inline bool
+rte_bitset_equal(const uint64_t *bitset_a, const uint64_t *bitset_b,
+ size_t size)
+{
+ size_t i;
+ uint64_t last_a, last_b;
+
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size) - 1; i++)
+ if (bitset_a[i] != bitset_b[i])
+ return false;
+
+ last_a = bitset_a[i] << __RTE_BITSET_UNUSED(size);
+ last_b = bitset_b[i] << __RTE_BITSET_UNUSED(size);
+
+ return last_a == last_b;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Converts a bitset to a string.
+ *
+ * This function prints a string representation of the bitset to
+ * the supplied buffer.
+ *
+ * Each bit is represented either by '0' or '1' in the output. The
+ * resulting string is NUL terminated.
+ *
+ * @param bitset
+ * A pointer to the array of bitset 64-bit words.
+ * @param size
+ * The number of bits the bitset represents.
+ * @param buf
+ * A buffer to hold the output.
+ * @param capacity
+ * The size of the buffer. Must be @c size + 1 or larger.
+ * @return
+ * Returns the number of bytes written (i.e., @c size + 1), or -EINVAL
+ * in case the buffer capacity was too small.
+ */
+
+__rte_experimental
+ssize_t
+rte_bitset_to_str(const uint64_t *bitset, size_t size, char *buf,
+ size_t capacity);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_BITSET_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 6d6978f108..9136a71c73 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -430,6 +430,9 @@ EXPERIMENTAL {
rte_thread_create_control;
rte_thread_set_name;
__rte_eal_trace_generic_blob;
+
+ # added in X.Y
+ rte_bitset_to_str;
};
INTERNAL {
--
2.34.1
* [RFC 2/2] eal: add high-performance timer facility
2023-02-28 9:39 [RFC 0/2] Add high-performance timer facility Mattias Rönnblom
2023-02-28 9:39 ` [RFC 1/2] eal: add bitset type Mattias Rönnblom
@ 2023-02-28 9:39 ` Mattias Rönnblom
2023-03-05 17:25 ` Stephen Hemminger
2023-02-28 16:01 ` [RFC 0/2] Add " Morten Brørup
2023-03-15 17:03 ` [RFC v2 " Mattias Rönnblom
3 siblings, 1 reply; 31+ messages in thread
From: Mattias Rönnblom @ 2023-02-28 9:39 UTC (permalink / raw)
To: dev
Cc: Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Mattias Rönnblom
The htimer library aims to provide a timer facility with roughly the
same functionality as, but lower overhead and better scalability
than, the DPDK timer library.
The htimer library employs per-lcore hierarchical timer wheels and a
message-based synchronization/MT-safety scheme.
Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
app/test/meson.build | 8 +-
app/test/test_htimer_mgr.c | 674 +++++++++++++++++++++++++++++++
app/test/test_htimer_mgr_perf.c | 324 +++++++++++++++
app/test/test_htw.c | 478 ++++++++++++++++++++++
app/test/test_htw_perf.c | 181 +++++++++
doc/api/doxy-api-index.md | 5 +-
doc/api/doxy-api.conf.in | 1 +
lib/htimer/meson.build | 7 +
lib/htimer/rte_htimer.h | 65 +++
lib/htimer/rte_htimer_mgr.c | 488 ++++++++++++++++++++++
lib/htimer/rte_htimer_mgr.h | 497 +++++++++++++++++++++++
lib/htimer/rte_htimer_msg.h | 44 ++
lib/htimer/rte_htimer_msg_ring.c | 18 +
lib/htimer/rte_htimer_msg_ring.h | 49 +++
lib/htimer/rte_htw.c | 437 ++++++++++++++++++++
lib/htimer/rte_htw.h | 49 +++
lib/htimer/version.map | 17 +
lib/meson.build | 1 +
18 files changed, 3341 insertions(+), 2 deletions(-)
create mode 100644 app/test/test_htimer_mgr.c
create mode 100644 app/test/test_htimer_mgr_perf.c
create mode 100644 app/test/test_htw.c
create mode 100644 app/test/test_htw_perf.c
create mode 100644 lib/htimer/meson.build
create mode 100644 lib/htimer/rte_htimer.h
create mode 100644 lib/htimer/rte_htimer_mgr.c
create mode 100644 lib/htimer/rte_htimer_mgr.h
create mode 100644 lib/htimer/rte_htimer_msg.h
create mode 100644 lib/htimer/rte_htimer_msg_ring.c
create mode 100644 lib/htimer/rte_htimer_msg_ring.h
create mode 100644 lib/htimer/rte_htw.c
create mode 100644 lib/htimer/rte_htw.h
create mode 100644 lib/htimer/version.map
diff --git a/app/test/meson.build b/app/test/meson.build
index 03811ff692..5a48775a60 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -143,6 +143,10 @@ test_sources = files(
'test_timer_perf.c',
'test_timer_racecond.c',
'test_timer_secondary.c',
+ 'test_htw.c',
+ 'test_htw_perf.c',
+ 'test_htimer_mgr_perf.c',
+ 'test_htimer_mgr.c',
'test_ticketlock.c',
'test_trace.c',
'test_trace_register.c',
@@ -165,7 +169,6 @@ fast_tests = [
['bpf_autotest', true, true],
['bpf_convert_autotest', true, true],
['bitops_autotest', true, true],
- ['bitset_autotest', true, true],
['byteorder_autotest', true, true],
['cksum_autotest', true, true],
['cmdline_autotest', true, true],
@@ -193,6 +196,7 @@ fast_tests = [
['fib6_autotest', true, true],
['func_reentrancy_autotest', false, true],
['hash_autotest', true, true],
+ ['htimer_mgr_autotest', true, true],
['interrupt_autotest', true, true],
['ipfrag_autotest', false, true],
['lcores_autotest', true, true],
@@ -265,6 +269,8 @@ perf_test_names = [
'memcpy_perf_autotest',
'hash_perf_autotest',
'timer_perf_autotest',
+ 'htimer_mgr_perf_autotest',
+ 'htw_perf_autotest',
'reciprocal_division',
'reciprocal_division_perf',
'lpm_perf_autotest',
diff --git a/app/test/test_htimer_mgr.c b/app/test/test_htimer_mgr.c
new file mode 100644
index 0000000000..4d82a5e8b0
--- /dev/null
+++ b/app/test/test_htimer_mgr.c
@@ -0,0 +1,674 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include "test.h"
+
+#include <sys/queue.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_htimer_mgr.h>
+#include <rte_launch.h>
+#include <rte_lcore.h>
+#include <rte_random.h>
+
+static int
+timer_lcore(void *arg)
+{
+ bool *stop = arg;
+
+ while (!__atomic_load_n(stop, __ATOMIC_RELAXED))
+ rte_htimer_mgr_manage();
+
+ return 0;
+}
+
+static void
+count_timer_cb(struct rte_htimer *timer __rte_unused, void *arg)
+{
+ unsigned int *count = arg;
+
+ __atomic_fetch_add(count, 1, __ATOMIC_RELAXED);
+}
+
+static void
+count_async_cb(struct rte_htimer *timer __rte_unused, int result,
+ void *cb_arg)
+{
+ unsigned int *count = cb_arg;
+
+ if (result == RTE_HTIMER_MGR_ASYNC_RESULT_ADDED)
+ __atomic_fetch_add(count, 1, __ATOMIC_RELAXED);
+}
+
+static uint64_t s_to_tsc(double s)
+{
+ return s * rte_get_tsc_hz();
+}
+
+#define ASYNC_ADD_TEST_EXPIRATION_TIME 0.25 /* s */
+#define ASYNC_TEST_TICK s_to_tsc(1e-6)
+
+static int
+test_htimer_mgr_async_add(unsigned int num_timers_per_lcore)
+{
+ struct rte_htimer *timers;
+ unsigned int timer_idx;
+ unsigned int lcore_id;
+ bool stop = false;
+ unsigned int timeout_count = 0;
+ unsigned int async_count = 0;
+ unsigned int num_workers = 0;
+ uint64_t expiration_time;
+ unsigned int num_total_timers;
+
+ rte_htimer_mgr_init(ASYNC_TEST_TICK);
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ if (rte_eal_remote_launch(timer_lcore, &stop, lcore_id) != 0)
+ rte_panic("Unable to launch timer lcore\n");
+ num_workers++;
+ }
+
+ num_total_timers = num_workers * num_timers_per_lcore;
+
+ timers = malloc(num_total_timers * sizeof(struct rte_htimer));
+ timer_idx = 0;
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate heap memory\n");
+
+ expiration_time = rte_get_tsc_hz() * ASYNC_ADD_TEST_EXPIRATION_TIME;
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ unsigned int i;
+
+ for (i = 0; i < num_timers_per_lcore; i++) {
+ struct rte_htimer *timer = &timers[timer_idx++];
+
+ for (;;) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_add(timer, lcore_id,
+ expiration_time,
+ 0,
+ count_timer_cb,
+ &timeout_count, 0,
+ count_async_cb,
+ &async_count);
+ if (unlikely(rc == -EBUSY))
+ rte_htimer_mgr_process();
+ else
+ break;
+ }
+ }
+ }
+
+ while (__atomic_load_n(&async_count, __ATOMIC_RELAXED) !=
+ num_total_timers ||
+ __atomic_load_n(&timeout_count, __ATOMIC_RELAXED) !=
+ num_total_timers)
+ rte_htimer_mgr_manage();
+
+ __atomic_store_n(&stop, true, __ATOMIC_RELAXED);
+
+ rte_eal_mp_wait_lcore();
+
+ rte_htimer_mgr_deinit();
+
+ free(timers);
+
+ return TEST_SUCCESS;
+}
+
+struct async_recorder_state {
+ bool timer_cb_run;
+ bool async_add_cb_run;
+ bool async_cancel_cb_run;
+ bool failed;
+};
+
+static void
+record_async_add_cb(struct rte_htimer *timer __rte_unused,
+ int result, void *cb_arg)
+{
+ struct async_recorder_state *state = cb_arg;
+
+ if (state->failed)
+ return;
+
+ if (state->async_add_cb_run ||
+ result != RTE_HTIMER_MGR_ASYNC_RESULT_ADDED) {
+ puts("async add callback already run, or unexpected result");
+ state->failed = true;
+ }
+
+ state->async_add_cb_run = true;
+}
+
+static void
+record_async_cancel_cb(struct rte_htimer *timer __rte_unused,
+ int result, void *cb_arg)
+{
+ struct async_recorder_state *state = cb_arg;
+
+ if (state->failed)
+ return;
+
+ if (state->async_cancel_cb_run) {
+ state->failed = true;
+ return;
+ }
+
+ switch (result) {
+ case RTE_HTIMER_MGR_ASYNC_RESULT_EXPIRED:
+ if (!state->timer_cb_run)
+ state->failed = true;
+ break;
+ case RTE_HTIMER_MGR_ASYNC_RESULT_CANCELED:
+ if (state->timer_cb_run)
+ state->failed = true;
+ break;
+ case RTE_HTIMER_MGR_ASYNC_RESULT_ALREADY_CANCELED:
+ state->failed = true;
+ }
+
+ state->async_cancel_cb_run = true;
+}
+
+static int
+record_check_consistency(struct async_recorder_state *state)
+{
+ if (state->failed)
+ return -1;
+
+ return state->async_cancel_cb_run ? 1 : 0;
+}
+
+static int
+records_check_consistency(struct async_recorder_state *states,
+ unsigned int num_states)
+{
+ unsigned int i;
+ int canceled = 0;
+
+ for (i = 0; i < num_states; i++) {
+ int rc;
+
+ rc = record_check_consistency(&states[i]);
+
+ if (rc < 0)
+ return -1;
+ canceled += rc;
+ }
+
+ return canceled;
+}
+
+static void
+log_timer_expiry_cb(struct rte_htimer *timer __rte_unused,
+ void *arg)
+{
+ bool *timer_run = arg;
+
+ *timer_run = true;
+}
+
+#define ASYNC_ADD_CANCEL_TEST_EXPIRATION_TIME_MAX 10e-3 /* s */
+
+static int
+test_htimer_mgr_async_add_cancel(unsigned int num_timers_per_lcore)
+{
+ struct rte_htimer *timers;
+ struct async_recorder_state *recorder_states;
+ unsigned int timer_idx = 0;
+ unsigned int lcore_id;
+ uint64_t now;
+ unsigned int num_workers = 0;
+ bool stop = false;
+ uint64_t max_expiration_time =
+ s_to_tsc(ASYNC_ADD_CANCEL_TEST_EXPIRATION_TIME_MAX);
+ unsigned int num_total_timers;
+ int canceled = 0;
+
+ rte_htimer_mgr_init(ASYNC_TEST_TICK);
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ if (rte_eal_remote_launch(timer_lcore, &stop, lcore_id) != 0)
+ rte_panic("Unable to launch timer lcore\n");
+ num_workers++;
+ }
+
+ num_total_timers = num_workers * num_timers_per_lcore;
+
+ timers = malloc(num_total_timers * sizeof(struct rte_htimer));
+ recorder_states =
+ malloc(num_total_timers * sizeof(struct async_recorder_state));
+
+ if (timers == NULL || recorder_states == NULL)
+ rte_panic("Unable to allocate heap memory\n");
+
+ now = rte_get_tsc_cycles();
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ unsigned int i;
+
+ for (i = 0; i < num_timers_per_lcore; i++) {
+ struct rte_htimer *timer = &timers[timer_idx];
+ struct async_recorder_state *state =
+ &recorder_states[timer_idx];
+
+ timer_idx++;
+
+ *state = (struct async_recorder_state) {};
+
+ uint64_t expiration_time =
+ now + rte_rand_max(max_expiration_time);
+
+ for (;;) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_add(timer, lcore_id,
+ expiration_time,
+ 0,
+ log_timer_expiry_cb,
+ &state->timer_cb_run,
+ 0,
+ record_async_add_cb,
+ state);
+
+ if (unlikely(rc == -EBUSY))
+ rte_htimer_mgr_process();
+ else
+ break;
+ }
+ }
+ }
+
+ timer_idx = 0;
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ unsigned int i;
+
+ for (i = 0; i < num_timers_per_lcore; i++) {
+ struct rte_htimer *timer = &timers[timer_idx];
+ struct async_recorder_state *state =
+ &recorder_states[timer_idx];
+
+ timer_idx++;
+
+ /* cancel roughly half of the timers */
+ if (rte_rand_max(2) == 0)
+ continue;
+
+ for (;;) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_cancel(timer,
+ record_async_cancel_cb,
+ state);
+
+ if (unlikely(rc == -EBUSY)) {
+ puts("busy");
+ rte_htimer_mgr_process();
+ } else
+ break;
+ }
+
+ canceled++;
+ }
+ }
+
+ for (;;) {
+ int cancel_completed;
+
+ cancel_completed = records_check_consistency(recorder_states,
+ num_total_timers);
+
+ if (cancel_completed < 0) {
+ puts("Inconsistency found");
+ return TEST_FAILED;
+ }
+
+ if (cancel_completed == canceled)
+ break;
+
+ rte_htimer_mgr_process();
+ }
+
+ __atomic_store_n(&stop, true, __ATOMIC_RELAXED);
+
+ rte_eal_mp_wait_lcore();
+
+ rte_htimer_mgr_deinit();
+
+ free(timers);
+ free(recorder_states);
+
+ return TEST_SUCCESS;
+}
+
+/*
+ * This is a test case where one thread asynchronously adds two timers,
+ * with the same expiration time; one on the local lcore and one on a
+ * remote lcore. This creates a tricky situation for the timer
+ * manager, and for the application as well, if the htimer struct is
+ * dynamically allocated.
+ */
+
+struct test_timer {
+ uint32_t ref_cnt;
+ uint64_t expiration_time; /* in TSC, not tick */
+ uint32_t *timeout_count;
+ bool *failure_occured;
+ struct rte_htimer htimer;
+};
+
+static struct test_timer *
+test_timer_create(uint64_t expiration_time, uint32_t *timeout_count,
+ bool *failure_occured)
+{
+ struct test_timer *timer;
+
+ timer = malloc(sizeof(struct test_timer));
+
+ if (timer == NULL)
+ rte_panic("Unable to allocate timer memory\n");
+
+ timer->ref_cnt = 1;
+ timer->expiration_time = expiration_time;
+ timer->timeout_count = timeout_count;
+ timer->failure_occured = failure_occured;
+
+ return timer;
+}
+
+static void
+test_timer_inc_ref_cnt(struct test_timer *timer)
+{
+ __atomic_add_fetch(&timer->ref_cnt, 1, __ATOMIC_RELEASE);
+}
+
+static void
+test_timer_dec_ref_cnt(struct test_timer *timer)
+{
+ if (timer != NULL) {
+ uint32_t cnt = __atomic_sub_fetch(&timer->ref_cnt, 1,
+ __ATOMIC_RELEASE);
+ if (cnt == 0)
+ free(timer);
+ }
+}
+
+static void
+test_timer_cb(struct rte_htimer *timer, void *arg __rte_unused)
+{
+ struct test_timer *test_timer =
+ container_of(timer, struct test_timer, htimer);
+ uint64_t now = rte_get_tsc_cycles();
+
+ if (now < test_timer->expiration_time)
+ *(test_timer->failure_occured) = true;
+
+ __atomic_fetch_add(test_timer->timeout_count, 1, __ATOMIC_RELAXED);
+
+ test_timer_dec_ref_cnt(test_timer);
+}
+
+static int
+worker_lcore(void *arg)
+{
+ bool *stop = arg;
+
+ while (!__atomic_load_n(stop, __ATOMIC_RELAXED))
+ rte_htimer_mgr_manage();
+
+ return 0;
+}
+
+struct cancel_timer {
+ bool cancel;
+ struct rte_htimer *target_timer;
+ uint32_t *cancel_count;
+ uint32_t *expired_count;
+ bool *failure_occured;
+ struct rte_htimer htimer;
+};
+
+static struct cancel_timer *
+cancel_timer_create(bool cancel, struct rte_htimer *target_timer,
+ uint32_t *cancel_count, uint32_t *expired_count,
+ bool *failure_occured)
+{
+ struct cancel_timer *timer;
+
+ timer = malloc(sizeof(struct cancel_timer));
+
+ if (timer == NULL)
+ rte_panic("Unable to allocate timer memory\n");
+
+ timer->cancel = cancel;
+ timer->target_timer = target_timer;
+ timer->cancel_count = cancel_count;
+ timer->expired_count = expired_count;
+ timer->failure_occured = failure_occured;
+
+ return timer;
+}
+
+static void
+async_cancel_cb(struct rte_htimer *timer, int result, void *cb_arg)
+{
+ struct test_timer *test_timer =
+ container_of(timer, struct test_timer, htimer);
+ struct cancel_timer *cancel_timer = cb_arg;
+ bool *failure_occured = cancel_timer->failure_occured;
+
+ if (!cancel_timer->cancel || cancel_timer->target_timer != timer)
+ *failure_occured = true;
+
+ if (result == RTE_HTIMER_MGR_ASYNC_RESULT_CANCELED) {
+ uint32_t *cancel_count = cancel_timer->cancel_count;
+
+ /* decrease target lcore's ref count */
+ test_timer_dec_ref_cnt(test_timer);
+ (*cancel_count)++;
+ } else if (result == RTE_HTIMER_MGR_ASYNC_RESULT_EXPIRED) {
+ uint32_t *expired_count = cancel_timer->expired_count;
+
+ (*expired_count)++;
+	} else {
+		*failure_occured = true;
+	}
+
+ /* source lcore's ref count */
+ test_timer_dec_ref_cnt(test_timer);
+
+ free(cancel_timer);
+}
+
+static void
+cancel_timer_cb(struct rte_htimer *timer, void *arg __rte_unused)
+{
+ struct cancel_timer *cancel_timer =
+ container_of(timer, struct cancel_timer, htimer);
+
+ if (cancel_timer->cancel) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_cancel(cancel_timer->target_timer,
+ async_cancel_cb, cancel_timer);
+
+ if (rc == -EBUSY)
+ rte_htimer_mgr_add(timer, 0, 0, cancel_timer_cb,
+ NULL, 0);
+	} else {
+		free(cancel_timer);
+	}
+}
+
+#define REF_CNT_TEST_TICK s_to_tsc(10e-9)
+#define REF_CNT_AVG_EXPIRATION_TIME (50e-6)
+#define REF_CNT_MAX_EXPIRATION_TIME (2 * REF_CNT_AVG_EXPIRATION_TIME)
+#define REF_CNT_CANCEL_FUZZ(expiration_time) \
+ ((uint64_t)((expiration_time) * (rte_drand()/10 + 0.95)))
+
+static int
+test_htimer_mgr_ref_cnt_timers(unsigned int num_timers_per_lcore)
+{
+ unsigned int lcore_id;
+ bool stop = false;
+ unsigned int num_workers = 0;
+ struct test_timer **timers;
+ struct cancel_timer **cancel_timers;
+ unsigned int num_timers;
+ uint32_t timeout_count = 0;
+ uint32_t cancel_count = 0;
+ uint32_t expired_count = 0;
+ bool failure_occured = false;
+ unsigned int timer_idx;
+ unsigned int expected_cancel_attempts;
+ uint64_t deadline;
+ uint64_t now;
+
+ rte_htimer_mgr_init(REF_CNT_TEST_TICK);
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ if (rte_eal_remote_launch(worker_lcore, &stop, lcore_id) != 0)
+ rte_panic("Unable to launch timer lcore\n");
+ num_workers++;
+ }
+
+ /* give the workers a chance to get going */
+ rte_delay_us_block(10*1000);
+
+ num_timers = num_timers_per_lcore * num_workers;
+
+ timers = malloc(sizeof(struct test_timer *) * num_timers);
+ cancel_timers = malloc(sizeof(struct cancel_timer *) * num_timers);
+
+ if (timers == NULL || cancel_timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ timer_idx = 0;
+ expected_cancel_attempts = 0;
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ unsigned int i;
+
+ for (i = 0; i < num_timers_per_lcore; i++) {
+ uint64_t expiration_time;
+ struct test_timer *timer;
+ struct rte_htimer *htimer;
+ bool cancel;
+ struct cancel_timer *cancel_timer;
+ uint64_t cancel_expiration_time;
+
+ expiration_time =
+ s_to_tsc(REF_CNT_MAX_EXPIRATION_TIME *
+ rte_drand());
+
+ timer = test_timer_create(expiration_time,
+ &timeout_count,
+ &failure_occured);
+ htimer = &timer->htimer;
+
+ timers[timer_idx++] = timer;
+
+			/* for the target lcore's usage of this timer */
+ test_timer_inc_ref_cnt(timer);
+
+ for (;;) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_add(htimer, lcore_id,
+ expiration_time,
+ 0, test_timer_cb,
+ NULL, 0, NULL,
+ NULL);
+ if (unlikely(rc == -EBUSY))
+ rte_htimer_mgr_process();
+ else
+ break;
+ }
+
+ cancel = rte_rand_max(2);
+
+ cancel_timer =
+ cancel_timer_create(cancel, &timer->htimer,
+ &cancel_count,
+ &expired_count,
+ &failure_occured);
+
+ cancel_expiration_time =
+ REF_CNT_CANCEL_FUZZ(expiration_time);
+
+ rte_htimer_mgr_add(&cancel_timer->htimer,
+ cancel_expiration_time, 0,
+ cancel_timer_cb, NULL, 0);
+
+ if (cancel)
+ expected_cancel_attempts++;
+ }
+ }
+
+	deadline = rte_get_tsc_cycles() +
+		s_to_tsc(REF_CNT_MAX_EXPIRATION_TIME + 0.25);
+
+ do {
+ now = rte_get_tsc_cycles();
+
+ rte_htimer_mgr_manage_time(now);
+
+ } while (now < deadline);
+
+	__atomic_store_n(&stop, true, __ATOMIC_RELAXED);
+
+	rte_eal_mp_wait_lcore();
+
+	rte_htimer_mgr_deinit();
+
+	free(timers);
+	free(cancel_timers);
+
+	if (failure_occured)
+		return TEST_FAILED;
+
+	if ((cancel_count + expired_count) != expected_cancel_attempts)
+		return TEST_FAILED;
+
+	if (timeout_count != (num_timers - cancel_count))
+		return TEST_FAILED;
+
+	return TEST_SUCCESS;
+}
+
+static int
+test_htimer_mgr(void)
+{
+ int rc;
+
+ rc = test_htimer_mgr_async_add(1);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ rc = test_htimer_mgr_async_add(100000);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ rc = test_htimer_mgr_async_add_cancel(100);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ rc = test_htimer_mgr_ref_cnt_timers(10);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ rc = test_htimer_mgr_ref_cnt_timers(10000);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(htimer_mgr_autotest, test_htimer_mgr);
diff --git a/app/test/test_htimer_mgr_perf.c b/app/test/test_htimer_mgr_perf.c
new file mode 100644
index 0000000000..179b0ba6e1
--- /dev/null
+++ b/app/test/test_htimer_mgr_perf.c
@@ -0,0 +1,324 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include "test.h"
+
+#include <sys/queue.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_cycles.h>
+#include <rte_htimer_mgr.h>
+#include <rte_launch.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+static void
+nop_cb(struct rte_htimer *timer __rte_unused, void *arg __rte_unused)
+{
+}
+
+static uint64_t
+add_rand_timers(struct rte_htimer *timers, uint64_t num,
+ uint64_t timeout_start, uint64_t max_timeout)
+{
+ uint64_t i;
+ uint64_t expiration_times[num];
+ uint64_t start_ts;
+ uint64_t end_ts;
+
+ for (i = 0; i < num; i++)
+ expiration_times[i] =
+ 1 + timeout_start + rte_rand_max(max_timeout - 1);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (i = 0; i < num; i++)
+ rte_htimer_mgr_add(&timers[i], expiration_times[i], 0, nop_cb,
+ NULL, RTE_HTIMER_FLAG_ABSOLUTE_TIME);
+
+ /* make sure the timers are actually scheduled in the wheel */
+ rte_htimer_mgr_process();
+
+ end_ts = rte_get_tsc_cycles();
+
+ return end_ts - start_ts;
+}
+
+#define TIME_STEP 16
+
+static void
+test_add_manage_perf(const char *scenario_name, uint64_t num_timers,
+ uint64_t timespan)
+{
+ uint64_t manage_calls;
+ struct rte_htimer *timers;
+ uint64_t start;
+ uint64_t now;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ uint64_t add_latency;
+ uint64_t manage_latency;
+
+ rte_htimer_mgr_init(1);
+
+ manage_calls = timespan / TIME_STEP;
+
+ printf("Scenario: %s\n", scenario_name);
+ printf(" Configuration:\n");
+ printf(" Timers: %"PRIu64"\n", num_timers);
+ printf(" Max timeout: %"PRIu64" ticks\n", timespan);
+ printf(" Average timeouts/manage call: %.3f\n",
+ num_timers / (double)manage_calls);
+ printf(" Time advance per manage call: %d\n", TIME_STEP);
+
+ printf(" Results:\n");
+
+ timers = rte_malloc(NULL, sizeof(struct rte_htimer) * num_timers,
+ 0);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ start = 1 + rte_rand_max(UINT64_MAX / 2);
+
+ rte_htimer_mgr_manage_time(start - 1);
+
+ add_latency = add_rand_timers(timers, num_timers, start, timespan);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (now = start; now < (start + timespan); now += TIME_STEP)
+ rte_htimer_mgr_manage_time(now);
+
+ end_ts = rte_get_tsc_cycles();
+
+ manage_latency = end_ts - start_ts;
+
+ printf(" %.0f TSC cycles / add op\n",
+ (double)add_latency / num_timers);
+ printf(" %.0f TSC cycles / manage call\n",
+ (double)manage_latency / manage_calls);
+ printf(" %.1f TSC cycles / tick\n",
+ (double)manage_latency / timespan);
+
+ rte_htimer_mgr_deinit();
+
+ rte_free(timers);
+}
+
+#define ITERATIONS 500
+
+static int
+test_del_perf(uint64_t num_timers, uint64_t timespan)
+{
+ struct rte_htimer *timers;
+ uint64_t start;
+ uint64_t i, j;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ uint64_t latency = 0;
+
+ rte_htimer_mgr_init(1);
+
+ timers = rte_malloc(NULL, sizeof(struct rte_htimer) * num_timers,
+ 0);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ start = 1 + rte_rand_max(UINT64_MAX / 2);
+
+ for (i = 0; i < ITERATIONS; i++) {
+ rte_htimer_mgr_manage_time(start - 1);
+
+ add_rand_timers(timers, num_timers, start, timespan);
+
+ /* A manage (or process) call is required to get all
+ * timers scheduled, which may in turn make them a
+ * little more expensive to remove.
+ */
+ rte_htimer_mgr_manage_time(start);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (j = 0; j < num_timers; j++)
+ if (rte_htimer_mgr_cancel(&timers[j]) < 0)
+ return TEST_FAILED;
+
+ end_ts = rte_get_tsc_cycles();
+
+ latency += (end_ts - start_ts);
+
+ start += (timespan + 1);
+ }
+
+ printf("Timer delete: %.0f TSC cycles / call\n",
+ (double)latency / (double)ITERATIONS / (double)num_timers);
+
+ rte_htimer_mgr_deinit();
+
+ rte_free(timers);
+
+ return TEST_SUCCESS;
+}
+
+static int
+target_lcore(void *arg)
+{
+ bool *stop = arg;
+
+ while (!__atomic_load_n(stop, __ATOMIC_RELAXED))
+ rte_htimer_mgr_manage();
+
+ return 0;
+}
+
+static void
+count_async_cb(struct rte_htimer *timer __rte_unused, int result,
+ void *cb_arg)
+{
+ unsigned int *count = cb_arg;
+
+ if (result == RTE_HTIMER_MGR_ASYNC_RESULT_ADDED)
+ (*count)++;
+}
+
+static uint64_t
+s_to_tsc(double s)
+{
+ return s * rte_get_tsc_hz();
+}
+
+static uint64_t
+tsc_to_us(uint64_t tsc)
+{
+ return (double)tsc / (double)rte_get_tsc_hz() * 1e6;
+}
+
+#define ASYNC_ADD_TEST_TICK s_to_tsc(500e-9)
+/*
+ * The number of test timers must be kept smaller than the size of the
+ * htimer-internal message ring for this test case to work.
+ */
+#define ASYNC_ADD_TEST_NUM_TIMERS 1000
+#define ASYNC_ADD_TEST_MIN_TIMEOUT (ASYNC_ADD_TEST_NUM_TIMERS * s_to_tsc(1e-6))
+#define ASYNC_ADD_TEST_MAX_TIMEOUT (2 * ASYNC_ADD_TEST_MIN_TIMEOUT)
+
+static void
+test_async_add_perf(void)
+{
+ uint64_t max_timeout = ASYNC_ADD_TEST_MAX_TIMEOUT;
+ uint64_t min_timeout = ASYNC_ADD_TEST_MIN_TIMEOUT;
+ unsigned int num_timers = ASYNC_ADD_TEST_NUM_TIMERS;
+ struct rte_htimer *timers;
+ bool *stop;
+ unsigned int lcore_id = rte_lcore_id();
+ unsigned int target_lcore_id =
+ rte_get_next_lcore(lcore_id, true, true);
+ uint64_t now;
+ uint64_t request_latency = 0;
+ uint64_t response_latency = 0;
+ unsigned int i;
+
+ rte_htimer_mgr_init(ASYNC_ADD_TEST_TICK);
+
+ timers = rte_malloc(NULL, sizeof(struct rte_htimer) * num_timers,
+ RTE_CACHE_LINE_SIZE);
+ stop = rte_malloc(NULL, sizeof(bool), RTE_CACHE_LINE_SIZE);
+
+ if (timers == NULL || stop == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ *stop = false;
+
+ if (rte_eal_remote_launch(target_lcore, stop, target_lcore_id) != 0)
+ rte_panic("Unable to launch worker lcore\n");
+
+ /* wait for launch to complete */
+ rte_delay_us_block(100);
+
+ for (i = 0; i < ITERATIONS; i++) {
+ uint64_t expiration_times[num_timers];
+ unsigned int j;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ unsigned int count = 0;
+
+ now = rte_get_tsc_cycles();
+
+ for (j = 0; j < num_timers; j++)
+ expiration_times[j] = now + min_timeout +
+ rte_rand_max(max_timeout - min_timeout);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (j = 0; j < num_timers; j++)
+ rte_htimer_mgr_async_add(&timers[j], target_lcore_id,
+ expiration_times[j], 0,
+ nop_cb, NULL,
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME,
+ count_async_cb, &count);
+
+ end_ts = rte_get_tsc_cycles();
+
+ request_latency += (end_ts - start_ts);
+
+		/* wait long enough for the target lcore to have answered */
+ rte_delay_us_block(1 * num_timers);
+
+ start_ts = rte_get_tsc_cycles();
+
+ while (count != num_timers)
+ rte_htimer_mgr_process();
+
+ end_ts = rte_get_tsc_cycles();
+
+ response_latency += (end_ts - start_ts);
+
+ /* wait until all timeouts have fired */
+ rte_delay_us_block(tsc_to_us(max_timeout));
+ }
+
+ __atomic_store_n(stop, true, __ATOMIC_RELAXED);
+
+ rte_eal_mp_wait_lcore();
+
+	rte_free(timers);
+	rte_free(stop);
+
+ rte_htimer_mgr_deinit();
+
+ printf("Timer async add:\n");
+ printf(" Configuration:\n");
+ printf(" Timers: %d\n", ASYNC_ADD_TEST_NUM_TIMERS);
+ printf(" Results:\n");
+ printf(" Source lcore cost: %.0f TSC cycles / add request\n",
+ (double)request_latency / (double)ITERATIONS / num_timers);
+ printf(" %.0f TSC cycles / async add "
+ "response\n",
+ (double)response_latency / (double)ITERATIONS / num_timers);
+}
+
+static int
+test_htimer_mgr_perf(void)
+{
+ rte_delay_us_block(10000);
+
+ test_add_manage_perf("Sparse", 100000, 10000000);
+
+ test_add_manage_perf("Dense", 100000, 200000);
+
+ test_add_manage_perf("Idle", 10, 100000);
+
+ test_add_manage_perf("Small", 1000, 100000);
+
+ if (test_del_perf(100000, 100000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ test_async_add_perf();
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(htimer_mgr_perf_autotest, test_htimer_mgr_perf);
diff --git a/app/test/test_htw.c b/app/test/test_htw.c
new file mode 100644
index 0000000000..3cddfaed7f
--- /dev/null
+++ b/app/test/test_htw.c
@@ -0,0 +1,478 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include "test.h"
+
+#include <sys/queue.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_cycles.h>
+#include <rte_htw.h>
+#include <rte_random.h>
+
+struct recorder {
+ struct rte_htimer_list timeout_list;
+ uint64_t num_timeouts;
+};
+
+static void
+recorder_init(struct recorder *recorder)
+{
+ recorder->num_timeouts = 0;
+ LIST_INIT(&recorder->timeout_list);
+}
+
+static void
+recorder_cb(struct rte_htimer *timer, void *arg)
+{
+ struct recorder *recorder = arg;
+
+ recorder->num_timeouts++;
+
+ LIST_INSERT_HEAD(&recorder->timeout_list, timer, entry);
+}
+
+static int
+recorder_verify(struct recorder *recorder, uint64_t min_expiry,
+ uint64_t max_expiry)
+{
+ struct rte_htimer *timer;
+
+ LIST_FOREACH(timer, &recorder->timeout_list, entry) {
+ if (timer->expiration_time > max_expiry)
+ return TEST_FAILED;
+
+ if (timer->expiration_time < min_expiry)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+static void
+add_rand_timers(struct rte_htw *htw, struct rte_htimer *timers,
+ uint64_t num, uint64_t timeout_start, uint64_t max_timeout,
+ rte_htimer_cb_t cb, void *cb_arg)
+{
+ uint64_t i;
+
+ for (i = 0; i < num; i++) {
+ struct rte_htimer *timer = &timers[i];
+ bool use_absolute = rte_rand() & 1;
+ unsigned int flags = 0;
+ uint64_t expiration_time;
+
+ expiration_time = timeout_start + rte_rand_max(max_timeout);
+
+		if (use_absolute) {
+			flags |= RTE_HTIMER_FLAG_ABSOLUTE_TIME;
+		} else {
+ uint64_t htw_current_time;
+
+ htw_current_time = rte_htw_current_time(htw);
+
+ if (expiration_time < htw_current_time)
+ expiration_time = 0;
+ else
+ expiration_time -= htw_current_time;
+ }
+
+ rte_htw_add(htw, timer, expiration_time, 0, cb, cb_arg, flags);
+ }
+}
+
+#define ADVANCE_TIME_MAX_STEP 16
+
+static int
+test_rand_timers(uint64_t in_flight_timers, uint64_t max_timeout,
+ uint64_t runtime)
+{
+ struct recorder recorder;
+ struct rte_htimer *timers;
+ uint64_t fired = 0;
+ uint64_t start;
+ uint64_t now;
+ struct rte_htw *htw;
+ uint64_t added;
+
+ recorder_init(&recorder);
+
+ timers = malloc(sizeof(struct rte_htimer) * in_flight_timers);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate heap memory\n");
+
+ start = rte_rand_max(UINT64_MAX - max_timeout);
+
+ htw = rte_htw_create();
+
+ if (htw == NULL)
+ return TEST_FAILED;
+
+ added = in_flight_timers;
+ add_rand_timers(htw, timers, added, start + 1, max_timeout,
+ recorder_cb, &recorder);
+
+ for (now = start; now < (start + runtime); ) {
+ uint64_t advance;
+
+ advance = rte_rand_max(ADVANCE_TIME_MAX_STEP);
+
+ now += advance;
+
+ rte_htw_manage(htw, now);
+
+ if (recorder.num_timeouts > 0) {
+ struct rte_htimer *timer;
+
+ if (advance == 0)
+ return TEST_FAILED;
+
+ if (recorder_verify(&recorder, now - advance + 1, now)
+ != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ while ((timer = LIST_FIRST(&recorder.timeout_list))
+ != NULL) {
+ LIST_REMOVE(timer, entry);
+
+ add_rand_timers(htw, timer, 1,
+ now + 1, max_timeout,
+ recorder_cb, &recorder);
+ added++;
+ fired++;
+ }
+
+ recorder.num_timeouts = 0;
+ }
+ }
+
+ /* finish the remaining timeouts */
+
+ rte_htw_manage(htw, now + max_timeout);
+
+ if (recorder_verify(&recorder, now, now + max_timeout) != TEST_SUCCESS)
+ return TEST_FAILED;
+ fired += recorder.num_timeouts;
+
+ if (fired != added)
+ return TEST_FAILED;
+
+ rte_htw_destroy(htw);
+
+ free(timers);
+
+ return TEST_SUCCESS;
+}
+
+struct counter_state {
+ int calls;
+ struct rte_htw *htw;
+ bool cancel;
+};
+
+static void
+count_timeouts_cb(struct rte_htimer *timer __rte_unused, void *arg)
+{
+ struct counter_state *state = arg;
+
+ state->calls++;
+
+ if (state->cancel)
+ rte_htw_cancel(state->htw, timer);
+}
+
+static int
+test_single_timeout_type(uint64_t now, uint64_t distance, bool use_absolute)
+{
+ struct rte_htw *htw;
+ struct counter_state cstate = {};
+ struct rte_htimer timer;
+ uint64_t expiration_time;
+ unsigned int flags = 0;
+
+ htw = rte_htw_create();
+
+ rte_htw_manage(htw, now);
+
+ if (use_absolute) {
+ expiration_time = now + distance;
+ flags |= RTE_HTIMER_FLAG_ABSOLUTE_TIME;
+	} else {
+		expiration_time = distance;
+	}
+
+ rte_htw_add(htw, &timer, expiration_time, 0, count_timeouts_cb,
+ &cstate, flags);
+
+ rte_htw_manage(htw, now);
+
+ if (cstate.calls != 0)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + distance - 1);
+
+ if (cstate.calls != 0)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + distance);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + distance);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + distance + 1);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_destroy(htw);
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_single_timeout(uint64_t now, uint64_t distance)
+{
+ int rc;
+
+ rc = test_single_timeout_type(now, distance, true);
+ if (rc < 0)
+ return rc;
+
+ rc = test_single_timeout_type(now, distance, false);
+ if (rc < 0)
+ return rc;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_periodical_timer(uint64_t now, uint64_t start, uint64_t period)
+{
+ struct rte_htw *htw;
+ struct counter_state cstate;
+ struct rte_htimer timer;
+
+ htw = rte_htw_create();
+
+ cstate = (struct counter_state) {
+ .htw = htw
+ };
+
+ rte_htw_manage(htw, now);
+
+ rte_htw_add(htw, &timer, start, period, count_timeouts_cb,
+ &cstate, RTE_HTIMER_FLAG_PERIODICAL);
+
+ rte_htw_manage(htw, now);
+
+ if (cstate.calls != 0)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start - 1);
+
+ if (cstate.calls != 0)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start + 1);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start + period);
+
+ if (cstate.calls != 2)
+ return TEST_FAILED;
+
+ cstate.cancel = true;
+
+ rte_htw_manage(htw, now + start + 2 * period);
+
+ if (cstate.calls != 3)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start + 3 * period);
+
+ if (cstate.calls != 3)
+ return TEST_FAILED;
+
+ rte_htw_destroy(htw);
+
+ return TEST_SUCCESS;
+}
+
+#define CANCEL_ITERATIONS 1000
+#define CANCEL_NUM_TIMERS 1000
+#define CANCEL_MAX_DISTANCE 10000
+
+static int
+test_cancel_timer(void)
+{
+ uint64_t now;
+ struct rte_htw *htw;
+ int i;
+ struct rte_htimer timers[CANCEL_NUM_TIMERS];
+ struct counter_state timeouts[CANCEL_NUM_TIMERS];
+
+ now = rte_rand_max(UINT64_MAX / 2);
+
+ htw = rte_htw_create();
+
+ for (i = 0; i < CANCEL_ITERATIONS; i++) {
+ int j;
+ int target;
+
+ for (j = 0; j < CANCEL_NUM_TIMERS; j++) {
+ struct rte_htimer *timer = &timers[j];
+ uint64_t expiration_time;
+
+ timeouts[j] = (struct counter_state) {};
+
+ expiration_time = now + 1 +
+ rte_rand_max(CANCEL_MAX_DISTANCE);
+
+ rte_htw_add(htw, timer, expiration_time, 0,
+ count_timeouts_cb, &timeouts[j],
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME);
+ }
+
+ target = rte_rand_max(CANCEL_NUM_TIMERS);
+
+ rte_htw_cancel(htw, &timers[target]);
+
+ now += CANCEL_MAX_DISTANCE;
+
+ rte_htw_manage(htw, now);
+
+ for (j = 0; j < CANCEL_NUM_TIMERS; j++) {
+ if (j != target) {
+ if (timeouts[j].calls != 1)
+ return TEST_FAILED;
+ } else {
+ if (timeouts[j].calls > 0)
+ return TEST_FAILED;
+ }
+ }
+ }
+
+ rte_htw_destroy(htw);
+
+ return TEST_SUCCESS;
+}
+
+static void
+nop_cb(struct rte_htimer *timer __rte_unused, void *arg __rte_unused)
+{
+}
+
+#define NEXT_NUM_TIMERS 1000
+#define NEXT_MAX_DISTANCE 10000
+
+static int
+test_next_timeout(void)
+{
+ uint64_t now;
+ struct rte_htw *htw;
+ int i;
+ struct rte_htimer timers[NEXT_NUM_TIMERS];
+ uint64_t last_expiration;
+
+ now = rte_rand_max(NEXT_MAX_DISTANCE);
+
+ htw = rte_htw_create();
+
+ if (rte_htw_next_timeout(htw, UINT64_MAX) != UINT64_MAX)
+ return TEST_FAILED;
+ if (rte_htw_next_timeout(htw, now + 1) != (now + 1))
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now);
+
+ last_expiration = now + NEXT_MAX_DISTANCE * NEXT_NUM_TIMERS;
+
+ for (i = 0; i < NEXT_NUM_TIMERS; i++) {
+ struct rte_htimer *timer = &timers[i];
+ uint64_t expiration;
+ uint64_t upper_bound;
+
+ /* add timers, each new one closer than the last */
+
+ expiration = last_expiration - rte_rand_max(NEXT_MAX_DISTANCE);
+
+ rte_htw_add(htw, timer, expiration, 0, nop_cb, NULL,
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME);
+
+ if (rte_htw_next_timeout(htw, UINT64_MAX) != expiration)
+ return TEST_FAILED;
+
+ upper_bound = expiration + rte_rand_max(100000);
+
+ if (rte_htw_next_timeout(htw, upper_bound) != expiration)
+ return TEST_FAILED;
+
+ upper_bound = expiration - rte_rand_max(expiration);
+
+ if (rte_htw_next_timeout(htw, upper_bound) != upper_bound)
+ return TEST_FAILED;
+
+ last_expiration = expiration;
+ }
+
+ rte_htw_destroy(htw);
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_htw(void)
+{
+ if (test_single_timeout(0, 10) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_single_timeout(0, 254) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_single_timeout(0, 255) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_single_timeout(255, 1) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_single_timeout(254, 2) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_periodical_timer(10000, 500, 2) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_periodical_timer(1234567, 12345, 100000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_cancel_timer() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_rand_timers(1000, 100000, 100000000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_rand_timers(100000, 100000, 1000000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_next_timeout() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(htw_autotest, test_htw);
diff --git a/app/test/test_htw_perf.c b/app/test/test_htw_perf.c
new file mode 100644
index 0000000000..65901f0874
--- /dev/null
+++ b/app/test/test_htw_perf.c
@@ -0,0 +1,181 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include "test.h"
+
+#include <sys/queue.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_cycles.h>
+#include <rte_htw.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+static void
+nop_cb(struct rte_htimer *timer __rte_unused, void *arg __rte_unused)
+{
+}
+
+static void
+add_rand_timers(struct rte_htw *htw, struct rte_htimer *timers,
+ uint64_t num, uint64_t timeout_start, uint64_t max_timeout)
+{
+ uint64_t i;
+ uint64_t expiration_times[num];
+ uint64_t start_ts;
+ uint64_t end_ts;
+
+ for (i = 0; i < num; i++)
+ expiration_times[i] = timeout_start + rte_rand_max(max_timeout);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (i = 0; i < num; i++) {
+ struct rte_htimer *timer = &timers[i];
+
+ rte_htw_add(htw, timer, expiration_times[i], 0, nop_cb, NULL,
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME);
+ }
+
+ /* actually install the timers */
+ rte_htw_process(htw);
+
+ end_ts = rte_get_tsc_cycles();
+
+ printf(" %.0f TSC cycles / add op\n",
+ (double)(end_ts - start_ts) / num);
+}
+
+#define TIME_STEP 16
+
+static int
+test_add_manage_perf(const char *scenario_name, uint64_t num_timers,
+ uint64_t timespan)
+{
+ uint64_t manage_calls;
+ struct rte_htimer *timers;
+ uint64_t start;
+ uint64_t now;
+ struct rte_htw *htw;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ double latency;
+
+ manage_calls = timespan / TIME_STEP;
+
+ printf("Scenario: %s\n", scenario_name);
+ printf(" Configuration:\n");
+ printf(" Timers: %"PRIu64"\n", num_timers);
+ printf(" Max timeout: %"PRIu64" ticks\n", timespan);
+ printf(" Average timeouts/manage call: %.3f\n",
+ num_timers / (double)manage_calls);
+ printf(" Time advance per manage call: %d\n", TIME_STEP);
+
+ printf(" Results:\n");
+
+ timers = rte_malloc(NULL, sizeof(struct rte_htimer) *
+ num_timers, 0);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ htw = rte_htw_create();
+
+ if (htw == NULL)
+ return TEST_FAILED;
+
+ start = 1 + rte_rand_max(UINT64_MAX / 2);
+
+ rte_htw_manage(htw, start - 1);
+
+ add_rand_timers(htw, timers, num_timers, start, timespan);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (now = start; now < (start + timespan); now += TIME_STEP)
+ rte_htw_manage(htw, now);
+
+ end_ts = rte_get_tsc_cycles();
+
+ latency = end_ts - start_ts;
+
+ printf(" %.0f TSC cycles / manage call\n",
+ latency / manage_calls);
+ printf(" %.1f TSC cycles / tick\n", latency / timespan);
+
+ rte_htw_destroy(htw);
+
+ rte_free(timers);
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_cancel_perf(uint64_t num_timers, uint64_t timespan)
+{
+ struct rte_htimer *timers;
+ uint64_t start;
+ struct rte_htw *htw;
+ uint64_t i;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ double latency;
+
+ timers = rte_malloc(NULL, sizeof(struct rte_htimer) * num_timers, 0);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ htw = rte_htw_create();
+
+ if (htw == NULL)
+ return TEST_FAILED;
+
+ start = 1 + rte_rand_max(UINT64_MAX / 2);
+
+ rte_htw_manage(htw, start - 1);
+
+ add_rand_timers(htw, timers, num_timers, start, timespan);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (i = 0; i < num_timers; i++)
+ rte_htw_cancel(htw, &timers[i]);
+
+ end_ts = rte_get_tsc_cycles();
+
+ latency = end_ts - start_ts;
+
+ printf("Timer delete: %.0f TSC cycles / call\n",
+ latency / num_timers);
+
+ rte_htw_destroy(htw);
+
+ rte_free(timers);
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_htw_perf(void)
+{
+ rte_delay_us_block(100);
+
+ if (test_add_manage_perf("Sparse", 100000, 10000000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_add_manage_perf("Dense", 100000, 200000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_add_manage_perf("Idle", 10, 100000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_cancel_perf(100000, 100000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(htw_perf_autotest, test_htw_perf);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 2deec7ea19..5ea1dfa262 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -67,6 +67,8 @@ The public API headers are grouped by topics:
- **timers**:
[cycles](@ref rte_cycles.h),
[timer](@ref rte_timer.h),
+ [htimer_mgr](@ref rte_htimer_mgr.h),
+ [htimer](@ref rte_htimer.h),
[alarm](@ref rte_alarm.h)
- **locks**:
@@ -163,7 +165,8 @@ The public API headers are grouped by topics:
[ring](@ref rte_ring.h),
[stack](@ref rte_stack.h),
[tailq](@ref rte_tailq.h),
- [bitmap](@ref rte_bitmap.h)
+ [bitmap](@ref rte_bitmap.h),
+ [bitset](@ref rte_bitset.h)
- **packet framework**:
* [port](@ref rte_port.h):
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index e859426099..c0cd64db34 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -45,6 +45,7 @@ INPUT = @TOPDIR@/doc/api/doxy-api-index.md \
@TOPDIR@/lib/gro \
@TOPDIR@/lib/gso \
@TOPDIR@/lib/hash \
+ @TOPDIR@/lib/htimer \
@TOPDIR@/lib/ip_frag \
@TOPDIR@/lib/ipsec \
@TOPDIR@/lib/jobstats \
diff --git a/lib/htimer/meson.build b/lib/htimer/meson.build
new file mode 100644
index 0000000000..2dd5d6a24b
--- /dev/null
+++ b/lib/htimer/meson.build
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2023 Ericsson AB
+
+sources = files('rte_htw.c', 'rte_htimer_msg_ring.c', 'rte_htimer_mgr.c')
+headers = files('rte_htimer_mgr.h', 'rte_htimer.h')
+
+deps += ['ring']
diff --git a/lib/htimer/rte_htimer.h b/lib/htimer/rte_htimer.h
new file mode 100644
index 0000000000..e245b30c65
--- /dev/null
+++ b/lib/htimer/rte_htimer.h
@@ -0,0 +1,65 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTIMER_H_
+#define _RTE_HTIMER_H_
+
+#include <stdbool.h>
+#include <stdint.h>
+#include <sys/queue.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+struct rte_htimer;
+
+typedef void (*rte_htimer_cb_t)(struct rte_htimer *, void *);
+
+struct rte_htimer {
+ /**
+ * Absolute timer expiration time (in ticks).
+ */
+ uint64_t expiration_time;
+ /**
+ * Time between expirations (in ticks). Zero for one-shot timers.
+ */
+ uint64_t period;
+	/**
+	 * The id of the lcore owning this timer. May safely be read
+	 * from any thread.
+	 */
+ uint32_t owner_lcore_id;
+ /**
+ * The current state of the timer.
+ */
+ uint32_t state:4;
+ /**
+ * Flags set on this timer.
+ */
+ uint32_t flags:28;
+ /**
+ * User-specified callback function pointer.
+ */
+ rte_htimer_cb_t cb;
+ /**
+ * Argument for user callback.
+ */
+ void *cb_arg;
+ /**
+ * Pointers used to add timer to various internal lists.
+ */
+ LIST_ENTRY(rte_htimer) entry;
+};
+
+#define RTE_HTIMER_FLAG_ABSOLUTE_TIME (UINT32_C(1) << 0)
+#define RTE_HTIMER_FLAG_PERIODICAL (UINT32_C(1) << 1)
+
+#define RTE_HTIMER_STATE_PENDING 1
+#define RTE_HTIMER_STATE_EXPIRED 2
+#define RTE_HTIMER_STATE_CANCELED 3
+
+LIST_HEAD(rte_htimer_list, rte_htimer);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_HTIMER_H_ */
diff --git a/lib/htimer/rte_htimer_mgr.c b/lib/htimer/rte_htimer_mgr.c
new file mode 100644
index 0000000000..7bb1630680
--- /dev/null
+++ b/lib/htimer/rte_htimer_mgr.c
@@ -0,0 +1,488 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include <stdbool.h>
+#include <sys/queue.h>
+#include <inttypes.h>
+#include <unistd.h>
+
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_errno.h>
+#include <rte_htw.h>
+#include <rte_prefetch.h>
+#include <rte_ring_elem.h>
+
+#include "rte_htimer_mgr.h"
+#include "rte_htimer_msg.h"
+#include "rte_htimer_msg_ring.h"
+
+#define MAX_MSG_BATCH_SIZE 16
+
+struct htimer_mgr {
+ struct rte_htimer_msg_ring *msg_ring;
+ struct rte_htw *htw;
+
+ unsigned int async_msgs_idx __rte_cache_aligned;
+ unsigned int num_async_msgs;
+ struct rte_htimer_msg async_msgs[MAX_MSG_BATCH_SIZE];
+} __rte_cache_aligned;
+
+static uint64_t tsc_per_tick;
+
+static struct htimer_mgr mgrs[RTE_MAX_LCORE + 1];
+
+#define MAX_ASYNC_TRANSACTIONS 1024
+#define MSG_RING_SIZE MAX_ASYNC_TRANSACTIONS
+
+static inline uint64_t
+tsc_to_tick(uint64_t tsc)
+{
+ return tsc / tsc_per_tick;
+}
+
+static inline uint64_t
+tsc_to_tick_round_up(uint64_t tsc)
+{
+ uint64_t tick;
+ uint64_t remainder;
+
+ tick = tsc / tsc_per_tick;
+ remainder = tsc % tsc_per_tick;
+
+ if (likely(remainder > 0))
+ tick++;
+
+ return tick;
+}
+
+static uint64_t
+tick_to_tsc(uint64_t tick)
+{
+ return tick * tsc_per_tick;
+}
+
+static struct htimer_mgr *
+mgr_get(unsigned int lcore_id)
+{
+ return &mgrs[lcore_id];
+}
+
+static int
+mgr_init(unsigned int lcore_id)
+{
+ char ring_name[RTE_RING_NAMESIZE];
+ unsigned int socket_id;
+ struct htimer_mgr *mgr = &mgrs[lcore_id];
+
+ socket_id = rte_lcore_to_socket_id(lcore_id);
+
+ snprintf(ring_name, sizeof(ring_name), "htimer_%d", lcore_id);
+
+ mgr->msg_ring =
+ rte_htimer_msg_ring_create(ring_name, MSG_RING_SIZE, socket_id,
+ RING_F_SC_DEQ);
+
+ if (mgr->msg_ring == NULL)
+ goto err;
+
+ mgr->htw = rte_htw_create();
+
+ if (mgr->htw == NULL)
+ goto err_free_ring;
+
+ mgr->async_msgs_idx = 0;
+ mgr->num_async_msgs = 0;
+
+ return 0;
+
+err_free_ring:
+ rte_htimer_msg_ring_free(mgr->msg_ring);
+err:
+ return -ENOMEM;
+}
+
+static void
+mgr_deinit(unsigned int lcore_id)
+{
+ struct htimer_mgr *mgr = &mgrs[lcore_id];
+
+ rte_htw_destroy(mgr->htw);
+
+ rte_htimer_msg_ring_free(mgr->msg_ring);
+}
+
+static volatile bool initialized;
+
+static void
+assure_initialized(void)
+{
+ RTE_ASSERT(initialized);
+}
+
+int
+rte_htimer_mgr_init(uint64_t _tsc_per_tick)
+{
+ unsigned int lcore_id;
+
+ RTE_VERIFY(!initialized);
+
+ tsc_per_tick = _tsc_per_tick;
+
+ for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+ int rc;
+
+ rc = mgr_init(lcore_id);
+
+ if (rc < 0) {
+ unsigned int deinit_lcore_id;
+
+ for (deinit_lcore_id = 0; deinit_lcore_id < lcore_id;
+ deinit_lcore_id++)
+ mgr_deinit(deinit_lcore_id);
+
+ return rc;
+ }
+ }
+
+ initialized = true;
+
+ return 0;
+}
+
+void
+rte_htimer_mgr_deinit(void)
+{
+ unsigned int lcore_id;
+
+ assure_initialized();
+
+ for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
+ mgr_deinit(lcore_id);
+
+ initialized = false;
+}
+
+void
+rte_htimer_mgr_add(struct rte_htimer *timer, uint64_t expiration_time_tsc,
+ uint64_t period_tsc, rte_htimer_cb_t timer_cb,
+ void *timer_cb_arg, uint32_t flags)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+ uint64_t expiration_time_tick =
+ tsc_to_tick_round_up(expiration_time_tsc);
+ uint64_t period_tick =
+ tsc_to_tick_round_up(period_tsc);
+
+ assure_initialized();
+
+ rte_htw_add(mgr->htw, timer, expiration_time_tick, period_tick,
+ timer_cb, timer_cb_arg, flags);
+
+ timer->owner_lcore_id = lcore_id;
+}
+
+int
+rte_htimer_mgr_cancel(struct rte_htimer *timer)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+
+ assure_initialized();
+
+ RTE_ASSERT(timer->owner_lcore_id == lcore_id);
+
+ switch (timer->state) {
+ case RTE_HTIMER_STATE_PENDING:
+ rte_htw_cancel(mgr->htw, timer);
+ return 0;
+ case RTE_HTIMER_STATE_EXPIRED:
+ return -ETIME;
+ default:
+ RTE_ASSERT(timer->state == RTE_HTIMER_STATE_CANCELED);
+ return -ENOENT;
+ }
+}
+
+static int
+send_msg(unsigned int receiver_lcore_id, enum rte_htimer_msg_type msg_type,
+ struct rte_htimer *timer, rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg, const struct rte_htimer_msg_request *request,
+ const struct rte_htimer_msg_response *response)
+{
+ struct htimer_mgr *receiver_mgr;
+ struct rte_htimer_msg_ring *receiver_ring;
+ struct rte_htimer_msg msg = (struct rte_htimer_msg) {
+ .msg_type = msg_type,
+ .timer = timer,
+ .async_cb = async_cb,
+ .async_cb_arg = async_cb_arg
+ };
+ int rc;
+
+ if (request != NULL)
+ msg.request = *request;
+ else
+ msg.response = *response;
+
+ receiver_mgr = mgr_get(receiver_lcore_id);
+
+ receiver_ring = receiver_mgr->msg_ring;
+
+ rc = rte_htimer_msg_ring_enqueue(receiver_ring, &msg);
+
+ return rc;
+}
+
+static int
+send_request(unsigned int receiver_lcore_id, enum rte_htimer_msg_type msg_type,
+ struct rte_htimer *timer,
+ rte_htimer_mgr_async_op_cb_t async_cb, void *async_cb_arg)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct rte_htimer_msg_request request = {
+ .source_lcore_id = lcore_id
+ };
+
+ return send_msg(receiver_lcore_id, msg_type, timer, async_cb,
+ async_cb_arg, &request, NULL);
+}
+
+static int
+send_response(unsigned int receiver_lcore_id, enum rte_htimer_msg_type msg_type,
+ struct rte_htimer *timer,
+ rte_htimer_mgr_async_op_cb_t async_cb, void *async_cb_arg,
+ int result)
+{
+ struct rte_htimer_msg_response response = {
+ .result = result
+ };
+
+ return send_msg(receiver_lcore_id, msg_type, timer, async_cb,
+ async_cb_arg, NULL, &response);
+}
+
+int
+rte_htimer_mgr_async_add(struct rte_htimer *timer,
+ unsigned int target_lcore_id,
+ uint64_t expiration_time, uint64_t period,
+ rte_htimer_cb_t timer_cb, void *timer_cb_arg,
+ uint32_t flags,
+ rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg)
+{
+ assure_initialized();
+
+ *timer = (struct rte_htimer) {
+ .expiration_time = expiration_time,
+ .period = period,
+ .owner_lcore_id = target_lcore_id,
+ .flags = flags,
+ .cb = timer_cb,
+ .cb_arg = timer_cb_arg
+ };
+
+ if (send_request(target_lcore_id, rte_htimer_msg_type_add_request,
+ timer, async_cb, async_cb_arg) < 0)
+ return -EBUSY;
+
+ return 0;
+}
+
+int
+rte_htimer_mgr_async_cancel(struct rte_htimer *timer,
+ rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg)
+{
+ if (send_request(timer->owner_lcore_id,
+ rte_htimer_msg_type_cancel_request,
+ timer, async_cb, async_cb_arg) < 0)
+ return -EBUSY;
+
+ return 0;
+}
+
+static int
+process_add_request(struct rte_htimer_msg *request)
+{
+ struct rte_htimer *timer = request->timer;
+
+ if (request->async_cb != NULL &&
+ send_response(request->request.source_lcore_id,
+ rte_htimer_msg_type_add_response, timer,
+ request->async_cb, request->async_cb_arg,
+ RTE_HTIMER_MGR_ASYNC_RESULT_ADDED) < 0)
+ return -EBUSY;
+
+ rte_htimer_mgr_add(timer, timer->expiration_time, timer->period,
+ timer->cb, timer->cb_arg, timer->flags);
+
+ return 0;
+}
+
+static int
+process_cancel_request(struct rte_htimer_msg *request)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+ struct rte_htimer *timer = request->timer;
+ int result;
+
+ switch (timer->state) {
+ case RTE_HTIMER_STATE_PENDING:
+ result = RTE_HTIMER_MGR_ASYNC_RESULT_CANCELED;
+ break;
+ case RTE_HTIMER_STATE_CANCELED:
+ result = RTE_HTIMER_MGR_ASYNC_RESULT_ALREADY_CANCELED;
+ break;
+ case RTE_HTIMER_STATE_EXPIRED:
+ result = RTE_HTIMER_MGR_ASYNC_RESULT_EXPIRED;
+ break;
+ default:
+ RTE_ASSERT(0);
+ result = -1;
+ }
+
+ if (request->async_cb != NULL &&
+ send_response(request->request.source_lcore_id,
+ rte_htimer_msg_type_cancel_response, timer,
+ request->async_cb, request->async_cb_arg,
+ result) < 0)
+ return -EBUSY;
+
+ if (timer->state == RTE_HTIMER_STATE_PENDING)
+ rte_htw_cancel(mgr->htw, timer);
+
+ return 0;
+}
+
+static int
+process_response(struct rte_htimer_msg *msg)
+{
+ struct rte_htimer_msg_response *response = &msg->response;
+
+ if (msg->async_cb != NULL)
+ msg->async_cb(msg->timer, response->result, msg->async_cb_arg);
+
+ return 0;
+}
+
+static int
+process_msg(struct rte_htimer_msg *msg)
+{
+ switch (msg->msg_type) {
+ case rte_htimer_msg_type_add_request:
+ return process_add_request(msg);
+ case rte_htimer_msg_type_cancel_request:
+ return process_cancel_request(msg);
+ case rte_htimer_msg_type_add_response:
+ case rte_htimer_msg_type_cancel_response:
+ return process_response(msg);
+ default:
+ RTE_ASSERT(0);
+ return -EBUSY;
+ }
+}
+
+static void
+dequeue_async_msgs(struct htimer_mgr *mgr)
+{
+ if (mgr->num_async_msgs == 0) {
+ unsigned int i;
+
+ mgr->async_msgs_idx = 0;
+
+ mgr->num_async_msgs =
+ rte_htimer_msg_ring_dequeue_burst(mgr->msg_ring,
+ mgr->async_msgs,
+ MAX_MSG_BATCH_SIZE);
+
+ for (i = 0; i < mgr->num_async_msgs; i++)
+ rte_prefetch1(mgr->async_msgs[i].timer);
+ }
+}
+
+static void
+process_async(struct htimer_mgr *mgr)
+{
+ for (;;) {
+ struct rte_htimer_msg *msg;
+
+ dequeue_async_msgs(mgr);
+
+ if (mgr->num_async_msgs == 0)
+ break;
+
+ msg = &mgr->async_msgs[mgr->async_msgs_idx];
+
+ if (process_msg(msg) < 0)
+ break;
+
+ mgr->num_async_msgs--;
+ mgr->async_msgs_idx++;
+ }
+}
+
+void
+rte_htimer_mgr_manage_time(uint64_t current_time)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+ uint64_t current_tick;
+
+ assure_initialized();
+
+ process_async(mgr);
+
+ current_tick = tsc_to_tick(current_time);
+
+ rte_htw_manage(mgr->htw, current_tick);
+}
+
+void
+rte_htimer_mgr_manage(void)
+{
+ uint64_t current_time;
+
+ assure_initialized();
+
+ current_time = rte_get_tsc_cycles();
+
+ rte_htimer_mgr_manage_time(current_time);
+}
+
+void
+rte_htimer_mgr_process(void)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+
+ assure_initialized();
+
+ process_async(mgr);
+
+ rte_htw_process(mgr->htw);
+}
+
+uint64_t
+rte_htimer_mgr_current_time(void)
+{
+ uint64_t current_tick;
+
+ current_tick = rte_htimer_mgr_current_tick();
+
+ return tick_to_tsc(current_tick);
+}
+
+uint64_t
+rte_htimer_mgr_current_tick(void)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+ uint64_t current_tick;
+
+ current_tick = rte_htw_current_time(mgr->htw);
+
+ return current_tick;
+}
diff --git a/lib/htimer/rte_htimer_mgr.h b/lib/htimer/rte_htimer_mgr.h
new file mode 100644
index 0000000000..1fbd69dbf6
--- /dev/null
+++ b/lib/htimer/rte_htimer_mgr.h
@@ -0,0 +1,497 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTIMER_MGR_H_
+#define _RTE_HTIMER_MGR_H_
+
+/**
+ * @file
+ *
+ * RTE High-performance Timer Manager
+ *
+ * The high-performance timer manager (htimer_mgr) API provides access
+ * to a low-overhead, scalable timer service.
+ *
+ * The functionality offered is similar to that of <rte_timer.h>, but
+ * the internals differ significantly, and there are slight
+ * differences in the programming interface as well.
+ *
+ * Core timer management is implemented by means of a hierarchical
+ * timer wheel (HWT), as per the Varghese and Lauck paper <em>Hashed
+ * and Hierarchical Timing Wheels: Data Structures for the Efficient
+ * Implementation of a Timer Facility</em>.
+ *
+ * Varghese and Lauck's approach is further enhanced by the placement of a
+ * bitset in front of each wheel's slots. Each slot has a
+ * corresponding bit in the bitset. If a bit is clear, there are no
+ * pending timers scheduled for that slot. A set bit means there
+ * potentially are timers scheduled for that slot. This scheme reduces
+ * the overhead of the rte_htimer_mgr_manage() function, where slots
+ * of one or more of the wheels of the thread's HWT are scanned if
+ * time has progressed since the last call. This improves performance
+ * in all cases, except for very densely populated timer wheels.
+ *
+ * One such HWT is instantiated for each lcore (EAL thread), and
+ * instances are also available for registered non-EAL threads.
+ *
+ * The <rte_htimer_mgr.h> API may not be called from unregistered
+ * non-EAL threads.
+ *
+ * The per-lcore-id HWT instance is private to that thread.
+ *
+ * The htimer API supports scheduling timers to a different thread
+ * (and thus, a different HWT) than the caller's. It is also possible
+ * to cancel timers managed by a "remote" timer wheel.
+ *
+ * All interaction with a remote HWT (i.e., adding timers to it, or
+ * removing timers from it) is done by sending a request, in the form
+ * of a message on a DPDK ring, to that instance. Such requests are
+ * processed and, if
+ * required, acknowledged when the remote (target) thread calls
+ * rte_htimer_mgr_manage(), rte_htimer_mgr_manage_time() or
+ * rte_htimer_mgr_process().
+ *
+ * This message-based interaction avoids comparatively heavy-weight
+ * synchronization primitives such as spinlocks. Only release-acquire
+ * type synchronization on the rings is needed.
+ *
+ * Timer memory management is the responsibility of the
+ * application. After library-level initialization has completed, no
+ * more dynamic memory is allocated by the htimer library. When
+ * installing timers on remote lcores, care must be taken by the
+ * application to avoid race conditions, in particular use-after-free
+ * (or use-after-recycle) issues of the rte_htimer structure. A timer
+ * struct may only be deallocated and/or recycled if the application
+ * can guarantee that there are no cancel requests in flight.
+ *
+ * The htimer library is able to give a definitive answer to the
+ * question of whether or not a remote timer had expired at the time
+ * of cancellation.
+ *
+ * The htimer library uses TSC as the default time source. A different
+ * time source may be used, in which case the application must
+ * explicitly provide the time using rte_htimer_mgr_manage_time().
+ * This function may also be used even if TSC is the time source, in
+ * cases where the application for some other purpose already is in
+ * possession of the current TSC time, avoiding the overhead of the
+ * `rdtsc` instruction (or its equivalent on non-x86 ISAs).
+ *
+ * The htimer supports periodic and single-shot timers.
+ *
+ * The timer tick defines a quantum of time in the htimer library. The
+ * length of a tick (quantified in TSC) is left to the application to
+ * specify. The core HWT implementation allows for all 64 bits to be
+ * used.
+ *
+ * Very fine-grained ticks increase the HWT overhead (since more slots
+ * need to be scanned). Long ticks only allow for very coarse-grained
+ * timers, and in timer-heavy applications may cause load spikes when
+ * time advances into a new tick.
+ *
+ * A reasonable timer tick length is likely somewhere in the range
+ * between 100 ns and 100 us (or maybe up to as high as 1 ms),
+ * depending on the application.
+ */
+
+#include <stdint.h>
+
+#include <rte_common.h>
+#include <rte_htimer.h>
+
+/**
+ * The timer has been added to the timer manager on the target lcore.
+ */
+#define RTE_HTIMER_MGR_ASYNC_RESULT_ADDED 1
+
+/**
+ * The timer cancellation request has completed, before the timer expired
+ * on the target lcore.
+ */
+#define RTE_HTIMER_MGR_ASYNC_RESULT_CANCELED 2
+
+/**
+ * The timer cancellation request was denied, since the timer was
+ * already marked as canceled.
+ */
+#define RTE_HTIMER_MGR_ASYNC_RESULT_ALREADY_CANCELED 3
+
+/**
+ * At the time of the cancellation request process on the target
+ * lcore, the timer had already expired.
+ */
+#define RTE_HTIMER_MGR_ASYNC_RESULT_EXPIRED 4
+
+typedef void (*rte_htimer_mgr_async_op_cb_t)(struct rte_htimer *timer,
+ int result, void *cb_arg);
+
+/**
+ * Initialize the htimer library.
+ *
+ * Instantiates per-lcore (or per-registered non-EAL thread) timer
+ * wheels and other htimer library data structures, for all current
+ * and future threads.
+ *
+ * This function must be called prior to any other <rte_htimer.h> API
+ * call.
+ *
+ * This function may not be called if the htimer library is already
+ * initialized, but may be called multiple times, provided the library
+ * is deinitialized in between rte_htimer_mgr_init() calls.
+ *
+ * For applications not using TSC as the time source, the \c tsc_per_tick
+ * parameter will denote the number of such application time-source-units
+ * per tick.
+ *
+ * This function is not multi-thread safe.
+ *
+ * @param tsc_per_tick
+ * The length (in TSC) of a HWT tick.
+ *
+ * @return
+ * - 0: Success
+ * - -ENOMEM: Unable to allocate memory needed to initialize timer
+ * subsystem
+ *
+ * @see rte_htimer_mgr_deinit()
+ * @see rte_get_tsc_hz()
+ */
+
+__rte_experimental
+int
+rte_htimer_mgr_init(uint64_t tsc_per_tick);
+
+/**
+ * Deinitialize the htimer library.
+ *
+ * This function deallocates all dynamic memory used by the library,
+ * including HWT instances used by other threads than the caller.
+ *
+ * After this call has been made, no <rte_htimer.h> API call may be
+ * made, except rte_htimer_mgr_init().
+ *
+ * This function may not be called if the htimer library has never
+ * been initialized, or has been deinitialized but not yet initialized
+ * again.
+ *
+ * This function is not multi-thread safe. In particular, no thread
+ * may call any <rte_htimer.h> functions (e.g., rte_htimer_mgr_manage())
+ * during (or after) the htimer library is deinitialized, except if it
+ * is initialized again.
+ *
+ * @see rte_htimer_mgr_init()
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_deinit(void);
+
+/**
+ * Adds a timer to the calling thread's timer wheel.
+ *
+ * This function schedules a timer on the calling thread's HWT.
+ *
+ * The \c timer_cb callback is called at a point when this thread
+ * calls rte_htimer_mgr_process(), rte_htimer_mgr_manage(), or
+ * rte_htimer_mgr_manage_time() and the expiration time has passed the
+ * current time (either as retrieved by rte_htimer_mgr_manage() or
+ * specified by the application in rte_htimer_mgr_manage_time().
+ *
+ * The HWT tracks time in units of \c ticks, which are likely more
+ * coarse-grained than the TSC resolution.
+ *
+ * The \c expiration_time is specified in units of TSC, and rounded up
+ * to the nearest tick. Thus, a timer with a certain expiration time
+ * (specified in TSC) may not expire even though that time was
+ * supplied in rte_htimer_mgr_manage_time(). The
+ * maximum error is the length of one tick (not including any delays
+ * caused by infrequent manage calls).
+ *
+ * This timer may be canceled using rte_htimer_mgr_cancel() or
+ * rte_htimer_mgr_async_cancel().
+ *
+ * rte_htimer_mgr_add() is multi-thread safe, and may only be called
+ * from an EAL thread or a registered non-EAL thread.
+ *
+ * @param timer
+ * The chunk of memory used for managing this timer. This memory
+ * must not be read or written (or free'd) by the application until
+ * this timer has expired, or any cancellation attempts have
+ * completed.
+ * @param expiration_time
+ * The expiration time (measured in TSC). For periodical timers,
+ * this time represents the first expiration time.
+ * @param period
+ * The time in between periodic timer expirations (measured in TSC).
+ * Must be set to zero unless the RTE_HTIMER_FLAG_PERIODICAL flag is set,
+ * in which case it must be a positive integer.
+ * @param timer_cb
+ * The timer callback to be called upon timer expiration.
+ * @param timer_cb_arg
+ * A pointer which will be supplied back to the application in the
+ * timer callback call.
+ * @param flags
+ * RTE_HTIMER_FLAG_ABSOLUTE_TIME and/or RTE_HTIMER_FLAG_PERIODICAL.
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_add(struct rte_htimer *timer, uint64_t expiration_time,
+ uint64_t period, rte_htimer_cb_t timer_cb,
+ void *timer_cb_arg, uint32_t flags);
+
+/**
+ * Cancel a timer scheduled in the calling thread's timer wheel.
+ *
+ * This function cancels a timer scheduled on the calling thread's HWT.
+ *
+ * rte_htimer_mgr_cancel() may be called on a timer which has already
+ * (synchronously or asynchronously) been canceled, or may have expired.
+ * However, the \c rte_htimer struct pointed to by \c timer may not
+ * have been freed or recycled since.
+ *
+ * rte_htimer_mgr_cancel() may not be called for a timer that was
+ * never (or, not yet) added.
+ *
+ * A timer added using rte_htimer_mgr_async_add() may not be
+ * canceled using this function until after the add operation has
+ * completed (i.e., the completion callback has been run).
+ *
+ * rte_htimer_mgr_cancel() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ *
+ * @param timer
+ * The timer to be canceled.
+ * @return
+ * - 0: Success
+ * - -ETIME: Timer has expired, and thus could not be canceled.
+ * - -ENOENT: Timer was already canceled.
+ */
+
+__rte_experimental
+int
+rte_htimer_mgr_cancel(struct rte_htimer *timer);
+
+/**
+ * Asynchronously add a timer to the specified lcore's timer wheel.
+ *
+ * This function is the equivalent of rte_htimer_mgr_add(), but allows
+ * the calling ("source") thread to scheduled a timer in a HWT other
+ * than it's own. The operation is asynchronous.
+ *
+ * The timer works the same as a timer added locally. Thus, the \c
+ * timer_cb callback is called by the target thread, and it may be
+ * canceled using rte_htimer_mgr_cancel().
+ *
+ * The source thread may be the same as the target thread.
+ *
+ * Only EAL threads or registered non-EAL thread may be targeted.
+ *
+ * A successful rte_htimer_mgr_async_add() call guarantees that the
+ * timer will be scheduled on the target lcore at some future time,
+ * provided the target thread calls rte_htimer_mgr_process(),
+ * rte_htimer_mgr_manage(), and/or rte_htimer_mgr_manage_time().
+ *
+ * The \c async_cb callback is called on the source thread as a part
+ * of its rte_htimer_mgr_process(), rte_htimer_mgr_manage(), or
+ * rte_htimer_mgr_manage_time() call, when the asynchronous add
+ * operation has completed (i.e., the timer is scheduled in the target
+ * HWT).
+ *
+ * \c async_cb may be NULL, in which case no notification is given.
+ *
+ * An asynchronously added timer may be asynchronously canceled (i.e.,
+ * using rte_htimer_mgr_async_cancel()) at any point, by any thread,
+ * after the rte_htimer_mgr_async_add() call. An asynchronously added
+ * timer may not be canceled using rte_htimer_mgr_cancel() until
+ * after the completion callback has been executed.
+ *
+ * rte_htimer_mgr_async_add() is multi-thread safe, and may only be called
+ * from an EAL thread or a registered non-EAL thread.
+ *
+ * @param timer
+ * The chunk of memory used for managing this timer. This memory
+ * must not be read or written (or free'd) by the application until
+ * this timer has expired, or any cancellation attempts have
+ * completed.
+ * @param target_lcore_id
+ * The lcore id of the thread whose HWT will manage this timer.
+ * @param expiration_time
+ * The expiration time (measured in TSC). For periodical timers,
+ * this time represents the first expiration time.
+ * @param period
+ * The time in between periodic timer expirations (measured in TSC).
+ * Must be set to zero unless the RTE_HTIMER_FLAG_PERIODICAL flag is set,
+ * in which case it must be a positive integer.
+ * @param timer_cb
+ * The timer callback to be called upon timer expiration.
+ * @param timer_cb_arg
+ * A pointer which will be supplied back to the application in the
+ * timer callback call.
+ * @param async_cb
+ * The asynchronous operation callback to be called when the
+ * add operation has completed.
+ * @param async_cb_arg
+ * A pointer which will be supplied back to the application in the
+ * \c async_cb callback call.
+ * @param flags
+ * RTE_HTIMER_FLAG_ABSOLUTE_TIME and/or RTE_HTIMER_FLAG_PERIODICAL.
+ * @return
+ * - 0: Success
+ * - -EBUSY: The maximum number of concurrently queued asynchronous
+ * operations has been reached.
+ */
+__rte_experimental
+int
+rte_htimer_mgr_async_add(struct rte_htimer *timer,
+ unsigned int target_lcore_id,
+ uint64_t expiration_time, uint64_t period,
+ rte_htimer_cb_t timer_cb, void *timer_cb_arg,
+ uint32_t flags,
+ rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg);
+
+/**
+ * Asynchronously cancel a timer in any thread's timer wheel.
+ *
+ * This function is the equivalent of rte_htimer_mgr_cancel(), but
+ * allows the calling ("source") thread to also cancel a timer in an
+ * HWT other than its own. The operation is asynchronous.
+ *
+ * A thread may asynchronously cancel a timer scheduled on its own
+ * HWT.
+ *
+ * The \c async_cb callback is called on the source thread as a part
+ * of its rte_htimer_mgr_process(), rte_htimer_mgr_manage(), or
+ * rte_htimer_mgr_manage_time() call, when the asynchronous cancel
+ * operation has completed.
+ *
+ * \c async_cb may be NULL, in which case no notification is given.
+ *
+ * A timer may be asynchronously canceled at any point, by any thread,
+ * after it has been either synchronously or asynchronously added.
+ *
+ * rte_htimer_mgr_async_cancel() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ *
+ * @param timer
+ * The memory used for managing this timer. This memory must not be
+ * read or written (or free'd) by the application until this timer
+ * has expired, or any cancellation attempts have completed.
+ * @param async_cb
+ * The asynchronous operation callback to be called when the
+ * cancel operation has completed.
+ * @param async_cb_arg
+ * A pointer which will be supplied back to the application in the
+ * \c async_cb callback call.
+ * @return
+ * - 0: Success
+ * - -EBUSY: The maximum number of concurrently queued asynchronous
+ * operations has been reached.
+ */
+__rte_experimental
+int
+rte_htimer_mgr_async_cancel(struct rte_htimer *timer,
+ rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg);
+
+/**
+ * Update HWT time and perform timer expiry and asynchronous operation
+ * processing.
+ *
+ * This function is the equivalent of retrieving the current TSC time,
+ * and calling rte_htimer_mgr_manage_time().
+ *
+ * rte_htimer_mgr_manage() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_manage(void);
+
+/**
+ * Progress HWT time, and perform timer expiry and asynchronous
+ * operation processing in the process.
+ *
+ * This function progresses the calling thread's HWT up to the point
+ * specified by \c current_time, calling the callbacks of any expired
+ * timers.
+ *
+ * The time source must be a monotonic clock, and thus each new \c
+ * current_time must be equal to or greater than the time supplied in
+ * previous call.
+ *
+ * The timer precision for timers scheduled on a particular thread's
+ * HWT depends on that thread's call frequency to this function.
+ *
+ * rte_htimer_mgr_manage_time() also performs asynchronous operation
+ * processing. See rte_htimer_mgr_process() for details.
+ *
+ * rte_htimer_mgr_manage_time() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ *
+ * @param current_time
+ * The current time (usually in TSC).
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_manage_time(uint64_t current_time);
+
+/**
+ * Perform asynchronous operation processing.
+ *
+ * rte_htimer_mgr_process() serves pending asynchronous add or cancel
+ * requests, and produces the necessary responses. The timer callbacks
+ * of any added timers which have already expired are called.
+ *
+ * This function also processes asynchronous operation response
+ * messages received, and calls the asynchronous callbacks, if such
+ * was provided by the application.
+ *
+ * rte_htimer_mgr_process() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_process(void);
+
+/**
+ * Return the current local HWT time in TSC.
+ *
+ * This function returns the most recent time provided by this thread,
+ * either via rte_htimer_mgr_manage_time(), or as sampled by
+ * rte_htimer_mgr_manage().
+ *
+ * The initial time, prior to any manage-calls, is 0.
+ *
+ * rte_htimer_mgr_current_time() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ */
+
+__rte_experimental
+uint64_t
+rte_htimer_mgr_current_time(void);
+
+/**
+ * Return the current local HWT time in ticks.
+ *
+ * This function returns the current time of the calling thread's HWT. The
+ * tick is the current time provided by the application (via
+ * rte_htimer_mgr_manage_time()), or as retrieved (using
+ * rte_get_tsc_cycles() in rte_htimer_mgr_manage()), divided by the
+ * tick length (as provided in rte_htimer_mgr_init()).
+ *
+ * The initial time, prior to any manage-calls, is 0.
+ *
+ * rte_htimer_mgr_current_tick() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ */
+
+__rte_experimental
+uint64_t
+rte_htimer_mgr_current_tick(void);
+
+#endif
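As a usage illustration only (a sketch, not part of the patchset; error handling is elided, and the rte_htimer_cb_t callback signature of (struct rte_htimer *, void *) is an assumption), an lcore might use the API documented above roughly as follows:

```c
/* Sketch: a one-shot 10 ms timer serviced from an lcore's main loop.
 * Assumes EAL initialization has completed, and that
 * rte_htimer_mgr_init() has already been called once, by one thread,
 * with a tick length in the suggested 100 ns - 100 us range. */

static void
timeout_cb(struct rte_htimer *timer __rte_unused, void *cb_arg __rte_unused)
{
	/* expired; the rte_htimer memory may now be reused or freed */
}

static int
lcore_main(void *arg __rte_unused)
{
	struct rte_htimer timer;

	/* relative expiration time: 10 ms from now, in TSC cycles */
	rte_htimer_mgr_add(&timer, rte_get_tsc_hz() / 100, 0,
			   timeout_cb, NULL, 0);

	for (;;)
		rte_htimer_mgr_manage();

	return 0;
}
```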
diff --git a/lib/htimer/rte_htimer_msg.h b/lib/htimer/rte_htimer_msg.h
new file mode 100644
index 0000000000..ceb106e263
--- /dev/null
+++ b/lib/htimer/rte_htimer_msg.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTIMER_MSG_
+#define _RTE_HTIMER_MSG_
+
+#include <rte_htimer.h>
+
+typedef void (*rte_htimer_msg_async_op_cb_t)(struct rte_htimer *timer,
+ int result, void *cb_arg);
+
+enum rte_htimer_msg_type {
+ rte_htimer_msg_type_add_request,
+ rte_htimer_msg_type_add_response,
+ rte_htimer_msg_type_cancel_request,
+ rte_htimer_msg_type_cancel_response
+};
+
+struct rte_htimer_msg_request {
+ unsigned int source_lcore_id;
+};
+
+struct rte_htimer_msg_response {
+ int result;
+};
+
+struct rte_htimer_msg {
+ enum rte_htimer_msg_type msg_type;
+
+ struct rte_htimer *timer;
+
+ rte_htimer_msg_async_op_cb_t async_cb;
+ void *async_cb_arg;
+
+ union {
+ struct rte_htimer_msg_request request;
+ struct rte_htimer_msg_response response;
+ };
+};
+
+#endif
diff --git a/lib/htimer/rte_htimer_msg_ring.c b/lib/htimer/rte_htimer_msg_ring.c
new file mode 100644
index 0000000000..4019b7819a
--- /dev/null
+++ b/lib/htimer/rte_htimer_msg_ring.c
@@ -0,0 +1,18 @@
+#include "rte_htimer_msg_ring.h"
+
+struct rte_htimer_msg_ring *
+rte_htimer_msg_ring_create(const char *name, unsigned int count, int socket_id,
+ unsigned int flags)
+{
+ struct rte_ring *ring =
+ rte_ring_create_elem(name, sizeof(struct rte_htimer_msg),
+ count, socket_id, flags);
+
+ return (struct rte_htimer_msg_ring *)ring;
+}
+
+void
+rte_htimer_msg_ring_free(struct rte_htimer_msg_ring *msg_ring)
+{
+ rte_ring_free((struct rte_ring *)msg_ring);
+}
diff --git a/lib/htimer/rte_htimer_msg_ring.h b/lib/htimer/rte_htimer_msg_ring.h
new file mode 100644
index 0000000000..0e408991d1
--- /dev/null
+++ b/lib/htimer/rte_htimer_msg_ring.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTIMER_MSG_RING_
+#define _RTE_HTIMER_MSG_RING_
+
+#include <rte_ring.h>
+
+#include "rte_htimer_msg.h"
+
+struct rte_htimer_msg_ring {
+ struct rte_ring ring;
+};
+
+struct rte_htimer_msg_ring *
+rte_htimer_msg_ring_create(const char *name, unsigned int count, int socket_id,
+ unsigned int flags);
+
+void
+rte_htimer_msg_ring_free(struct rte_htimer_msg_ring *msg_ring);
+
+static inline unsigned int
+rte_htimer_msg_ring_dequeue_burst(struct rte_htimer_msg_ring *msg_ring,
+ struct rte_htimer_msg *msgs,
+ unsigned int n)
+{
+ unsigned int dequeued;
+
+ dequeued = rte_ring_dequeue_burst_elem(&msg_ring->ring, msgs,
+ sizeof(struct rte_htimer_msg),
+ n, NULL);
+
+ return dequeued;
+}
+
+static inline unsigned int
+rte_htimer_msg_ring_enqueue(struct rte_htimer_msg_ring *msg_ring,
+ struct rte_htimer_msg *msg)
+{
+ int rc;
+
+ rc = rte_ring_enqueue_elem(&msg_ring->ring, msg,
+ sizeof(struct rte_htimer_msg));
+
+ return rc;
+}
+
+#endif
diff --git a/lib/htimer/rte_htw.c b/lib/htimer/rte_htw.c
new file mode 100644
index 0000000000..0104dced34
--- /dev/null
+++ b/lib/htimer/rte_htw.c
@@ -0,0 +1,437 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+/*
+ * This is an implementation of a hierarchical timer wheel based on
+ * Hashed and Hierarchical Timing Wheels: Data Structures
+ * for the Efficient Implementation of a Timer Facility by Varghese and
+ * Lauck.
+ *
+ * To improve efficiency when the slots are sparsely populated (i.e.,
+ * many ticks do not have any timers), each slot is represented by a
+ * bit in a separately-managed, per-wheel, bitset. This allows for
+ * very efficient scanning. The cost of managing this bitset is small.
+ */
+
+#include <rte_bitset.h>
+#include <rte_branch_prediction.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+
+#include "rte_htw.h"
+
+#define TICK_BITS 64
+
+#define WHEEL_BITS 8
+#define WHEEL_SLOTS (1U << WHEEL_BITS)
+#define WHEEL_LEVELS (TICK_BITS / WHEEL_BITS)
+
+struct wheel {
+ uint64_t wheel_time;
+ RTE_BITSET_DECLARE(used_slots, WHEEL_SLOTS);
+ struct rte_htimer_list slots[WHEEL_SLOTS];
+};
+
+struct rte_htw {
+ uint64_t current_time;
+
+ struct wheel wheels[WHEEL_LEVELS];
+
+ struct rte_htimer_list added;
+ struct rte_htimer_list expiring;
+
+ struct rte_htimer *running_timer;
+};
+
+static uint64_t
+time_to_wheel_time(uint64_t t, uint16_t level)
+{
+ return t >> (level * WHEEL_BITS);
+}
+
+static uint64_t
+wheel_time_to_time(uint64_t wheel_time, uint16_t level)
+{
+ return wheel_time << (level * WHEEL_BITS);
+}
+
+static void
+wheel_init(struct wheel *wheel)
+{
+ uint16_t i;
+
+ wheel->wheel_time = 0;
+
+ rte_bitset_init(wheel->used_slots, WHEEL_SLOTS);
+
+ for (i = 0; i < WHEEL_SLOTS; i++)
+ LIST_INIT(&wheel->slots[i]);
+}
+
+static uint64_t
+list_next_timeout(struct rte_htimer_list *timers)
+{
+ struct rte_htimer *timer;
+ uint64_t candidate = UINT64_MAX;
+
+ LIST_FOREACH(timer, timers, entry)
+ candidate = RTE_MIN(timer->expiration_time, candidate);
+
+ return candidate;
+}
+
+static uint16_t
+wheel_time_to_slot(uint64_t wheel_time)
+{
+ return wheel_time % WHEEL_SLOTS;
+}
+
+static uint64_t
+wheel_current_slot_time(struct wheel *wheel, uint16_t level)
+{
+ return wheel->wheel_time << (level * WHEEL_BITS);
+}
+
+static uint64_t
+wheel_next_timeout(struct wheel *wheel, uint16_t level, uint64_t upper_bound)
+{
+ uint16_t start_slot;
+ ssize_t slot;
+
+ start_slot = wheel_current_slot_time(wheel, level);
+
+ if (wheel_time_to_time(wheel->wheel_time, level) >= upper_bound)
+ return upper_bound;
+
+ RTE_BITSET_FOREACH_SET_WRAP(slot, wheel->used_slots, WHEEL_SLOTS,
+ start_slot, WHEEL_SLOTS) {
+ struct rte_htimer_list *timers = &wheel->slots[slot];
+ uint64_t next_timeout;
+
+ next_timeout = list_next_timeout(timers);
+
+ if (next_timeout != UINT64_MAX)
+ return next_timeout;
+ }
+
+ return UINT64_MAX;
+}
+
+static uint16_t
+get_slot(uint64_t t, uint16_t level)
+{
+ uint64_t wheel_time;
+ uint16_t slot;
+
+ wheel_time = time_to_wheel_time(t, level);
+ slot = wheel_time_to_slot(wheel_time);
+
+ return slot;
+}
+
+struct rte_htw *
+rte_htw_create(void)
+{
+ struct rte_htw *htw;
+ uint16_t level;
+
+ RTE_BUILD_BUG_ON((TICK_BITS % WHEEL_BITS) != 0);
+ RTE_BUILD_BUG_ON(sizeof(uint16_t) * CHAR_BIT <= WHEEL_BITS);
+
+ htw = rte_malloc(NULL, sizeof(struct rte_htw), RTE_CACHE_LINE_SIZE);
+
+ if (htw == NULL) {
+ rte_errno = ENOMEM;
+ return NULL;
+ }
+
+ htw->current_time = 0;
+
+ LIST_INIT(&htw->added);
+ LIST_INIT(&htw->expiring);
+
+ for (level = 0; level < WHEEL_LEVELS; level++)
+ wheel_init(&htw->wheels[level]);
+
+ return htw;
+}
+
+void
+rte_htw_destroy(struct rte_htw *htw)
+{
+ rte_free(htw);
+}
+
+static uint16_t
+get_level(uint64_t remaining_time)
+{
+ /* remaining_time must be > 0; __builtin_clzll(0) is undefined */
+ int last_set = 64 - __builtin_clzll(remaining_time);
+
+ return (last_set - 1) / WHEEL_BITS;
+}
+
+static void
+mark_added(struct rte_htw *htw, struct rte_htimer *timer)
+{
+ timer->state = RTE_HTIMER_STATE_PENDING;
+ LIST_INSERT_HEAD(&htw->added, timer, entry);
+}
+
+void
+rte_htw_add(struct rte_htw *htw, struct rte_htimer *timer,
+ uint64_t expiration_time, uint64_t period,
+ rte_htimer_cb_t timer_cb, void *timer_cb_arg, uint32_t flags)
+{
+ RTE_ASSERT(rte_htimer_is_periodical(timer) ?
+ period > 0 : period == 0);
+
+ if (flags & RTE_HTIMER_FLAG_ABSOLUTE_TIME)
+ timer->expiration_time = expiration_time;
+ else
+ timer->expiration_time = htw->current_time + expiration_time;
+
+ timer->period = period;
+ timer->flags = flags;
+ timer->cb = timer_cb;
+ timer->cb_arg = timer_cb_arg;
+
+ mark_added(htw, timer);
+}
+
+void
+rte_htw_cancel(struct rte_htw *htw, struct rte_htimer *timer)
+{
+ /*
+ * One could consider clearing the relevant used_slots bit in
+ * case this was the last entry in the wheel's slot
+ * list. However, from a correctness point of view, a "false
+ * positive" is not an issue. From a performance perspective,
+ * checking the list head and clearing the bit is likely more
+ * expensive than just deferring a minor cost to a future
+ * rte_htw_manage() call.
+ */
+
+ RTE_ASSERT(timer->state == RTE_HTIMER_STATE_PENDING ||
+ timer->state == RTE_HTIMER_STATE_EXPIRED);
+
+ if (likely(timer->state == RTE_HTIMER_STATE_PENDING)) {
+ LIST_REMOVE(timer, entry);
+ timer->state = RTE_HTIMER_STATE_CANCELED;
+ } else if (timer == htw->running_timer) {
+ /* periodical timer being canceled by its own callback */
+ RTE_ASSERT(timer->flags & RTE_HTIMER_FLAG_PERIODICAL);
+
+ timer->state = RTE_HTIMER_STATE_CANCELED;
+
+ /* signals running timer canceled */
+ htw->running_timer = NULL;
+ }
+}
+
+static void
+mark_expiring(struct rte_htw *htw, struct rte_htimer *timer)
+{
+ LIST_INSERT_HEAD(&htw->expiring, timer, entry);
+}
+
+static void
+schedule_timer(struct rte_htw *htw, struct rte_htimer *timer)
+{
+ uint64_t remaining_time;
+ uint16_t level;
+ struct wheel *wheel;
+ uint16_t slot;
+ struct rte_htimer_list *slot_timers;
+
+ remaining_time = timer->expiration_time - htw->current_time;
+
+ level = get_level(remaining_time);
+
+ wheel = &htw->wheels[level];
+
+ slot = get_slot(timer->expiration_time, level);
+
+ slot_timers = &htw->wheels[level].slots[slot];
+
+ LIST_INSERT_HEAD(slot_timers, timer, entry);
+
+ rte_bitset_set(wheel->used_slots, slot);
+}
+
+static void
+process_added(struct rte_htw *htw)
+{
+ struct rte_htimer *timer;
+
+ while ((timer = LIST_FIRST(&htw->added)) != NULL) {
+ LIST_REMOVE(timer, entry);
+
+ if (timer->expiration_time > htw->current_time)
+ schedule_timer(htw, timer);
+ else
+ mark_expiring(htw, timer);
+ }
+}
+
+static void
+process_expiring(struct rte_htw *htw)
+{
+ struct rte_htimer *timer;
+
+ while ((timer = LIST_FIRST(&htw->expiring)) != NULL) {
+ bool is_periodical;
+ bool running_timer_canceled;
+
+ /*
+ * The timer struct cannot be safely accessed after the
+ * callback has been called (except for non-canceled
+ * periodical timers), since the callback may have freed
+ * (or reused) the memory.
+ */
+
+ LIST_REMOVE(timer, entry);
+
+ is_periodical = timer->flags & RTE_HTIMER_FLAG_PERIODICAL;
+
+ timer->state = RTE_HTIMER_STATE_EXPIRED;
+
+ htw->running_timer = timer;
+
+ timer->cb(timer, timer->cb_arg);
+
+ running_timer_canceled = htw->running_timer == NULL;
+
+ htw->running_timer = NULL;
+
+ if (is_periodical && !running_timer_canceled) {
+ timer->expiration_time += timer->period;
+ mark_added(htw, timer);
+ }
+ }
+}
+
+uint64_t
+rte_htw_current_time(struct rte_htw *htw)
+{
+ return htw->current_time;
+}
+
+uint64_t
+rte_htw_next_timeout(struct rte_htw *htw, uint64_t upper_bound)
+{
+ uint16_t level;
+
+ /* scheduling timeouts will sort them in temporal order */
+ process_added(htw);
+
+ if (!LIST_EMPTY(&htw->expiring))
+ return 0;
+
+ for (level = 0; level < WHEEL_LEVELS; level++) {
+ uint64_t wheel_timeout;
+
+ wheel_timeout = wheel_next_timeout(&htw->wheels[level],
+ level, upper_bound);
+ if (wheel_timeout != UINT64_MAX)
+ return RTE_MIN(wheel_timeout, upper_bound);
+ }
+
+ return upper_bound;
+}
+
+static __rte_always_inline void
+process_slot(struct rte_htw *htw, uint16_t level, struct wheel *wheel,
+ uint16_t slot)
+{
+ struct rte_htimer_list *slot_timers;
+ struct rte_htimer *timer;
+
+ slot_timers = &wheel->slots[slot];
+
+ rte_bitset_clear(wheel->used_slots, slot);
+
+ while ((timer = LIST_FIRST(slot_timers)) != NULL) {
+ LIST_REMOVE(timer, entry);
+
+ if (level == 0 || timer->expiration_time <= htw->current_time)
+ mark_expiring(htw, timer);
+ else
+ schedule_timer(htw, timer);
+ }
+}
+
+static __rte_always_inline void
+process_slots(struct rte_htw *htw, uint16_t level, struct wheel *wheel,
+ uint16_t start_slot, uint16_t num_slots)
+{
+ ssize_t slot;
+
+ RTE_BITSET_FOREACH_SET_WRAP(slot, wheel->used_slots, WHEEL_SLOTS,
+ start_slot, num_slots)
+ process_slot(htw, level, wheel, slot);
+}
+
+static void
+advance(struct rte_htw *htw)
+{
+ uint16_t level;
+
+ for (level = 0; level < WHEEL_LEVELS; level++) {
+ struct wheel *wheel = &htw->wheels[level];
+ uint64_t new_wheel_time;
+ uint16_t start_slot;
+ uint16_t num_slots;
+
+ new_wheel_time = time_to_wheel_time(htw->current_time, level);
+
+ if (new_wheel_time == wheel->wheel_time)
+ break;
+
+ start_slot = wheel_time_to_slot(wheel->wheel_time + 1);
+ num_slots = RTE_MIN(new_wheel_time - wheel->wheel_time,
+ WHEEL_SLOTS);
+
+ wheel->wheel_time = new_wheel_time;
+
+ process_slots(htw, level, wheel, start_slot, num_slots);
+ }
+}
+
+void
+rte_htw_manage(struct rte_htw *htw, uint64_t new_time)
+{
+ RTE_VERIFY(new_time >= htw->current_time);
+
+ /*
+ * Scheduling added timers, core timer wheel processing, and
+ * expiry callback execution are kept as separate stages, to
+ * avoid having the core wheel traversal code deal with a
+ * timeout callback re-adding its timer. This split also
+ * gives reasonable semantics when an already-expired timeout
+ * (e.g., with relative time 0) is added from within a timeout
+ * callback: instead of entering an endless loop, with
+ * rte_htw_manage() never returning, the execution of that
+ * timer is deferred until the next rte_htw_manage() call.
+ */
+
+ process_added(htw);
+
+ htw->current_time = new_time;
+
+ advance(htw);
+
+ process_expiring(htw);
+}
+
+void
+rte_htw_process(struct rte_htw *htw)
+{
+ process_added(htw);
+ process_expiring(htw);
+}
diff --git a/lib/htimer/rte_htw.h b/lib/htimer/rte_htw.h
new file mode 100644
index 0000000000..c93358bb13
--- /dev/null
+++ b/lib/htimer/rte_htw.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTW_H_
+#define _RTE_HTW_H_
+
+#include <stdint.h>
+#include <sys/queue.h>
+
+#include <rte_htimer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+struct rte_htw;
+
+struct rte_htw *
+rte_htw_create(void);
+
+void
+rte_htw_destroy(struct rte_htw *htw);
+
+void
+rte_htw_add(struct rte_htw *htw, struct rte_htimer *timer,
+ uint64_t expiration_time, uint64_t period,
+ rte_htimer_cb_t cb, void *cb_arg, uint32_t flags);
+
+void
+rte_htw_cancel(struct rte_htw *htw, struct rte_htimer *timer);
+
+uint64_t
+rte_htw_current_time(struct rte_htw *htw);
+
+uint64_t
+rte_htw_next_timeout(struct rte_htw *htw, uint64_t upper_bound);
+
+void
+rte_htw_manage(struct rte_htw *htw, uint64_t new_time);
+
+void
+rte_htw_process(struct rte_htw *htw);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_HTW_H_ */
diff --git a/lib/htimer/version.map b/lib/htimer/version.map
new file mode 100644
index 0000000000..0e71dc7d57
--- /dev/null
+++ b/lib/htimer/version.map
@@ -0,0 +1,17 @@
+EXPERIMENTAL {
+ global:
+
+ rte_htimer_mgr_init;
+ rte_htimer_mgr_deinit;
+ rte_htimer_mgr_add;
+ rte_htimer_mgr_cancel;
+ rte_htimer_mgr_async_add;
+ rte_htimer_mgr_async_cancel;
+ rte_htimer_mgr_manage;
+ rte_htimer_mgr_manage_time;
+ rte_htimer_mgr_process;
+ rte_htimer_mgr_current_time;
+ rte_htimer_mgr_current_tick;
+
+ local: *;
+};
diff --git a/lib/meson.build b/lib/meson.build
index 2bc0932ad5..c7c0e42ae8 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -37,6 +37,7 @@ libraries = [
'gpudev',
'gro',
'gso',
+ 'htimer',
'ip_frag',
'jobstats',
'kni',
--
2.34.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [RFC 0/2] Add high-performance timer facility
2023-02-28 9:39 [RFC 0/2] Add high-performance timer facility Mattias Rönnblom
2023-02-28 9:39 ` [RFC 1/2] eal: add bitset type Mattias Rönnblom
2023-02-28 9:39 ` [RFC 2/2] eal: add high-performance timer facility Mattias Rönnblom
@ 2023-02-28 16:01 ` Morten Brørup
2023-03-01 11:18 ` Mattias Rönnblom
2023-03-15 17:03 ` [RFC v2 " Mattias Rönnblom
3 siblings, 1 reply; 31+ messages in thread
From: Morten Brørup @ 2023-02-28 16:01 UTC (permalink / raw)
To: Mattias Rönnblom, dev
Cc: Erik Gabriel Carrillo, David Marchand, maria.lingemark, Stefan Sundkvist
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Tuesday, 28 February 2023 10.39
I have been looking for a high performance timer library (for use in a fast path TCP stack), and this looks very useful, Mattias.
My initial feedback is based on quickly skimming the patch source code, and reading this cover letter.
>
> This patchset is an attempt to introduce a high-performance, highly
> scalable timer facility into DPDK.
>
> More specifically, the goals for the htimer library are:
>
> * Efficient handling of a handful up to hundreds of thousands of
> concurrent timers.
> * Reduced overhead of adding and canceling timers.
> * Provide a service functionally equivalent to that of
> <rte_timer.h>. API/ABI backward compatibility is secondary.
>
> In the author's opinion, there are two main shortcomings with the
> current DPDK timer library (i.e., rte_timer.[ch]).
>
> One is the synchronization overhead, where heavy-weight full-barrier
> type synchronization is used. rte_timer.c uses per-EAL/lcore skip
> lists, but any thread may add or cancel (or otherwise access) timers
> managed by another lcore (and thus resides in its timer skip list).
>
> The other is an algorithmic shortcoming, with rte_timer.c's reliance
> on a skip list, which, seemingly, is less efficient than certain
> alternatives.
>
> This patchset implements a hierarchical timer wheel (HWT, in
Typo: HWT or HTW?
> rte_htw.c), as per the Varghese and Lauck paper "Hashed and
> Hierarchical Timing Wheels: Data Structures for the Efficient
> Implementation of a Timer Facility". A HWT is a data structure
> purposely design for this task, and used by many operating system
> kernel timer facilities.
>
> To further improve the solution described by Varghese and Lauck, a
> bitset is placed in front of each of the timer wheel in the HWT,
> reducing overhead of rte_htimer_mgr_manage() (i.e., progressing time
> and expiry processing).
>
> Cycle-efficient scanning and manipulation of these bitsets are crucial
> for the HWT's performance.
>
> The htimer module keeps a per-lcore (or per-registered EAL thread) HWT
> instance, much like rte_timer.c keeps a per-lcore skip list.
>
> To avoid expensive synchronization overhead for thread-local timer
> management, the HWTs are accessed only from the "owning" thread. Any
> interaction any other thread has with a particular lcore's timer
> wheel goes over a set of DPDK rings. A side-effect of this design is
> that all operations working toward a "remote" HWT must be
> asynchronous.
>
> The <rte_htimer.h> API is available only to EAL threads and registered
> non-EAL threads.
>
> The htimer API allows the application to supply the current time,
> useful in case it already has retrieved this for other purposes,
> saving the cost of a rdtsc instruction (or its equivalent).
>
> Relative htimer does not retrieve a new time, but reuse the current
> time (as known via/at-the-time of the manage-call), again to shave off
> some cycles of overhead.
I have a comment to the two points above.
I agree that the application should supply the current time.
This should be the concept throughout the library. I don't understand why TSC is used in the library at all?
Please use a unit-less tick, and let the application decide what one tick means.
A unit-less tick will also let the application instantiate an HTW with higher resolution than the TSC. (E.g. think about oversampling in audio processing, or Bresenham's line drawing algorithm for 2D visuals - oversampling can sound and look better.)
For reference (supporting my suggestion), the dynamic timestamp field in the rte_mbuf structure is also defined as being unit-less. (I think NVIDIA implements it as nanoseconds, but that's an implementation specific choice.)
>
> A semantic improvement compared to the <rte_timer.h> API is that the
> htimer library can give a definite answer on the question if the timer
> expiry callback was called, after a timer has been canceled.
>
> Below is a performance data from DPDK's 'app/test' micro benchmarks,
> using 10k concurrent timers. The benchmarks (test_timer_perf.c and
> test_htimer_mgr_perf.c) aren't identical in their structure, but the
> numbers give some indication of the difference.
>
> Use case htimer timer
> ------------------------------------
> Add timer 28 253
> Cancel timer 10 412
> Async add (source lcore) 64
> Async add (target lcore) 13
>
> (AMD 5900X CPU. Time in TSC.)
>
> Prototype integration of the htimer library into real, timer-heavy,
> applications indicates that htimer may result in significant
> application-level performance gains.
>
> The bitset implementation which the HWT implementation depends upon
> seemed generic-enough and potentially useful outside the world of
> HWTs, to justify being located in the EAL.
>
> This patchset is very much an RFC, and the author is yet to form an
> opinion on many important issues.
>
> * If deemed a suitable replacement, should the htimer replace the
> current DPDK timer library in some particular (ABI-breaking)
> release, or should it live side-by-side with the then-legacy
> <rte_timer.h> API? A lot of things in and outside DPDK depend on
> <rte_timer.h>, so coexistence may be required to facilitate a smooth
> transition.
It's my immediate impression that they are totally different in both design philosophy and API.
Personal opinion: I would call it an entirely different library.
>
> * Should the htimer and htw-related files be colocated with rte_timer.c
> in the timer library?
Personal opinion: No. This is an entirely different library, and should live for itself in a directory of its own.
>
> * Would it be useful for applications using asynchronous cancel to
> have the option of having the timer callback run not only in case of
> timer expiration, but also cancellation (on the target lcore)? The
> timer cb signature would need to include an additional parameter in
> that case.
If one thread cancels something in another thread, some synchronization between the threads is going to be required anyway. So we could rephrase your question: Will the burden of the otherwise required synchronization between the two threads be significantly reduced if the library has the ability to run the callback on asynchronous cancel?
Is such a feature mostly "Must have" or "Nice to have"?
More thoughts in this area...
If adding an additional callback parameter, it could be an enum, so the callback could be expanded to support "timeout (a.k.a. timer fired)", "cancel" and more events we have not yet come up with, e.g. "early kick".
Here's an idea off the top of my head: An additional callback parameter has a (small) performance cost incurred with every timer fired (which is a very large multiplier). It might not be required. As an alternative to an "what happened" parameter to the callback, the callback could investigate the state of the object for which the timer fired, and draw its own conclusion on how to proceed. Obviously, this also has a performance cost, but perhaps the callback works on the object's state anyway, making this cost insignificant.
Here's another alternative to adding a "what happened" parameter to the callback:
The rte_htimer could have one more callback pointer, which (if set) will be called on cancellation of the timer.
>
> * Should the rte_htimer be a nested struct, so the htw parts be separated
> from the htimer parts?
>
> * <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
> + * <rte_htw.h> may avoid a dependency to <rte_htimer_mgr.h>. Should it
> be so?
>
> * rte_htimer struct is only supposed to be used by the application to
> give an indication of how much memory it needs to allocate, and is
> its member are not supposed to be directly accessed (w/ the possible
> exception of the owner_lcore_id field). Should there be a dummy
> struct, or a #define RTE_HTIMER_MEMSIZE or a rte_htimer_get_memsize()
> function instead, serving the same purpose? Better encapsulation,
> but more inconvenient for applications. Run-time dynamic sizing
> would force application-level dynamic allocations.
>
> * Asynchronous cancellation is a little tricky to use for the
> application (primarily due to timer memory reclamation/race
> issues). Should this functionality be removed?
>
> * Should rte_htimer_mgr_init() also retrieve the current time? If so,
> there should to be a variant which allows the user to specify the
> time (to match rte_htimer_mgr_manage_time()). One pitfall with the
> current proposed API is an application calling rte_htimer_mgr_init()
> and then immediately adding a timer with a relative timeout, in
> which case the current absolute time used is 0, which might be a
> surprise.
>
> * Should libdivide (optionally) be used to avoid the div in the TSC ->
> tick conversion? (Doesn't improve performance on Zen 3, but may
> do on other CPUs.) Consider <rte_reciprocal.h> as well.
>
> * Should the TSC-per-tick be rounded up to a power of 2, so shifts can be
> used for conversion? Very minor performance gains to be found there,
> at least on Zen 3 cores.
>
> * Should it be possible to supply the time in rte_htimer_mgr_add()
> and/or rte_htimer_mgr_manage_time() functions as ticks, rather than
> as TSC? Should it be possible to also use nanoseconds?
> rte_htimer_mgr_manage_time() would need a flags parameter in that
> case.
Do not use TSC anywhere in this library. Let the application decide the meaning of a tick.
>
> * Would the event timer adapter be best off using <rte_htw.h>
> directly, or <rte_htimer.h>? In the latter case, there needs to be a
> way to instantiate more HWTs (similar to the "alt" functions of
> <rte_timer.h>)?
>
> * Should the PERIODICAL flag (and the complexity it brings) be
> removed? And leave the application with only single-shot timers, and
> the option to re-add them in the timer callback.
First thought: Yes, keep it lean and remove the periodical stuff.
Second thought: This needs a more detailed analysis.
From one angle:
How many PERIODICAL versus ONESHOT timers do we expect?
Intuitively, I would use this library for ONESHOT timers, and perhaps implement my periodical timers by other means.
If the PERIODICAL:ONESHOT ratio is low, we can probably live with the extra cost of cancel+add for a few periodical timers.
From another angle:
What is the performance gain with the PERIODICAL flag?
Without a periodical timer, cancel+add costs 10+28 cycles. How many cycles would a "move" function, performing both cancel and add, use?
And then compare that to the cost (in cycles) of repeating a timer with PERIODICAL?
Furthermore, not having the PERIODICAL flag probably improves the performance for non-periodical timers. How many cycles could we gain here?
Another, vaguely related, idea:
The callback pointer might not need to be stored per rte_htimer, but could instead be common for the rte_htw.
When a timer fires, the callback probably needs to check/update the state of the object for which the timer fired anyway, so why not just let the application use that state to determine the appropriate action. This might provide some performance benefit.
It might complicate using one HTW for multiple different purposes, though. Probably a useless idea, but I wanted to share the idea anyway. It might trigger other, better ideas in the community.
>
> * Should the async result codes and the sync cancel error codes be merged
> into one set of result codes?
>
> * Should the rte_htimer_mgr_async_add() have a flag which allow
> buffering add request messages until rte_htimer_mgr_process() is
> called? Or any manage function. Would reduce ring signaling overhead
> (i.e., burst enqueue operations instead of single-element
> enqueue). Could also be a rte_htimer_mgr_async_add_burst() function,
> solving the same "problem" a different way. (The signature of such
> a function would not be pretty.)
>
> * Does the functionality provided by the rte_htimer_mgr_process()
> function match its use cases? Should there be a more clear
> separation between expiry processing and asynchronous operation
> processing?
>
> * Should the patchset be split into more commits? If so, how?
>
> Thanks to Erik Carrillo for his assistance.
>
> Mattias Rönnblom (2):
> eal: add bitset type
> eal: add high-performance timer facility
* Re: [RFC 1/2] eal: add bitset type
2023-02-28 9:39 ` [RFC 1/2] eal: add bitset type Mattias Rönnblom
@ 2023-02-28 18:46 ` Tyler Retzlaff
2023-03-02 6:31 ` Mattias Rönnblom
0 siblings, 1 reply; 31+ messages in thread
From: Tyler Retzlaff @ 2023-02-28 18:46 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist
On Tue, Feb 28, 2023 at 10:39:15AM +0100, Mattias Rönnblom wrote:
> Introduce a set of functions and macros that operate on sets of bits,
> kept in arrays of 64-bit elements.
>
> RTE bitset is designed for bitsets which are larger than what fits in
> a single machine word (i.e., 64 bits). For very large bitsets, the
> <rte_bitmap.h> API may be a more appropriate choice.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
...
> diff --git a/lib/eal/include/rte_bitset.h b/lib/eal/include/rte_bitset.h
> new file mode 100644
> index 0000000000..e333e527e5
> --- /dev/null
> +++ b/lib/eal/include/rte_bitset.h
> @@ -0,0 +1,878 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Ericsson AB
> + */
> +
> +#ifndef _RTE_BITSET_H_
> +#define _RTE_BITSET_H_
> +
> +/**
> + * @file
> + * RTE Bitset
> + *
> + * This file provides functions and macros for querying and
> + * manipulating sets of bits kept in arrays of @c uint64_t-sized
> + * elements.
> + *
> + * The bits in a bitset are numbered from 0 to @c size - 1, with the
> + * lowest index being the least significant bit.
> + *
> + * The bitset array must be properly aligned.
> + *
> + * For optimal performance, the @c size parameter, required by
> + * many of the API's functions, should be a compile-time constant.
> + *
> + * For large bitsets, the rte_bitmap.h API may be more appropriate.
> + *
> + * @warning
> + * All functions modifying a bitset may overwrite any unused bits of
> + * the last word. Such unused bits are ignored by all functions reading
> + * bits.
> + *
> + */
> +
> +#include <limits.h>
> +#include <stdbool.h>
> +#include <stdint.h>
> +#include <sys/types.h>
Windows has no sys/types.h; if there is a shim being picked up
somewhere, portable code shouldn't depend on sys/types.h.
> +
> +#include <rte_branch_prediction.h>
> +#include <rte_common.h>
> +#include <rte_debug.h>
> +#include <rte_memcpy.h>
> +
> +#ifdef __cplusplus
> +extern "C" {
> +#endif
> +
> +/**
> + * The size (in bytes) of each element in the array used to represent
> + * a bitset.
> + */
> +#define RTE_BITSET_WORD_SIZE (sizeof(uint64_t))
> +
> +/**
> + * The size (in bits) of each element in the array used to represent
> + * a bitset.
> + */
> +#define RTE_BITSET_WORD_BITS (RTE_BITSET_WORD_SIZE * CHAR_BIT)
> +
> +/**
> + * Computes the number of words required to store @c size bits.
> + */
> +#define RTE_BITSET_NUM_WORDS(size) \
> + ((size + RTE_BITSET_WORD_BITS - 1) / RTE_BITSET_WORD_BITS)
> +
> +/**
> + * Computes the amount of memory (in bytes) required to fit a bitset
> + * holding @c size bits.
> + */
> +#define RTE_BITSET_SIZE(size) \
> + ((size_t)(RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_SIZE))
> +
> +#define __RTE_BITSET_WORD_IDX(bit_num) ((bit_num) / RTE_BITSET_WORD_BITS)
> +#define __RTE_BITSET_BIT_OFFSET(bit_num) ((bit_num) % RTE_BITSET_WORD_BITS)
> +#define __RTE_BITSET_UNUSED(size) \
> + ((RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_BITS) \
> + - (size))
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Declare a bitset.
> + *
> + * Declare (e.g., as a struct field) or define (e.g., as a stack
> + * variable) a bitset of the specified size.
> + *
> + * @param size
> + * The number of bits the bitset must be able to represent. Must be
> + * a compile-time constant.
> + * @param name
> + * The field or variable name of the resulting definition.
> + */
> +#define RTE_BITSET_DECLARE(name, size) \
> + uint64_t name[RTE_BITSET_NUM_WORDS(size)]
> +
> +/* XXX: should one include flags here and use to avoid a comparison? */
> +/* XXX: would this be better off as a function? */
> +
> +#define __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, len) \
> + ((len) - 1 - ((var) >= (start_bit) ? (var) - (start_bit) : \
> + (size) - (start_bit) + (var)))
> +
> +#define __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, flags) \
> + for ((var) = __rte_bitset_find(bitset, size, start_bit, len, \
> + flags); \
> + (var) != -1; \
> + (var) = __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, \
> + len) > 0 ? \
> + __rte_bitset_find(bitset, size, \
> + ((var) + 1) % (size), \
> + __RTE_BITSET_FOREACH_LEFT(var, \
> + size, \
> + start_bit, \
> + len), \
> + flags) : -1)
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Iterate over all bits set.
> + *
> + * This macro iterates over all bits set (i.e., all ones) in the
> + * bitset, in the forward direction (i.e., starting with the least
> + * significant '1').
> + *
> + * @param var
> + * An iterator variable of type @c ssize_t. For each successive iteration,
> + * this variable will hold the bit index of a set bit.
> + * @param bitset
> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
> + * @param size
> + * The size of the bitset (in bits).
> + */
> +
> +#define RTE_BITSET_FOREACH_SET(var, bitset, size) \
> + __RTE_BITSET_FOREACH(var, bitset, size, 0, size, 0)
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Iterate over all bits cleared.
> + *
> + * This macro iterates over all bits cleared in the bitset, in the
> + * forward direction (i.e., starting with the least significant '0').
> + *
> + * @param var
> + * An iterator variable of type @c ssize_t. For each successive iteration,
> + * this variable will hold the bit index of a cleared bit.
> + * @param bitset
> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
> + * @param size
> + * The size of the bitset (in bits).
> + */
> +
> +#define RTE_BITSET_FOREACH_CLEAR(var, bitset, size) \
> + __RTE_BITSET_FOREACH(var, bitset, size, 0, size, \
> + __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Iterate over all bits set within a range.
> + *
> + * This macro iterates over all bits set (i.e., all ones) in the
> + * specified range, in the forward direction (i.e., starting with the
> + * least significant '1').
> + *
> + * @param var
> + * An iterator variable of type @c ssize_t. For each successive iteration,
> + * this variable will hold the bit index of a set bit.
> + * @param bitset
> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
> + * @param size
> + * The size of the bitset (in bits).
> + * @param start_bit
> + * The index of the first bit to check. Must be less than @c size.
> + * @param len
> + * The length (in bits) of the range. @c start_bit + @c len must be less
> + * than or equal to @c size.
> + */
> +
> +#define RTE_BITSET_FOREACH_SET_RANGE(var, bitset, size, start_bit, \
> + len) \
> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, 0)
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Iterate over all cleared bits within a range.
> + *
> + * This macro iterates over all bits cleared (i.e., all zeroes) in the
> + * specified range, in the forward direction (i.e., starting with the
> + * least significant '0').
> + *
> + * @param var
> + * An iterator variable of type @c ssize_t. For each successive iteration,
> + * this variable will hold the bit index of a cleared bit.
> + * @param bitset
> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
> + * @param size
> + * The size of the bitset (in bits).
> + * @param start_bit
> + * The index of the first bit to check. Must be less than @c size.
> + * @param len
> + * The length (in bits) of the range. @c start_bit + @c len must be less
> + * than or equal to @c size.
> + */
> +
> +#define RTE_BITSET_FOREACH_CLEAR_RANGE(var, bitset, size, start_bit, \
> + len) \
> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
> + __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
> +
> +#define RTE_BITSET_FOREACH_SET_WRAP(var, bitset, size, start_bit, \
> + len) \
> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
> + __RTE_BITSET_FIND_FLAG_WRAP)
> +
> +#define RTE_BITSET_FOREACH_CLEAR_WRAP(var, bitset, size, start_bit, \
> + len) \
> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
> + __RTE_BITSET_FIND_FLAG_WRAP | \
> + __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Initializes a bitset.
> + *
> + * All bits are cleared.
> + *
> + * @param bitset
> + * A pointer to the array of bitset 64-bit words.
> + * @param size
> + * The size of the bitset (in bits).
> + */
> +
> +__rte_experimental
> +static inline void
> +rte_bitset_init(uint64_t *bitset, size_t size)
> +{
> + memset(bitset, 0, RTE_BITSET_SIZE(size));
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Set a bit in the bitset.
> + *
> + * Bits are numbered from 0 to (size - 1) (inclusive).
> + *
> + * @param bitset
> + * A pointer to the array words making up the bitset.
> + * @param bit_num
> + * The index of the bit to be set.
> + */
> +
> +__rte_experimental
> +static inline void
> +rte_bitset_set(uint64_t *bitset, size_t bit_num)
> +{
> + size_t word;
> + size_t offset;
> + uint64_t mask;
> +
> + word = __RTE_BITSET_WORD_IDX(bit_num);
> + offset = __RTE_BITSET_BIT_OFFSET(bit_num);
> + mask = UINT64_C(1) << offset;
> +
> + bitset[word] |= mask;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Clear a bit in the bitset.
> + *
> + * Bits are numbered 0 to (size - 1) (inclusive).
> + *
> + * @param bitset
> + * A pointer to the array of words making up the bitset.
> + * @param bit_num
> + * The index of the bit to be cleared.
> + */
> +
> +__rte_experimental
> +static inline void
> +rte_bitset_clear(uint64_t *bitset, size_t bit_num)
> +{
> + size_t word;
> + size_t offset;
> + uint64_t mask;
> +
> + word = __RTE_BITSET_WORD_IDX(bit_num);
> + offset = __RTE_BITSET_BIT_OFFSET(bit_num);
> + mask = ~(UINT64_C(1) << offset);
> +
> + bitset[word] &= mask;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Set all bits in the bitset.
> + *
> + * @param bitset
> + * A pointer to the array of words making up the bitset.
> + * @param size
> + * The size of the bitset (in bits).
> + */
> +
> +__rte_experimental
> +static inline void
> +rte_bitset_set_all(uint64_t *bitset, size_t size)
> +{
> + memset(bitset, 0xFF, RTE_BITSET_SIZE(size));
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Clear all bits in the bitset.
> + *
> + * @param bitset
> + * A pointer to the array of words making up the bitset.
> + * @param size
> + * The size of the bitset (in bits).
> + */
> +
> +__rte_experimental
> +static inline void
> +rte_bitset_clear_all(uint64_t *bitset, size_t size)
> +{
> + rte_bitset_init(bitset, size);
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Count all set bits.
> + *
> + * @param bitset
> + * A pointer to the array of words making up the bitset.
> + * @param size
> + * The size of the bitset (in bits).
> + * @return
> + * Returns the number of '1' bits in the bitset.
> + */
> +
> +__rte_experimental
> +static inline size_t
> +rte_bitset_count_set(const uint64_t *bitset, size_t size)
> +{
> + size_t i;
> + size_t total = 0;
> + uint64_t unused_mask;
> +
> + /*
> + * Unused bits in an rte_bitset are always '0', and thus are
> + * not included in this count.
> + */
> + for (i = 0; i < RTE_BITSET_NUM_WORDS(size) - 1; i++)
> + total += __builtin_popcountll(bitset[i]);
> +
> + unused_mask = UINT64_MAX >> __RTE_BITSET_UNUSED(size);
> + total += __builtin_popcountll(bitset[i] & unused_mask);
> +
> + return total;
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Count all cleared bits.
> + *
> + * @param bitset
> + * A pointer to the array of words making up the bitset.
> + * @param size
> + * The size of the bitset (in bits).
> + * @return
> + * Returns the number of '0' bits in the bitset.
> + */
> +
> +__rte_experimental
> +static inline size_t
> +rte_bitset_count_clear(const uint64_t *bitset, size_t size)
> +{
> + return size - rte_bitset_count_set(bitset, size);
> +}
> +
> +/**
> + * @warning
> + * @b EXPERIMENTAL: this API may change without prior notice.
> + *
> + * Test if a bit is set.
> + *
> + * @param bitset
> + * A pointer to the array of words making up the bitset.
> + * @param bit_num
> + * Index of the bit to test. Index 0 is the least significant bit.
> + * @return
> + * Returns true if the bit is '1', and false if the bit is '0'.
> + */
> +
> +__rte_experimental
> +static inline bool
> +rte_bitset_test(const uint64_t *bitset, size_t bit_num)
> +{
> + size_t word;
> + size_t offset;
> +
> + word = __RTE_BITSET_WORD_IDX(bit_num);
> + offset = __RTE_BITSET_BIT_OFFSET(bit_num);
> +
> + return (bitset[word] >> offset) & 1;
> +}
> +
> +#define __RTE_BITSET_FIND_FLAG_FIND_CLEAR (1U << 0)
> +#define __RTE_BITSET_FIND_FLAG_WRAP (1U << 1)
> +
> +__rte_experimental
> +static inline ssize_t
> +__rte_bitset_find_nowrap(const uint64_t *bitset, size_t __rte_unused size,
> + size_t start_bit, size_t len, bool find_clear)
^^ seems like the intent here is for this to be internal (not part of
the public API). I wonder, do we have to mark it __rte_experimental?
Or better, can we prevent visibility/consumption outside of the
inline functions in this header?
> +{
> + size_t word_idx;
> + size_t offset;
> + size_t end_bit = start_bit + len;
> +
> + RTE_ASSERT(end_bit <= size);
> +
> + if (unlikely(len == 0))
> + return -1;
> +
> + word_idx = __RTE_BITSET_WORD_IDX(start_bit);
> + offset = __RTE_BITSET_BIT_OFFSET(start_bit);
> +
> + while (word_idx <= __RTE_BITSET_WORD_IDX(end_bit - 1)) {
> + uint64_t word;
> + int word_ffs;
> +
> + word = bitset[word_idx];
> + if (find_clear)
> + word = ~word;
> +
> + word >>= offset;
> +
> + word_ffs = __builtin_ffsll(word);
> +
> + if (word_ffs != 0) {
> + ssize_t ffs = start_bit + word_ffs - 1;
> +
> + /*
> + * Check if the set bit was among the last,
> + * unused bits, in the last word.
> + */
> + if (unlikely(ffs >= (ssize_t)end_bit))
> + return -1;
> +
> + return ffs;
> + }
> +
> + start_bit += (RTE_BITSET_WORD_BITS - offset);
> + word_idx++;
> + offset = 0;
> + }
> +
> + return -1;
> +
> +}
> +
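For illustration, the word-index/offset arithmetic and the ffsll-based range scan used by __rte_bitset_find_nowrap() above can be mirrored in a standalone sketch. The names below are illustrative stand-ins, not the actual rte_bitset API:

```c
#include <stdint.h>
#include <stddef.h>

#define WORD_BITS 64

/* Illustrative stand-ins for __RTE_BITSET_WORD_IDX()/_BIT_OFFSET() */
static inline size_t word_idx(size_t bit) { return bit / WORD_BITS; }
static inline size_t bit_offset(size_t bit) { return bit % WORD_BITS; }

static inline void bit_set(uint64_t *set, size_t bit)
{
	set[word_idx(bit)] |= UINT64_C(1) << bit_offset(bit);
}

static inline int bit_test(const uint64_t *set, size_t bit)
{
	return (set[word_idx(bit)] >> bit_offset(bit)) & 1;
}

/*
 * Scan [start, start + len) for the first set bit, one 64-bit word
 * at a time, like __rte_bitset_find_nowrap(). Returns -1 if none.
 */
static long find_set(const uint64_t *set, size_t start, size_t len)
{
	size_t end = start + len;
	size_t idx = word_idx(start);
	size_t off = bit_offset(start);

	if (len == 0)
		return -1;

	while (idx <= word_idx(end - 1)) {
		uint64_t word = set[idx] >> off;
		int ffs = __builtin_ffsll(word);

		if (ffs != 0) {
			long bit = (long)(start + ffs - 1);

			/* the set bit may lie past the requested range */
			return bit < (long)end ? bit : -1;
		}

		start += WORD_BITS - off;
		idx++;
		off = 0;
	}

	return -1;
}
```

With two 64-bit words (128 bits), setting bit 70 and scanning from 0 finds 70, while a scan of the range after it finds nothing; the per-word shift is what the bitset-in-front-of-each-wheel design relies on for cheap expiry scans.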
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC 0/2] Add high-performance timer facility
2023-02-28 16:01 ` [RFC 0/2] Add " Morten Brørup
@ 2023-03-01 11:18 ` Mattias Rönnblom
2023-03-01 13:31 ` Morten Brørup
0 siblings, 1 reply; 31+ messages in thread
From: Mattias Rönnblom @ 2023-03-01 11:18 UTC (permalink / raw)
To: Morten Brørup, dev
Cc: Erik Gabriel Carrillo, David Marchand, Maria Lingemark, Stefan Sundkvist
On 2023-02-28 17:01, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Tuesday, 28 February 2023 10.39
>
> I have been looking for a high performance timer library (for use in a fast path TCP stack), and this looks very useful, Mattias.
>
> My initial feedback is based on quickly skimming the patch source code, and reading this cover letter.
>
>>
>> This patchset is an attempt to introduce a high-performance, highly
>> scalable timer facility into DPDK.
>>
>> More specifically, the goals for the htimer library are:
>>
>> * Efficient handling of a handful up to hundreds of thousands of
>> concurrent timers.
>> * Reduced overhead of adding and canceling timers.
>> * Provide a service functionally equivalent to that of
>> <rte_timer.h>. API/ABI backward compatibility is secondary.
>>
>> In the author's opinion, there are two main shortcomings with the
>> current DPDK timer library (i.e., rte_timer.[ch]).
>>
>> One is the synchronization overhead, where heavy-weight full-barrier
>> type synchronization is used. rte_timer.c uses per-EAL/lcore skip
>> lists, but any thread may add or cancel (or otherwise access) timers
>> managed by another lcore (and thus resides in its timer skip list).
>>
>> The other is an algorithmic shortcoming, with rte_timer.c's reliance
>> on a skip list, which, seemingly, is less efficient than certain
>> alternatives.
>>
>> This patchset implements a hierarchical timer wheel (HWT, in
>
> Typo: HWT or HTW?
Yes. I don't understand how I managed to make so many HTW ->
HWT typos. At least I got the filenames (rte_htw.[ch]) correct.
>
>> rte_htw.c), as per the Varghese and Lauck paper "Hashed and
>> Hierarchical Timing Wheels: Data Structures for the Efficient
>> Implementation of a Timer Facility". A HWT is a data structure
>> purposely designed for this task, and used by many operating system
>> kernel timer facilities.
>>
>> To further improve the solution described by Varghese and Lauck, a
>> bitset is placed in front of each of the timer wheel in the HWT,
>> reducing overhead of rte_htimer_mgr_manage() (i.e., progressing time
>> and expiry processing).
>>
>> Cycle-efficient scanning and manipulation of these bitsets are crucial
>> for the HWT's performance.
>>
>> The htimer module keeps a per-lcore (or per-registered EAL thread) HWT
>> instance, much like rte_timer.c keeps a per-lcore skip list.
>>
>> To avoid expensive synchronization overhead for thread-local timer
>> management, the HWTs are accessed only from the "owning" thread. Any
>> interaction any other thread has with a particular lcore's timer
>> wheel goes over a set of DPDK rings. A side-effect of this design is
>> that all operations working toward a "remote" HWT must be
>> asynchronous.
>>
>> The <rte_htimer.h> API is available only to EAL threads and registered
>> non-EAL threads.
>>
>> The htimer API allows the application to supply the current time,
>> useful in case it already has retrieved this for other purposes,
>> saving the cost of a rdtsc instruction (or its equivalent).
>>
>> Relative htimers do not retrieve a new time, but reuse the current
>> time (as known via/at-the-time of the manage-call), again to shave off
>> some cycles of overhead.
>
> I have a comment to the two points above.
>
> I agree that the application should supply the current time.
>
> This should be the concept throughout the library. I don't understand why TSC is used in the library at all?
>
> Please use a unit-less tick, and let the application decide what one tick means.
>
I suspect the design of rte_htimer_mgr.h (and rte_timer.h) makes more
sense if you think of the user of the API as not just a "monolithic"
application, but rather a set of different modules, developed by
different organizations, and reused across a set of applications. The
idea behind the API design is they should all be able to share one timer
service instance.
The different parts of the application and any future DPDK platform
modules that use the htimer service need to agree on what a tick means in
terms of actual wall-time, if it's not mandated by the API.
There might be room for module-specific timer wheels as well, with
different resolution or other characteristics. The event timer adapter's
use of a timer wheel could be one example (although I'm not sure it is).
If timer-wheel-as-a-private-lego-piece is also a valid use case, then
one could consider making the <rte_htw.h> API public as well. That is what
I think you are asking for here: a generic timer wheel that doesn't know
anything about time sources, time source time -> tick conversion, or
time source time -> monotonic wall time conversion, and maybe is also
not bound to a particular thread.
I picked TSC because it seemed like a good "universal time unit" for
DPDK. rdtsc (and its equivalents) is also very precise (especially on
x86) and cheap to retrieve (especially on ARM, from what I understand).
That said, at the moment, I'm leaning toward nanoseconds (uint64_t
format) should be the default for timer expiration time instead of TSC.
TSC could still be an option for passing the current time, since TSC
will be a common time source, and it shaves off one conversion.
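As a sketch of what such a conversion could look like (a hypothetical helper, not part of the patch; in a real application tsc_hz would come from rte_get_tsc_hz()):

```c
#include <stdint.h>

/*
 * Hypothetical TSC -> nanosecond conversion. Splitting into whole
 * seconds plus a remainder avoids 64-bit overflow for large cycle
 * counts; the remainder term stays safe for TSC rates up to ~18 GHz.
 */
static uint64_t tsc_to_ns(uint64_t tsc, uint64_t tsc_hz)
{
	uint64_t secs = tsc / tsc_hz;
	uint64_t rem = tsc % tsc_hz;

	return secs * UINT64_C(1000000000) +
	    rem * UINT64_C(1000000000) / tsc_hz;
}
```

The division here is the kind of cost the libdivide/<rte_reciprocal.h> question later in the thread is about.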
> A unit-less tick will also let the application instantiate a HTW with higher resolution than the TSC. (E.g. think about oversampling in audio processing, or Bresenham's line drawing algorithm for 2D visuals - oversampling can sound and look better.)
>
> For reference (supporting my suggestion), the dynamic timestamp field in the rte_mbuf structure is also defined as being unit-less. (I think NVIDIA implements it as nanoseconds, but that's an implementation specific choice.)
>
>>
>> A semantic improvement compared to the <rte_timer.h> API is that the
>> htimer library can give a definite answer on the question if the timer
>> expiry callback was called, after a timer has been canceled.
>>
>> Below is performance data from DPDK's 'app/test' micro benchmarks,
>> using 10k concurrent timers. The benchmarks (test_timer_perf.c and
>> test_htimer_mgr_perf.c) aren't identical in their structure, but the
>> numbers give some indication of the difference.
>>
>> Use case htimer timer
>> ------------------------------------
>> Add timer 28 253
>> Cancel timer 10 412
>> Async add (source lcore) 64
>> Async add (target lcore) 13
>>
>> (AMD 5900X CPU. Time in TSC.)
>>
>> Prototype integration of the htimer library into real, timer-heavy,
>> applications indicates that htimer may result in significant
>> application-level performance gains.
>>
>> The bitset implementation which the HWT implementation depends upon
>> seemed generic-enough and potentially useful outside the world of
>> HWTs, to justify being located in the EAL.
>>
>> This patchset is very much an RFC, and the author is yet to form an
>> opinion on many important issues.
>>
>> * If deemed a suitable replacement, should the htimer replace the
>> current DPDK timer library in some particular (ABI-breaking)
>> release, or should it live side-by-side with the then-legacy
>> <rte_timer.h> API? A lot of things in and outside DPDK depend on
>> <rte_timer.h>, so coexistence may be required to facilitate a smooth
>> transition.
>
> It's my immediate impression that they are totally different in both design philosophy and API.
>
> Personal opinion: I would call it an entirely different library.
>
>>
>> * Should the htimer and htw-related files be colocated with rte_timer.c
>> in the timer library?
>
> Personal opinion: No. This is an entirely different library, and should live for itself in a directory of its own.
>
>>
>> * Would it be useful for applications using asynchronous cancel to
>> have the option of having the timer callback run not only in case of
>> timer expiration, but also cancellation (on the target lcore)? The
>> timer cb signature would need to include an additional parameter in
>> that case.
>
> If one thread cancels something in another thread, some synchronization between the threads is going to be required anyway. So we could rephrase your question: Will the burden of the otherwise required synchronization between the two threads be significantly reduced if the library has the ability to run the callback on asynchronous cancel?
>
Yes.
Intuitively, it seems convenient that if you hand off a timer to a
different lcore, the timer callback will be called exactly once,
regardless if the timer was canceled or expired.
But, as you indicate, you may still need synchronization to solve the
resource reclamation issue.
> Is such a feature mostly "Must have" or "Nice to have"?
>
> More thoughts in this area...
>
> If adding an additional callback parameter, it could be an enum, so the callback could be expanded to support "timeout (a.k.a. timer fired)", "cancel" and more events we have not yet come up with, e.g. "early kick".
>
Yes, or an int.
> Here's an idea off the top of my head: An additional callback parameter has a (small) performance cost incurred with every timer fired (which is a very large multiplier). It might not be required. As an alternative to an "what happened" parameter to the callback, the callback could investigate the state of the object for which the timer fired, and draw its own conclusion on how to proceed. Obviously, this also has a performance cost, but perhaps the callback works on the object's state anyway, making this cost insignificant.
>
It's not obvious to me that you, in the timer callback, can determine
what happened, if the same callback is called both in the cancel and the
expired case.
The cost of an extra integer passed in a register (or checking a flag,
if the timer callback should be called at all at cancellation) is not
the concern for me; it's the extra bit of API complexity.
> Here's another alternative to adding a "what happened" parameter to the callback:
>
> The rte_htimer could have one more callback pointer, which (if set) will be called on cancellation of the timer.
>
This will grow the timer struct by 16 bytes.
>>
>> * Should the rte_htimer be a nested struct, so the htw parts be separated
>> from the htimer parts?
>>
>> * <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
>> <rte_htw.h> may avoid a dependency on <rte_htimer_mgr.h>. Should it
>> be so?
>>
>> * rte_htimer struct is only supposed to be used by the application to
>> give an indication of how much memory it needs to allocate, and is
>> its member are not supposed to be directly accessed (w/ the possible
>> exception of the owner_lcore_id field). Should there be a dummy
>> struct, or a #define RTE_HTIMER_MEMSIZE or a rte_htimer_get_memsize()
>> function instead, serving the same purpose? Better encapsulation,
>> but more inconvenient for applications. Run-time dynamic sizing
>> would force application-level dynamic allocations.
>>
>> * Asynchronous cancellation is a little tricky to use for the
>> application (primarily due to timer memory reclamation/race
>> issues). Should this functionality be removed?
>>
>> * Should rte_htimer_mgr_init() also retrieve the current time? If so,
>> there should to be a variant which allows the user to specify the
>> time (to match rte_htimer_mgr_manage_time()). One pitfall with the
>> current proposed API is an application calling rte_htimer_mgr_init()
>> and then immediately adding a timer with a relative timeout, in
>> which case the current absolute time used is 0, which might be a
>> surprise.
>>
>> * Should libdivide (optionally) be used to avoid the div in the TSC ->
>> tick conversion? (Doesn't improve performance on Zen 3, but may
>> do on other CPUs.) Consider <rte_reciprocal.h> as well.
>>
>> * Should the TSC-per-tick be rounded up to a power of 2, so shifts can be
>> used for conversion? Very minor performance gains to be found there,
>> at least on Zen 3 cores.
>>
>> * Should it be possible to supply the time in rte_htimer_mgr_add()
>> and/or rte_htimer_mgr_manage_time() functions as ticks, rather than
>> as TSC? Should it be possible to also use nanoseconds?
>> rte_htimer_mgr_manage_time() would need a flags parameter in that
>> case.
>
> Do not use TSC anywhere in this library. Let the application decide the meaning of a tick.
>
>>
>> * Would the event timer adapter be best off using <rte_htw.h>
>> directly, or <rte_htimer.h>? In the latter case, there needs to be a
>> way to instantiate more HWTs (similar to the "alt" functions of
>> <rte_timer.h>)?
>>
>> * Should the PERIODICAL flag (and the complexity it brings) be
>> removed? And leave the application with only single-shot timers, and
>> the option to re-add them in the timer callback.
>
> First thought: Yes, keep it lean and remove the periodical stuff.
>
> Second thought: This needs a more detailed analysis.
>
> From one angle:
>
> How many PERIODICAL versus ONESHOT timers do we expect?
>
I suspect you should be prepared for the ratio being anything.
> Intuitively, I would use this library for ONESHOT timers, and perhaps implement my periodical timers by other means.
>
> If the PERIODICAL:ONESHOT ratio is low, we can probably live with the extra cost of cancel+add for a few periodical timers.
>
> From another angle:
>
> What is the performance gain with the PERIODICAL flag?
>
None, pretty much. It's just there for convenience.
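As a sketch of the re-add-in-the-callback pattern the reply alludes to (a generic, standalone simulation with made-up names, not the rte_htimer API):

```c
#include <stdint.h>

/* Hypothetical one-shot timer record; names are illustrative only. */
struct oneshot {
	uint64_t expiry; /* absolute expiry time */
	uint64_t period;
	int fires;       /* how many times the timer has fired */
};

/* The "callback": count the expiry and re-arm for the next period. */
static void on_expiry(struct oneshot *t)
{
	t->fires++;
	t->expiry += t->period; /* re-add with a new absolute expiry */
}

/* Advance time to 'now', firing the timer as often as needed. */
static void run_until(struct oneshot *t, uint64_t now)
{
	while (t->expiry <= now)
		on_expiry(t);
}
```

Re-arming from the old absolute expiry (rather than from "now") is what keeps per-fire processing latency from accumulating as period drift.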
> Without a periodical timer, cancel+add costs 10+28 cycles. How many cycles would a "move" function, performing both cancel and add, use?
>
> And then compare that to the cost (in cycles) of repeating a timer with PERIODICAL?
>
> Furthermore, not having the PERIODICAL flag probably improves the performance for non-periodical timers. How many cycles could we gain here?
>
>
> Another, vaguely related, idea:
>
> The callback pointer might not need to be stored per rte_htimer, but could instead be common for the rte_htw.
>
Do you mean rte_htw, or rte_htimer_mgr?
If you make one common callback, all the different parts of the
application need to be coordinated (in a big switch-statement, or
something of that sort), or have some convention for using an
application-specific wrapper structure (accessed via container_of()).
This is a problem if the timer service API consumer is a set of largely
uncoordinated software modules.
Btw, the eventdev API has the same issue, and the proposed event
dispatcher is one way to help facilitate application-internal decoupling.
For a module-private rte_htw instance your suggestion may work, but not
for <rte_htimer_mgr.h>.
> When a timer fires, the callback probably needs to check/update the state of the object for which the timer fired anyway, so why not just let the application use that state to determine the appropriate action. This might provide some performance benefit.
>
> It might complicate using one HTW for multiple different purposes, though. Probably a useless idea, but I wanted to share the idea anyway. It might trigger other, better ideas in the community.
>
>>
>> * Should the async result codes and the sync cancel error codes be merged
>> into one set of result codes?
>>
>> * Should the rte_htimer_mgr_async_add() have a flag which allow
>> buffering add request messages until rte_htimer_mgr_process() is
>> called? Or any manage function. Would reduce ring signaling overhead
>> (i.e., burst enqueue operations instead of single-element
>> enqueue). Could also be a rte_htimer_mgr_async_add_burst() function,
>> solving the same "problem" a different way. (The signature of such
>> a function would not be pretty.)
>>
>> * Does the functionality provided by the rte_htimer_mgr_process()
>> function match its use cases? Should there be a clearer
>> separation between expiry processing and asynchronous operation
>> processing?
>>
>> * Should the patchset be split into more commits? If so, how?
>>
>> Thanks to Erik Carrillo for his assistance.
>>
>> Mattias Rönnblom (2):
>> eal: add bitset type
>> eal: add high-performance timer facility
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [RFC 0/2] Add high-performance timer facility
2023-03-01 11:18 ` Mattias Rönnblom
@ 2023-03-01 13:31 ` Morten Brørup
2023-03-01 15:50 ` Mattias Rönnblom
0 siblings, 1 reply; 31+ messages in thread
From: Morten Brørup @ 2023-03-01 13:31 UTC (permalink / raw)
To: Mattias Rönnblom, dev
Cc: Erik Gabriel Carrillo, David Marchand, Maria Lingemark, Stefan Sundkvist
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, 1 March 2023 12.18
>
> On 2023-02-28 17:01, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Tuesday, 28 February 2023 10.39
> >
> > I have been looking for a high performance timer library (for use in a fast
> path TCP stack), and this looks very useful, Mattias.
> >
> > My initial feedback is based on quickly skimming the patch source code, and
> reading this cover letter.
> >
> >>
> >> This patchset is an attempt to introduce a high-performance, highly
> >> scalable timer facility into DPDK.
> >>
> >> More specifically, the goals for the htimer library are:
> >>
> >> * Efficient handling of a handful up to hundreds of thousands of
> >> concurrent timers.
> >> * Reduced overhead of adding and canceling timers.
> >> * Provide a service functionally equivalent to that of
> >> <rte_timer.h>. API/ABI backward compatibility is secondary.
> >>
> >> In the author's opinion, there are two main shortcomings with the
> >> current DPDK timer library (i.e., rte_timer.[ch]).
> >>
> >> One is the synchronization overhead, where heavy-weight full-barrier
> >> type synchronization is used. rte_timer.c uses per-EAL/lcore skip
> >> lists, but any thread may add or cancel (or otherwise access) timers
> >> managed by another lcore (and thus resides in its timer skip list).
> >>
> >> The other is an algorithmic shortcoming, with rte_timer.c's reliance
> >> on a skip list, which, seemingly, is less efficient than certain
> >> alternatives.
> >>
> >> This patchset implements a hierarchical timer wheel (HWT, in
> >
> > Typo: HWT or HTW?
>
> Yes. I don't understand how I managed to make so many HTW ->
> HWT typos. At least I got the filenames (rte_htw.[ch]) correct.
>
> >
> >> rte_htw.c), as per the Varghese and Lauck paper "Hashed and
> >> Hierarchical Timing Wheels: Data Structures for the Efficient
> >> Implementation of a Timer Facility". A HWT is a data structure
> >> purposely designed for this task, and used by many operating system
> >> kernel timer facilities.
> >>
> >> To further improve the solution described by Varghese and Lauck, a
> >> bitset is placed in front of each of the timer wheel in the HWT,
> >> reducing overhead of rte_htimer_mgr_manage() (i.e., progressing time
> >> and expiry processing).
> >>
> >> Cycle-efficient scanning and manipulation of these bitsets are crucial
> >> for the HWT's performance.
> >>
> >> The htimer module keeps a per-lcore (or per-registered EAL thread) HWT
> >> instance, much like rte_timer.c keeps a per-lcore skip list.
> >>
> >> To avoid expensive synchronization overhead for thread-local timer
> >> management, the HWTs are accessed only from the "owning" thread. Any
> >> interaction any other thread has with a particular lcore's timer
> >> wheel goes over a set of DPDK rings. A side-effect of this design is
> >> that all operations working toward a "remote" HWT must be
> >> asynchronous.
> >>
> >> The <rte_htimer.h> API is available only to EAL threads and registered
> >> non-EAL threads.
> >>
> >> The htimer API allows the application to supply the current time,
> >> useful in case it already has retrieved this for other purposes,
> >> saving the cost of a rdtsc instruction (or its equivalent).
> >>
> >> Relative htimers do not retrieve a new time, but reuse the current
> >> time (as known via/at-the-time of the manage-call), again to shave off
> >> some cycles of overhead.
> >
> > I have a comment to the two points above.
> >
> > I agree that the application should supply the current time.
> >
> > This should be the concept throughout the library. I don't understand why
> TSC is used in the library at all?
> >
> > Please use a unit-less tick, and let the application decide what one tick
> means.
> >
>
> I suspect the design of rte_htimer_mgr.h (and rte_timer.h) makes more
> sense if you think of the user of the API as not just a "monolithic"
> application, but rather a set of different modules, developed by
> different organizations, and reused across a set of applications. The
> idea behind the API design is that they should all be able to share one timer
> service instance.
>
> The different parts of the application and any future DPDK platform
> modules that use the htimer service need to agree on what a tick means in
> terms of actual wall-time, if it's not mandated by the API.
I see. Then those non-monolithic applications can agree that the unit of time is nanoseconds, or whatever makes sense for those applications. And then they can instantiate one shared HTW for that purpose.
There is no need to impose such an API limit on other users of the library.
>
> There might be room for module-specific timer wheels as well, with
> different resolution or other characteristics. The event timer adapter's
> use of a timer wheel could be one example (although I'm not sure it is).
We are not using the event device, and I have not looked into it, so I have no qualified comments to this.
>
> If timer-wheel-as-a-private-lego-piece is also a valid use case, then
> one could consider making the <rte_htw.h> API public as well. That is what
> I think you are asking for here: a generic timer wheel that doesn't know
> anything about time sources, time source time -> tick conversion, or
> time source time -> monotonic wall time conversion, and maybe is also
> not bound to a particular thread.
Yes, that is what I had been searching the Internet for.
(I'm not sure what you mean by "not bound to a particular thread". Your per-thread design seems good to me.)
I don't want more stuff in the EAL. What I want is high-performance DPDK libraries we can use in our applications.
>
> I picked TSC because it seemed like a good "universal time unit" for
> DPDK. rdtsc (and its equivalents) is also very precise (especially on
> x86) and cheap to retrieve (especially on ARM, from what I understand).
The TSC does have excellent performance, but on all other parameters it is a horrible timekeeper: the measurement unit depends on the underlying hardware, the TSC drifts depending on temperature, it cannot be PTP-synchronized; the list is endless!
>
> That said, at the moment, I'm leaning toward nanoseconds (uint64_t
> format) should be the default for timer expiration time instead of TSC.
> TSC could still be an option for passing the current time, since TSC
> will be a common time source, and it shaves off one conversion.
There are many reasons why nanoseconds is a much better choice than TSC.
>
> > A unit-less tick will also let the application instantiate a HTW with higher
> resolution than the TSC. (E.g. think about oversampling in audio processing,
> or Bresenham's line drawing algorithm for 2D visuals - oversampling can sound
> and look better.)
Some of the timing data in our application have a resolution orders of magnitude higher than one nanosecond. If we combined that with a HTW library with nanosecond resolution, we would need to keep these timer values in two locations: The original high-res timer in our data structure, and the shadow low-res (nanosecond) timer in the HTW.
We might also need to frequently update the HTW timers to prevent drifting away from the high-res timers. E.g. 1.2 + 1.2 is still 2 when rounded, but adding another 1.2 gives 3, when it should have been 4 (3 * 1.2 = 3.6, which rounds to 4). This level of drifting would also make periodic timers in the HTW useless.
Please note: I haven't really considered merging the high-res timing in our application with this HTW, and I'm also not saying that PERIODIC timers in the HTW are required or even useful for our application. I'm only providing arguments for a unit-less time!
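The rounding-drift effect described above can be demonstrated with integer arithmetic, using tenths of a unit so 12 means 1.2 (a standalone illustration, not library code):

```c
/* Round-half-up of a value expressed in tenths: 12 -> 1, 36 -> 4. */
static int round_tenths(int tenths)
{
	return (tenths + 5) / 10;
}

/*
 * Accumulate a 1.2-unit period n times, rounding each step (what a
 * coarse-granularity HTW would do), vs. rounding the true product
 * once (what the application's high-res clock knows).
 */
static void drift(int n, int *per_step, int *true_rounded)
{
	int i, sum = 0;

	for (i = 0; i < n; i++)
		sum += round_tenths(12); /* each step rounds 1.2 -> 1 */

	*per_step = sum;
	*true_rounded = round_tenths(n * 12);
}
```

After two periods both views agree on 2, but after three the per-step sum is 3 while the true elapsed time (3.6) rounds to 4, matching the example in the text.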
> >
> > For reference (supporting my suggestion), the dynamic timestamp field in the
> rte_mbuf structure is also defined as being unit-less. (I think NVIDIA
> implements it as nanoseconds, but that's an implementation specific choice.)
> >
> >>
> >> A semantic improvement compared to the <rte_timer.h> API is that the
> >> htimer library can give a definite answer on the question if the timer
> >> expiry callback was called, after a timer has been canceled.
> >>
> >> Below is performance data from DPDK's 'app/test' micro benchmarks,
> >> using 10k concurrent timers. The benchmarks (test_timer_perf.c and
> >> test_htimer_mgr_perf.c) aren't identical in their structure, but the
> >> numbers give some indication of the difference.
> >>
> >> Use case htimer timer
> >> ------------------------------------
> >> Add timer 28 253
> >> Cancel timer 10 412
> >> Async add (source lcore) 64
> >> Async add (target lcore) 13
> >>
> >> (AMD 5900X CPU. Time in TSC.)
> >>
> >> Prototype integration of the htimer library into real, timer-heavy,
> >> applications indicates that htimer may result in significant
> >> application-level performance gains.
> >>
> >> The bitset implementation which the HWT implementation depends upon
> >> seemed generic-enough and potentially useful outside the world of
> >> HWTs, to justify being located in the EAL.
> >>
> >> This patchset is very much an RFC, and the author is yet to form an
> >> opinion on many important issues.
> >>
> >> * If deemed a suitable replacement, should the htimer replace the
> >> current DPDK timer library in some particular (ABI-breaking)
> >> release, or should it live side-by-side with the then-legacy
> >> <rte_timer.h> API? A lot of things in and outside DPDK depend on
> >> <rte_timer.h>, so coexistence may be required to facilitate a smooth
> >> transition.
> >
> > It's my immediate impression that they are totally different in both design
> philosophy and API.
> >
> > Personal opinion: I would call it an entirely different library.
> >
> >>
> >> * Should the htimer and htw-related files be colocated with rte_timer.c
> >> in the timer library?
> >
> > Personal opinion: No. This is an entirely different library, and should live
> for itself in a directory of its own.
> >
> >>
> >> * Would it be useful for applications using asynchronous cancel to
> >> have the option of having the timer callback run not only in case of
> >> timer expiration, but also cancellation (on the target lcore)? The
> >> timer cb signature would need to include an additional parameter in
> >> that case.
> >
> > If one thread cancels something in another thread, some synchronization
> between the threads is going to be required anyway. So we could rephrase your
> question: Will the burden of the otherwise required synchronization between
> the two threads be significantly reduced if the library has the ability to run
> the callback on asynchronous cancel?
> >
>
> Yes.
>
> Intuitively, it seems convenient that if you hand off a timer to a
> different lcore, the timer callback will be called exactly once,
> regardless of whether the timer was canceled or expired.
>
> But, as you indicate, you may still need synchronization to solve the
> resource reclamation issue.
>
> > Is such a feature mostly "Must have" or "Nice to have"?
> >
> > More thoughts in this area...
> >
> > If adding an additional callback parameter, it could be an enum, so the
> callback could be expanded to support "timeout (a.k.a. timer fired)", "cancel"
> and more events we have not yet come up with, e.g. "early kick".
> >
>
> Yes, or an int.
>
> > Here's an idea off the top of my head: An additional callback parameter has
> a (small) performance cost incurred with every timer fired (which is a very
> large multiplier). It might not be required. As an alternative to a "what
> happened" parameter to the callback, the callback could investigate the state
> of the object for which the timer fired, and draw its own conclusion on how to
> proceed. Obviously, this also has a performance cost, but perhaps the callback
> works on the object's state anyway, making this cost insignificant.
> >
>
> It's not obvious to me that you, in the timer callback, can determine
> what happened, if the same callback is called both in the cancel and the
> expired case.
>
> It's not the cost of an extra integer passed in a register (or of
> checking a flag to see if the timer callback should be called at all at
> cancellation) that is the concern for me; it's the extra bit of API
> complexity.
Then introduce the library without this feature. More features can be added later.
The library will be introduced as "experimental", so we are free to improve it and modify the ABI along the way.
>
> > Here's another alternative to adding a "what happened" parameter to the
> callback:
> >
> > The rte_htimer could have one more callback pointer, which (if set) will be
> called on cancellation of the timer.
> >
>
> This will grow the timer struct by 16 bytes.
If the rte_htimer struct stays within one cache line, it should be acceptable.
On the other hand, this approach is less generic than passing an additional parameter. (E.g. add yet another callback pointer for "early kick"?)
BTW, async cancel is a form of inter-thread communication. Does this library really need to provide any inter-thread communication mechanisms? Doesn't an inter-thread communication mechanism belong in a separate library?
>
> >>
> >> * Should the rte_htimer be a nested struct, so the htw parts be separated
> >> from the htimer parts?
> >>
> >> * <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
> >> <rte_htw.h> may avoid a dependency on <rte_htimer_mgr.h>. Should it
> >> be so?
> >>
> >> * rte_htimer struct is only supposed to be used by the application to
> >> give an indication of how much memory it needs to allocate, and
> >> its members are not supposed to be directly accessed (w/ the possible
> >> exception of the owner_lcore_id field). Should there be a dummy
> >> struct, or a #define RTE_HTIMER_MEMSIZE or a rte_htimer_get_memsize()
> >> function instead, serving the same purpose? Better encapsulation,
> >> but more inconvenient for applications. Run-time dynamic sizing
> >> would force application-level dynamic allocations.
> >>
> >> * Asynchronous cancellation is a little tricky to use for the
> >> application (primarily due to timer memory reclamation/race
> >> issues). Should this functionality be removed?
> >>
> >> * Should rte_htimer_mgr_init() also retrieve the current time? If so,
> >> there should be a variant which allows the user to specify the
> >> time (to match rte_htimer_mgr_manage_time()). One pitfall with the
> >> current proposed API is an application calling rte_htimer_mgr_init()
> >> and then immediately adding a timer with a relative timeout, in
> >> which case the current absolute time used is 0, which might be a
> >> surprise.
> >>
> >> * Should libdivide (optionally) be used to avoid the div in the TSC ->
> >> tick conversion? (Doesn't improve performance on Zen 3, but may
> >> do on other CPUs.) Consider <rte_reciprocal.h> as well.
> >>
> >> * Should the TSC-per-tick be rounded up to a power of 2, so shifts can be
> >> used for conversion? Very minor performance gains to be found there,
> >> at least on Zen 3 cores.
> >>
> >> * Should it be possible to supply the time in rte_htimer_mgr_add()
> >> and/or rte_htimer_mgr_manage_time() functions as ticks, rather than
> >> as TSC? Should it be possible to also use nanoseconds?
> >> rte_htimer_mgr_manage_time() would need a flags parameter in that
> >> case.
> >
> > Do not use TSC anywhere in this library. Let the application decide the
> meaning of a tick.
> >
> >>
> >> * Would the event timer adapter be best off using <rte_htw.h>
> >> directly, or <rte_htimer.h>? In the latter case, there needs to be a
> >> way to instantiate more HWTs (similar to the "alt" functions of
> >> <rte_timer.h>)?
> >>
> >> * Should the PERIODICAL flag (and the complexity it brings) be
> >> removed? And leave the application with only single-shot timers, and
> >> the option to re-add them in the timer callback.
> >
> > First thought: Yes, keep it lean and remove the periodical stuff.
> >
> > Second thought: This needs a more detailed analysis.
> >
> > From one angle:
> >
> > How many PERIODICAL versus ONESHOT timers do we expect?
> >
>
> I suspect you should be prepared for the ratio being anything.
In theory, anything is possible. But I'm asking that we consider realistic use cases.
>
> > Intuitively, I would use this library for ONESHOT timers, and perhaps
> implement my periodical timers by other means.
> >
> > If the PERIODICAL:ONESHOT ratio is low, we can probably live with the extra
> cost of cancel+add for a few periodical timers.
> >
> > From another angle:
> >
> > What is the performance gain with the PERIODICAL flag?
> >
>
> None, pretty much. It's just there for convenience.
OK, then I suggest that you remove it, unless you get objections.
The library can be expanded with useful features at any time later. Useless features are (nearly) impossible to remove, once they are in there - they are just "technical debt" with associated maintenance costs, added complexity weaving into other features, etc..
>
> > Without a periodical timer, cancel+add costs 10+28 cycles. How many cycles
> would a "move" function, performing both cancel and add, use?
> >
> > And then compare that to the cost (in cycles) of repeating a timer with
> PERIODICAL?
> >
> > Furthermore, not having the PERIODICAL flag probably improves the
> performance for non-periodical timers. How many cycles could we gain here?
> >
> >
> > Another, vaguely related, idea:
> >
> > The callback pointer might not need to be stored per rte_htimer, but could
> instead be common for the rte_htw.
> >
>
> Do you mean rte_htw, or rte_htimer_mgr?
>
> If you make one common callback, all the different parts of the
> application need to be coordinated (in a big switch-statement, or
> something of that sort), or have some convention for using an
> application-specific wrapper structure (accessed via container_of()).
>
> This is a problem if the timer service API consumer is a set of largely
> uncoordinated software modules.
>
> Btw, the eventdev API has the same issue, and the proposed event
> dispatcher is one way to help facilitate application-internal decoupling.
>
> For a module-private rte_htw instance your suggestion may work, but not
> for <rte_htimer_mgr.h>.
I was speculating that a common callback pointer might provide a performance benefit for single-purpose HTW instances. (The same concept applies if there are multiple callbacks, e.g. a "Timer Fired", a "Timer Cancelled", and an "Early Kick" callback pointer - i.e. having the callback pointers per HTW instance, instead of per timer.)
>
> > When a timer fires, the callback probably needs to check/update the state of
> the object for which the timer fired anyway, so why not just let the
> application use that state to determine the appropriate action. This might
> provide some performance benefit.
> >
> > It might complicate using one HTW for multiple different purposes, though.
> Probably a useless idea, but I wanted to share the idea anyway. It might
> trigger other, better ideas in the community.
> >
> >>
> >> * Should the async result codes and the sync cancel error codes be merged
> >> into one set of result codes?
> >>
> >> * Should the rte_htimer_mgr_async_add() have a flag which allows
> >> buffering add request messages until rte_htimer_mgr_process() is
> >> called? Or any manage function. Would reduce ring signaling overhead
> >> (i.e., burst enqueue operations instead of single-element
> >> enqueue). Could also be a rte_htimer_mgr_async_add_burst() function,
> >> solving the same "problem" a different way. (The signature of such
> >> a function would not be pretty.)
> >>
> >> * Does the functionality provided by the rte_htimer_mgr_process()
> >> function match its use cases? Should there be a clearer
> >> separation between expiry processing and asynchronous operation
> >> processing?
> >>
> >> * Should the patchset be split into more commits? If so, how?
> >>
> >> Thanks to Erik Carrillo for his assistance.
> >>
> >> Mattias Rönnblom (2):
> >> eal: add bitset type
> >> eal: add high-performance timer facility
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC 0/2] Add high-performance timer facility
2023-03-01 13:31 ` Morten Brørup
@ 2023-03-01 15:50 ` Mattias Rönnblom
2023-03-01 17:06 ` Morten Brørup
0 siblings, 1 reply; 31+ messages in thread
From: Mattias Rönnblom @ 2023-03-01 15:50 UTC (permalink / raw)
To: Morten Brørup, dev
Cc: Erik Gabriel Carrillo, David Marchand, Maria Lingemark, Stefan Sundkvist
On 2023-03-01 14:31, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 1 March 2023 12.18
>>
>> On 2023-02-28 17:01, Morten Brørup wrote:
>>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>>>> Sent: Tuesday, 28 February 2023 10.39
>>>
>>> I have been looking for a high performance timer library (for use in a fast
>> path TCP stack), and this looks very useful, Mattias.
>>>
>>> My initial feedback is based on quickly skimming the patch source code, and
>> reading this cover letter.
>>>
>>>>
>>>> This patchset is an attempt to introduce a high-performance, highly
>>>> scalable timer facility into DPDK.
>>>>
>>>> More specifically, the goals for the htimer library are:
>>>>
>>>> * Efficient handling of a handful up to hundreds of thousands of
>>>> concurrent timers.
>>>> * Reduced overhead of adding and canceling timers.
>>>> * Provide a service functionally equivalent to that of
>>>> <rte_timer.h>. API/ABI backward compatibility is secondary.
>>>>
>>>> In the author's opinion, there are two main shortcomings with the
>>>> current DPDK timer library (i.e., rte_timer.[ch]).
>>>>
>>>> One is the synchronization overhead, where heavy-weight full-barrier
>>>> type synchronization is used. rte_timer.c uses per-EAL/lcore skip
>>>> lists, but any thread may add or cancel (or otherwise access) timers
>>>> managed by another lcore (and thus resides in its timer skip list).
>>>>
>>>> The other is an algorithmic shortcoming, with rte_timer.c's reliance
>>>> on a skip list, which, seemingly, is less efficient than certain
>>>> alternatives.
>>>>
>>>> This patchset implements a hierarchical timer wheel (HWT, in
>>>
>>> Typo: HWT or HTW?
>>
> >> Yes. I don't understand how I could have managed to make so many HTW ->
>> HWT typos. At least I got the filenames (rte_htw.[ch]) correct.
>>
>>>
>>>> rte_htw.c), as per the Varghese and Lauck paper "Hashed and
>>>> Hierarchical Timing Wheels: Data Structures for the Efficient
>>>> Implementation of a Timer Facility". A HWT is a data structure
> >>>> purposely designed for this task, and used by many operating system
>>>> kernel timer facilities.
>>>>
>>>> To further improve the solution described by Varghese and Lauck, a
> >>>> bitset is placed in front of each of the timer wheels in the HWT,
>>>> reducing overhead of rte_htimer_mgr_manage() (i.e., progressing time
>>>> and expiry processing).
>>>>
>>>> Cycle-efficient scanning and manipulation of these bitsets are crucial
>>>> for the HWT's performance.
>>>>
>>>> The htimer module keeps a per-lcore (or per-registered EAL thread) HWT
>>>> instance, much like rte_timer.c keeps a per-lcore skip list.
>>>>
>>>> To avoid expensive synchronization overhead for thread-local timer
>>>> management, the HWTs are accessed only from the "owning" thread. Any
>>>> interaction any other thread has with a particular lcore's timer
>>>> wheel goes over a set of DPDK rings. A side-effect of this design is
>>>> that all operations working toward a "remote" HWT must be
>>>> asynchronous.
>>>>
>>>> The <rte_htimer.h> API is available only to EAL threads and registered
>>>> non-EAL threads.
>>>>
>>>> The htimer API allows the application to supply the current time,
>>>> useful in case it already has retrieved this for other purposes,
>>>> saving the cost of a rdtsc instruction (or its equivalent).
>>>>
> >>>> A relative htimer does not retrieve a new time, but reuses the current
>>>> time (as known via/at-the-time of the manage-call), again to shave off
>>>> some cycles of overhead.
>>>
>>> I have a comment to the two points above.
>>>
>>> I agree that the application should supply the current time.
>>>
>>> This should be the concept throughout the library. I don't understand why
>> TSC is used in the library at all?
>>>
>>> Please use a unit-less tick, and let the application decide what one tick
>> means.
>>>
>>
>> I suspect the design of rte_htimer_mgr.h (and rte_timer.h) makes more
>> sense if you think of the user of the API as not just a "monolithic"
>> application, but rather a set of different modules, developed by
>> different organizations, and reused across a set of applications. The
>> idea behind the API design is they should all be able to share one timer
>> service instance.
>>
>> The different parts of the application and any future DPDK platform
> >> modules that use the htimer service need to agree on what a tick means in
>> terms of actual wall-time, if it's not mandated by the API.
>
> I see. Then those non-monolithic applications can agree that the unit of time is nanoseconds, or whatever makes sense for those applications. And then they can instantiate one shared HTW for that purpose.
>
<rte_htimer_mgr.h> contains nothing but shared HTWs.
> There is no need to impose such an API limit on other users of the library.
>
>>
>> There might be room for module-specific timer wheels as well, with
>> different resolution or other characteristics. The event timer adapter's
>> use of a timer wheel could be one example (although I'm not sure it is).
>
> We are not using the event device, and I have not looked into it, so I have no qualified comments to this.
>
>>
>> If timer-wheel-as-a-private-lego-piece is also a valid use case, then
> >> one could consider making the <rte_htw.h> API public as well. That is what
> >> I think you are asking for here: a generic timer wheel that doesn't know
> >> anything about time sources, time source time -> tick conversion, or
> >> time source time -> monotonic wall time conversion, and maybe is also
>> not bound to a particular thread.
>
> Yes, that is what I had been searching the Internet for.
>
> (I'm not sure what you mean by "not bound to a particular thread". Your per-thread design seems good to me.)
>
> I don't want more stuff in the EAL. What I want is high-performance DPDK libraries we can use in our applications.
>
>>
>> I picked TSC because it seemed like a good "universal time unit" for
>> DPDK. rdtsc (and its equivalent) is also a very precise (especially on
>> x86) and cheap-to-retrieve (especially on ARM, from what I understand).
>
> The TSC does have excellent performance, but on all other parameters it is a horrible time keeper: The measurement unit depends on the underlying hardware, the TSC drifts depending on temperature, it cannot be PTP synchronized, the list is endless!
>
>>
>> That said, at the moment, I'm leaning toward nanoseconds (uint64_t
>> format) should be the default for timer expiration time instead of TSC.
>> TSC could still be an option for passing the current time, since TSC
>> will be a common time source, and it shaves off one conversion.
>
> There are many reasons why nanoseconds is a much better choice than TSC.
>
>>
>>> A unit-less tick will also let the application instantiate a HTW with higher
>> resolution than the TSC. (E.g. think about oversampling in audio processing,
> >> or Bresenham's line drawing algorithm for 2D visuals - oversampling can sound
>> and look better.)
>
> Some of the timing data in our application have a resolution orders of magnitude higher than one nanosecond. If we combined that with a HTW library with nanosecond resolution, we would need to keep these timer values in two locations: The original high-res timer in our data structure, and the shadow low-res (nanosecond) timer in the HTW.
>
There is no way you will meet timer deadlines with anything approaching
picosecond-level precision. You will also run into a value range issue,
since you will wrap around a 64-bit integer in a matter of days.
The HTW only stores the timeout in ticks, not TSC, nanoseconds or
picoseconds. Generally, you don't want picosecond-level tick
granularity, since it increases the overhead of advancing the wheel(s).
The first (lowest-significance) few wheels will pretty much always be empty.
> We might also need to frequently update the HTW timers to prevent drifting away from the high-res timers. E.g. 1.2 + 1.2 is still 2 when rounded, but adding another 1.2 to the rounded sum gives 3, when it should have been 4 (3 * 1.2 = 3.6, rounded). This level of drifting would also make periodic timers in the HTW useless.
>
Useless, for a certain class of applications. What application would
that be?
> Please note: I haven't really considered merging the high-res timing in our application with this HTW, and I'm also not saying that PERIODIC timers in the HTW are required or even useful for our application. I'm only providing arguments for a unit-less time!
>
>>>
>>> For reference (supporting my suggestion), the dynamic timestamp field in the
>> rte_mbuf structure is also defined as being unit-less. (I think NVIDIA
>> implements it as nanoseconds, but that's an implementation specific choice.)
>>>
>>>>
>>>> A semantic improvement compared to the <rte_timer.h> API is that the
> >>>> htimer library can give a definite answer to the question of whether
> >>>> the timer expiry callback was called, after a timer has been canceled.
>>>>
> >>>> Below is performance data from DPDK's 'app/test' micro benchmarks,
>>>> using 10k concurrent timers. The benchmarks (test_timer_perf.c and
>>>> test_htimer_mgr_perf.c) aren't identical in their structure, but the
>>>> numbers give some indication of the difference.
>>>>
> >>>> Use case                   htimer   timer
> >>>> -------------------------------------------
> >>>> Add timer                      28     253
> >>>> Cancel timer                   10     412
> >>>> Async add (source lcore)       64
> >>>> Async add (target lcore)       13
>>>>
> >>>> (AMD 5900X CPU. Time in TSC cycles.)
>>>>
>>>> Prototype integration of the htimer library into real, timer-heavy,
>>>> applications indicates that htimer may result in significant
>>>> application-level performance gains.
>>>>
>>>> The bitset implementation which the HWT implementation depends upon
>>>> seemed generic-enough and potentially useful outside the world of
>>>> HWTs, to justify being located in the EAL.
>>>>
>>>> This patchset is very much an RFC, and the author is yet to form an
>>>> opinion on many important issues.
>>>>
>>>> * If deemed a suitable replacement, should the htimer replace the
>>>> current DPDK timer library in some particular (ABI-breaking)
>>>> release, or should it live side-by-side with the then-legacy
>>>> <rte_timer.h> API? A lot of things in and outside DPDK depend on
>>>> <rte_timer.h>, so coexistence may be required to facilitate a smooth
>>>> transition.
>>>
>>> It's my immediate impression that they are totally different in both design
>> philosophy and API.
>>>
>>> Personal opinion: I would call it an entirely different library.
>>>
>>>>
>>>> * Should the htimer and htw-related files be colocated with rte_timer.c
>>>> in the timer library?
>>>
>>> Personal opinion: No. This is an entirely different library, and should live
>> for itself in a directory of its own.
>>>
>>>>
>>>> * Would it be useful for applications using asynchronous cancel to
>>>> have the option of having the timer callback run not only in case of
>>>> timer expiration, but also cancellation (on the target lcore)? The
>>>> timer cb signature would need to include an additional parameter in
>>>> that case.
>>>
>>> If one thread cancels something in another thread, some synchronization
> >> between the threads is going to be required anyway. So we could rephrase your
>> question: Will the burden of the otherwise required synchronization between
>> the two threads be significantly reduced if the library has the ability to run
>> the callback on asynchronous cancel?
>>>
>>
>> Yes.
>>
>> Intuitively, it seems convenient that if you hand off a timer to a
>> different lcore, the timer callback will be called exactly once,
> >> regardless of whether the timer was canceled or expired.
>>
>> But, as you indicate, you may still need synchronization to solve the
>> resource reclamation issue.
>>
>>> Is such a feature mostly "Must have" or "Nice to have"?
>>>
>>> More thoughts in this area...
>>>
> >>> If adding an additional callback parameter, it could be an enum, so the
>> callback could be expanded to support "timeout (a.k.a. timer fired)", "cancel"
>> and more events we have not yet come up with, e.g. "early kick".
>>>
>>
>> Yes, or an int.
>>
>>> Here's an idea off the top of my head: An additional callback parameter has
>> a (small) performance cost incurred with every timer fired (which is a very
> >> large multiplier). It might not be required. As an alternative to a "what
>> happened" parameter to the callback, the callback could investigate the state
>> of the object for which the timer fired, and draw its own conclusion on how to
>> proceed. Obviously, this also has a performance cost, but perhaps the callback
>> works on the object's state anyway, making this cost insignificant.
>>>
>>
>> It's not obvious to me that you, in the timer callback, can determine
>> what happened, if the same callback is called both in the cancel and the
>> expired case.
>>
> >> It's not the cost of an extra integer passed in a register (or of
> >> checking a flag to see if the timer callback should be called at all
> >> at cancellation) that is the concern for me; it's the extra bit of
> >> API complexity.
>
> Then introduce the library without this feature. More features can be added later.
>
> The library will be introduced as "experimental", so we are free to improve it and modify the ABI along the way.
>
>>
>>> Here's another alternative to adding a "what happened" parameter to the
>> callback:
>>>
>>> The rte_htimer could have one more callback pointer, which (if set) will be
>> called on cancellation of the timer.
>>>
>>
> >> This will grow the timer struct by 16 bytes.
>
> If the rte_htimer struct stays within one cache line, it should be acceptable.
>
Timer structs are often embedded in other structures, and need not
themselves be cache line aligned (although the "parent" struct may need
to be, e.g. if it's dynamically allocated).
So smaller is better. Just consider if you want your attosecond-level
time stamp in a struct:
struct my_timer {
	uint64_t high_precision_time_high_bits;
	uint64_t high_precision_time_low_bits;
	struct rte_htimer timer;
};
...and you allocate those structs from a mempool. If rte_htimer is small
enough, the whole struct will fit in one cache line.
> On the other hand, this approach is less generic than passing an additional parameter. (E.g. add yet another callback pointer for "early kick"?)
>
> BTW, async cancel is a form of inter-thread communication. Does this library really need to provide any inter-thread communication mechanisms? Doesn't an inter-thread communication mechanism belong in a separate library?
>
Yes, <rte_htimer_mgr.h> needs this because:
1) Being able to schedule timers on a remote lcore is a useful feature
(especially since we don't have much else in terms of deferred work
mechanisms in DPDK).
2) htimer aspires to be a plug-in replacement for <rte_timer.h> (albeit
an ABI-breaking one).
The pure HTW is in rte_htw.[ch].
Plus, with the current design, async operations basically come for free
(if you don't use them), from a performance perspective. The extra
overhead boils down to occasionally polling an empty ring, which is an
inexpensive operation.
>>
>>>>
>>>> * Should the rte_htimer be a nested struct, so the htw parts be separated
>>>> from the htimer parts?
>>>>
>>>> * <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
> >>>> <rte_htw.h> may avoid a dependency on <rte_htimer_mgr.h>. Should it
>>>> be so?
>>>>
>>>> * rte_htimer struct is only supposed to be used by the application to
> >>>> give an indication of how much memory it needs to allocate, and
> >>>> its members are not supposed to be directly accessed (w/ the possible
>>>> exception of the owner_lcore_id field). Should there be a dummy
>>>> struct, or a #define RTE_HTIMER_MEMSIZE or a rte_htimer_get_memsize()
>>>> function instead, serving the same purpose? Better encapsulation,
>>>> but more inconvenient for applications. Run-time dynamic sizing
>>>> would force application-level dynamic allocations.
>>>>
>>>> * Asynchronous cancellation is a little tricky to use for the
>>>> application (primarily due to timer memory reclamation/race
>>>> issues). Should this functionality be removed?
>>>>
>>>> * Should rte_htimer_mgr_init() also retrieve the current time? If so,
> >>>> there should be a variant which allows the user to specify the
>>>> time (to match rte_htimer_mgr_manage_time()). One pitfall with the
>>>> current proposed API is an application calling rte_htimer_mgr_init()
>>>> and then immediately adding a timer with a relative timeout, in
>>>> which case the current absolute time used is 0, which might be a
>>>> surprise.
>>>>
>>>> * Should libdivide (optionally) be used to avoid the div in the TSC ->
>>>> tick conversion? (Doesn't improve performance on Zen 3, but may
>>>> do on other CPUs.) Consider <rte_reciprocal.h> as well.
>>>>
>>>> * Should the TSC-per-tick be rounded up to a power of 2, so shifts can be
>>>> used for conversion? Very minor performance gains to be found there,
>>>> at least on Zen 3 cores.
>>>>
>>>> * Should it be possible to supply the time in rte_htimer_mgr_add()
>>>> and/or rte_htimer_mgr_manage_time() functions as ticks, rather than
>>>> as TSC? Should it be possible to also use nanoseconds?
>>>> rte_htimer_mgr_manage_time() would need a flags parameter in that
>>>> case.
>>>
>>> Do not use TSC anywhere in this library. Let the application decide the
>> meaning of a tick.
>>>
>>>>
>>>> * Would the event timer adapter be best off using <rte_htw.h>
>>>> directly, or <rte_htimer.h>? In the latter case, there needs to be a
>>>> way to instantiate more HWTs (similar to the "alt" functions of
>>>> <rte_timer.h>)?
>>>>
>>>> * Should the PERIODICAL flag (and the complexity it brings) be
>>>> removed? And leave the application with only single-shot timers, and
>>>> the option to re-add them in the timer callback.
>>>
>>> First thought: Yes, keep it lean and remove the periodical stuff.
>>>
>>> Second thought: This needs a more detailed analysis.
>>>
>>> From one angle:
>>>
>>> How many PERIODICAL versus ONESHOT timers do we expect?
>>>
>>
>> I suspect you should be prepared for the ratio being anything.
>
> In theory, anything is possible. But I'm asking that we consider realistic use cases.
>
>>
>>> Intuitively, I would use this library for ONESHOT timers, and perhaps
>> implement my periodical timers by other means.
>>>
>>> If the PERIODICAL:ONESHOT ratio is low, we can probably live with the extra
>> cost of cancel+add for a few periodical timers.
>>>
>>> From another angle:
>>>
>>> What is the performance gain with the PERIODICAL flag?
>>>
>>
>> None, pretty much. It's just there for convenience.
>
> OK, then I suggest that you remove it, unless you get objections.
>
> The library can be expanded with useful features at any time later. Useless features are (nearly) impossible to remove, once they are in there - they are just "technical debt" with associated maintenance costs, added complexity weaving into other features, etc..
>
>>
>>> Without a periodical timer, cancel+add costs 10+28 cycles. How many cycles
>> would a "move" function, performing both cancel and add, use?
>>>
>>> And then compare that to the cost (in cycles) of repeating a timer with
>> PERIODICAL?
>>>
>>> Furthermore, not having the PERIODICAL flag probably improves the
>> performance for non-periodical timers. How many cycles could we gain here?
>>>
>>>
>>> Another, vaguely related, idea:
>>>
>>> The callback pointer might not need to be stored per rte_htimer, but could
>> instead be common for the rte_htw.
>>>
>>
>> Do you mean rte_htw, or rte_htimer_mgr?
>>
>> If you make one common callback, all the different parts of the
> >> application need to be coordinated (in a big switch-statement, or
>> something of that sort), or have some convention for using an
>> application-specific wrapper structure (accessed via container_of()).
>>
>> This is a problem if the timer service API consumer is a set of largely
>> uncoordinated software modules.
>>
>> Btw, the eventdev API has the same issue, and the proposed event
>> dispatcher is one way to help facilitate application-internal decoupling.
>>
>> For a module-private rte_htw instance your suggestion may work, but not
>> for <rte_htimer_mgr.h>.
>
> I was speculating that a common callback pointer might provide a performance benefit for single-purpose HTW instances. (The same concept applies if there are multiple callbacks, e.g. a "Timer Fired", a "Timer Cancelled", and an "Early Kick" callback pointer - i.e. having the callback pointers per HTW instance, instead of per timer.)
>
>>
>>> When a timer fires, the callback probably needs to check/update the state of
>> the object for which the timer fired anyway, so why not just let the
>> application use that state to determine the appropriate action. This might
>> provide some performance benefit.
>>>
>>> It might complicate using one HTW for multiple different purposes, though.
>> Probably a useless idea, but I wanted to share the idea anyway. It might
>> trigger other, better ideas in the community.
>>>
>>>>
>>>> * Should the async result codes and the sync cancel error codes be merged
>>>> into one set of result codes?
>>>>
> >>>> * Should the rte_htimer_mgr_async_add() have a flag which allows
>>>> buffering add request messages until rte_htimer_mgr_process() is
>>>> called? Or any manage function. Would reduce ring signaling overhead
>>>> (i.e., burst enqueue operations instead of single-element
>>>> enqueue). Could also be a rte_htimer_mgr_async_add_burst() function,
>>>> solving the same "problem" a different way. (The signature of such
>>>> a function would not be pretty.)
>>>>
>>>> * Does the functionality provided by the rte_htimer_mgr_process()
>>>> function match its use cases? Should there be a clearer
>>>> separation between expiry processing and asynchronous operation
>>>> processing?
>>>>
>>>> * Should the patchset be split into more commits? If so, how?
>>>>
>>>> Thanks to Erik Carrillo for his assistance.
>>>>
>>>> Mattias Rönnblom (2):
>>>> eal: add bitset type
>>>> eal: add high-performance timer facility
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [RFC 0/2] Add high-performance timer facility
2023-03-01 15:50 ` Mattias Rönnblom
@ 2023-03-01 17:06 ` Morten Brørup
0 siblings, 0 replies; 31+ messages in thread
From: Morten Brørup @ 2023-03-01 17:06 UTC (permalink / raw)
To: Mattias Rönnblom, dev
Cc: Erik Gabriel Carrillo, David Marchand, Maria Lingemark, Stefan Sundkvist
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, 1 March 2023 16.50
>
> On 2023-03-01 14:31, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Wednesday, 1 March 2023 12.18
> >>
> >> On 2023-02-28 17:01, Morten Brørup wrote:
> >>>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >>>> Sent: Tuesday, 28 February 2023 10.39
> >>>
> >>> I have been looking for a high performance timer library (for use
> >>> in a fast path TCP stack), and this looks very useful, Mattias.
> >>>
> >>> My initial feedback is based on quickly skimming the patch source
> >>> code, and reading this cover letter.
> >>>
> >>>>
> >>>> This patchset is an attempt to introduce a high-performance, highly
> >>>> scalable timer facility into DPDK.
> >>>>
> >>>> More specifically, the goals for the htimer library are:
> >>>>
> >>>> * Efficient handling of a handful up to hundreds of thousands of
> >>>> concurrent timers.
> >>>> * Reduced overhead of adding and canceling timers.
> >>>> * Provide a service functionally equivalent to that of
> >>>> <rte_timer.h>. API/ABI backward compatibility is secondary.
> >>>>
> >>>> In the author's opinion, there are two main shortcomings with the
> >>>> current DPDK timer library (i.e., rte_timer.[ch]).
> >>>>
> >>>> One is the synchronization overhead, where heavy-weight full-barrier
> >>>> type synchronization is used. rte_timer.c uses per-EAL/lcore skip
> >>>> lists, but any thread may add or cancel (or otherwise access) timers
> >>>> managed by another lcore (which thus reside in its timer skip list).
> >>>>
> >>>> The other is an algorithmic shortcoming, with rte_timer.c's reliance
> >>>> on a skip list, which, seemingly, is less efficient than certain
> >>>> alternatives.
> >>>>
> >>>> This patchset implements a hierarchical timer wheel (HWT, in
> >>>
> >>> Typo: HWT or HTW?
> >>
>> Yes. I don't understand how I managed to make so many such HTW ->
>> HWT typos. At least I got the filenames (rte_htw.[ch]) correct.
> >>
> >>>
> >>>> rte_htw.c), as per the Varghese and Lauck paper "Hashed and
> >>>> Hierarchical Timing Wheels: Data Structures for the Efficient
> >>>> Implementation of a Timer Facility". A HWT is a data structure
> >>>> purposely designed for this task, and used by many operating system
> >>>> kernel timer facilities.
> >>>>
> >>>> To further improve the solution described by Varghese and Lauck, a
> >>>> bitset is placed in front of each of the timer wheels in the HWT,
> >>>> reducing the overhead of rte_htimer_mgr_manage() (i.e., progressing
> >>>> time and expiry processing).
> >>>>
> >>>> Cycle-efficient scanning and manipulation of these bitsets are
> >>>> crucial for the HWT's performance.
> >>>>
> >>>> The htimer module keeps a per-lcore (or per-registered EAL thread)
> >>>> HWT instance, much like rte_timer.c keeps a per-lcore skip list.
> >>>>
> >>>> To avoid expensive synchronization overhead for thread-local timer
> >>>> management, the HWTs are accessed only from the "owning" thread. Any
> >>>> interaction any other thread has with a particular lcore's timer
> >>>> wheel goes over a set of DPDK rings. A side effect of this design is
> >>>> that all operations working toward a "remote" HWT must be
> >>>> asynchronous.
> >>>>
> >>>> The <rte_htimer.h> API is available only to EAL threads and
> >>>> registered non-EAL threads.
> >>>>
> >>>> The htimer API allows the application to supply the current time,
> >>>> useful in case it already has retrieved this for other purposes,
> >>>> saving the cost of a rdtsc instruction (or its equivalent).
> >>>>
> >>>> A relative htimer add does not retrieve a new time, but reuses the
> >>>> current time (as known via/at the time of the manage call), again to
> >>>> shave off some cycles of overhead.
> >>>
> >>> I have a comment to the two points above.
> >>>
> >>> I agree that the application should supply the current time.
> >>>
> >>> This should be the concept throughout the library. I don't
> >>> understand why TSC is used in the library at all?
> >>>
> >>> Please use a unit-less tick, and let the application decide what one
> >>> tick means.
> >>>
> >>
> >> I suspect the design of rte_htimer_mgr.h (and rte_timer.h) makes more
> >> sense if you think of the user of the API as not just a "monolithic"
> >> application, but rather a set of different modules, developed by
> >> different organizations, and reused across a set of applications. The
> >> idea behind the API design is that they should all be able to share
> >> one timer service instance.
> >>
> >> The different parts of the application, and any future DPDK platform
> >> modules that use the htimer service, need to agree on what a tick
> >> means in terms of actual wall time, if it's not mandated by the API.
> >
> > I see. Then those non-monolithic applications can agree that the unit
> > of time is nanoseconds, or whatever makes sense for those
> > applications. And then they can instantiate one shared HTW for that
> > purpose.
> >
>
> <rte_htimer_mgr.h> contains nothing but shared HTWs.
>
> > There is no need to impose such an API limit on other users of the
> > library.
> >
> >>
> >> There might be room for module-specific timer wheels as well, with
> >> different resolution or other characteristics. The event timer
> >> adapter's use of a timer wheel could be one example (although I'm not
> >> sure it is).
> >
> > We are not using the event device, and I have not looked into it, so
> > I have no qualified comments on this.
> >
> >>
> >> If timer-wheel-as-a-private-lego-piece is also a valid use case, then
> >> one could consider making the <rte_htw.h> API public as well. That is
> >> what I think you are asking for here: a generic timer wheel that
> >> doesn't know anything about time sources, time source time -> tick
> >> conversion, or time source time -> monotonic wall time conversion,
> >> and maybe is also not bound to a particular thread.
> >
> > Yes, that is what I had been searching the Internet for.
> >
> > (I'm not sure what you mean by "not bound to a particular thread".
> > Your per-thread design seems good to me.)
> >
> > I don't want more stuff in the EAL. What I want is high-performance
> > DPDK libraries we can use in our applications.
> >
> >>
> >> I picked TSC because it seemed like a good "universal time unit" for
> >> DPDK. rdtsc (or its equivalent) is also very precise (especially on
> >> x86) and cheap to retrieve (especially on ARM, from what I
> >> understand).
> >
> > The TSC does have excellent performance, but on all other parameters
> > it is a horrible timekeeper: the measurement unit depends on the
> > underlying hardware, the TSC drifts depending on temperature, it
> > cannot be PTP-synchronized; the list is endless!
> >
> >>
> >> That said, at the moment, I'm leaning toward nanoseconds (in uint64_t
> >> format) being the default for timer expiration time instead of TSC.
> >> TSC could still be an option for passing the current time, since TSC
> >> will be a common time source, and it shaves off one conversion.
> >
> > There are many reasons why nanoseconds is a much better choice than
> > TSC.
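If nanoseconds become the expiry unit while TSC remains an input option, the conversion discussed above could look like the following. This is a hedged, standalone sketch, not the proposed API; in a real DPDK build the frequency would come from rte_get_tsc_hz().

```c
#include <assert.h>
#include <stdint.h>

#define NS_PER_SEC UINT64_C(1000000000)

/* Convert a TSC cycle count to nanoseconds, given a calibrated TSC
 * frequency. Splitting into whole seconds and a remainder keeps the
 * 64-bit intermediate product (rem * NS_PER_SEC) from overflowing for
 * large cycle counts, since rem < tsc_hz. */
static uint64_t tsc_to_ns(uint64_t cycles, uint64_t tsc_hz)
{
	uint64_t secs = cycles / tsc_hz;
	uint64_t rem = cycles % tsc_hz;

	return secs * NS_PER_SEC + rem * NS_PER_SEC / tsc_hz;
}
```

The divisions here are exactly what the libdivide/<rte_reciprocal.h> question later in the thread is about: tsc_hz is a runtime constant, so the divide can be replaced by a precomputed multiply-and-shift.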
> >
> >>
> >>> A unit-less tick will also let the application instantiate a HTW
> >>> with higher resolution than the TSC. (E.g. think about oversampling
> >>> in audio processing, or Bresenham's line drawing algorithm for 2D
> >>> visuals - oversampling can sound and look better.)
> >
> > Some of the timing data in our application have a resolution orders
> > of magnitude higher than one nanosecond. If we combined that with a
> > HTW library with nanosecond resolution, we would need to keep these
> > timer values in two locations: the original high-res timer in our
> > data structure, and the shadow low-res (nanosecond) timer in the HTW.
> >
>
> There is no way you will meet timers with anything approaching
> pico-second-level precision.
Correct. Our sub-nanosecond timers don't need to meet the exact time, but the higher resolution prevents loss of accuracy when a number has been added to it many times. Think of it like a special fixed-point number, where the least significant part is included to ensure accuracy in calculations, while the actual timer only considers the most significant part of the number.
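The accumulate-in-full-precision idea described here can be sketched as Q32.32 fixed point. This is purely illustrative toy code: the full-precision accumulator never drifts, and the integer tick is derived only when needed.

```c
#include <assert.h>
#include <stdint.h>

#define Q32_ONE (UINT64_C(1) << 32)

/* The tick position after n periods, where the period is a Q32.32
 * fixed-point value (integer ticks in the high 32 bits, fraction in
 * the low 32 bits). Rounding happens once, on the full-precision
 * product, so no error accumulates across periods. Overflow-safe only
 * for small n, which suffices for a sketch. */
static uint64_t q32_ticks_after(uint64_t period_q32, unsigned int n)
{
	uint64_t acc = (uint64_t)n * period_q32;

	return (acc + Q32_ONE / 2) >> 32; /* round to nearest tick */
}
```

With a period of 1.2 ticks, three periods land on tick 4 (3.6 rounded), whereas rounding each step to an integer tick first would land on tick 3.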
> You will also get into a value range issue,
> since you will wrap around a 64-bit integer in a matter of days.
Yes. We use timers with different scales for individual purposes. Our highest-resolution timers are sub-nanosecond.
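The wrap-around horizons mentioned above are easy to check with a little arithmetic: a free-running uint64_t picosecond counter wraps in roughly 213 days, a nanosecond one in roughly 584 years.

```c
#include <assert.h>
#include <stdint.h>

/* Days until a uint64_t counter at the given resolution wraps. */
static uint64_t wrap_horizon_days(uint64_t ticks_per_sec)
{
	return UINT64_MAX / ticks_per_sec / 86400;
}
```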
>
> The HTW only stores the timeout in ticks, not TSC, nanoseconds or
> picoseconds.
Excellent. Then I'm happy.
> Generally, you don't want pico-second-level tick
> granularity, since it increases the overhead of advancing the wheel(s).
We currently use proprietary algorithms for our bandwidth scheduling. It seems that a HTW is not a good fit for this purpose. Perhaps you are offering a hammer, and it's not a good replacement for my screwdriver.
I suppose that nanosecond resolution suffices for a TCP stack, which is the use case I have been on the lookout for a timer library for. :-)
> The first (lowest-significance) few wheels will pretty much always be
> empty.
>
> > We might also need to frequently update the HTW timers to prevent
> > them drifting away from the high-res timers. E.g. 1.2 + 1.2 is still
> > 2 when rounded, but 2 + 1.2 becomes 3 when it should have been 4
> > (3 * 1.2 = 3.6, rounded). This level of drifting would also make
> > periodic timers in the HTW useless.
> >
>
> Useless, for a certain class of applications. What application would
> that be?
Sorry about being unclear there. Yes, I only meant the specific application I was talking about, i.e. our application for high precision bandwidth management. For reference, 1 bit at 100 Gbit/s is 10 picoseconds.
>
> > Please note: I haven't really considered merging the high-res timing
> > in our application with this HTW, and I'm also not saying that
> > PERIODIC timers in the HTW are required or even useful for our
> > application. I'm only providing arguments for a unit-less time!
> >
> >>>
> >>> For reference (supporting my suggestion), the dynamic timestamp
> >>> field in the rte_mbuf structure is also defined as being unit-less.
> >>> (I think NVIDIA implements it as nanoseconds, but that's an
> >>> implementation-specific choice.)
> >>>
> >>>>
> >>>> A semantic improvement compared to the <rte_timer.h> API is that
> >>>> the htimer library can give a definite answer to the question of
> >>>> whether the timer expiry callback was called, after a timer has
> >>>> been canceled.
> >>>>
> >>>> Below is performance data from DPDK's 'app/test' micro benchmarks,
> >>>> using 10k concurrent timers. The benchmarks (test_timer_perf.c and
> >>>> test_htimer_mgr_perf.c) aren't identical in their structure, but
> >>>> the numbers give some indication of the difference.
> >>>>
> >>>> Use case                  htimer  timer
> >>>> ----------------------------------------
> >>>> Add timer                     28    253
> >>>> Cancel timer                  10    412
> >>>> Async add (source lcore)      64
> >>>> Async add (target lcore)      13
> >>>>
> >>>> (AMD 5900X CPU. Time in TSC.)
> >>>>
> >>>> Prototype integration of the htimer library into real, timer-heavy
> >>>> applications indicates that htimer may result in significant
> >>>> application-level performance gains.
> >>>>
> >>>> The bitset implementation which the HWT implementation depends upon
> >>>> seemed generic enough, and potentially useful outside the world of
> >>>> HWTs, to justify being located in the EAL.
> >>>>
> >>>> This patchset is very much an RFC, and the author is yet to form an
> >>>> opinion on many important issues.
> >>>>
> >>>> * If deemed a suitable replacement, should the htimer replace the
> >>>> current DPDK timer library in some particular (ABI-breaking)
> >>>> release, or should it live side-by-side with the then-legacy
> >>>> <rte_timer.h> API? A lot of things in and outside DPDK depend on
> >>>> <rte_timer.h>, so coexistence may be required to facilitate a
> >>>> smooth transition.
> >>>
> >>> It's my immediate impression that they are totally different in
> >>> both design philosophy and API.
> >>>
> >>> Personal opinion: I would call it an entirely different library.
> >>>
> >>>>
> >>>> * Should the htimer and htw-related files be colocated with
> >>>> rte_timer.c in the timer library?
> >>>
> >>> Personal opinion: No. This is an entirely different library, and
> >>> should live by itself in a directory of its own.
> >>>
> >>>>
> >>>> * Would it be useful for applications using asynchronous cancel to
> >>>> have the option of having the timer callback run not only in case
> >>>> of timer expiration, but also cancellation (on the target lcore)?
> >>>> The timer cb signature would need to include an additional
> >>>> parameter in that case.
> >>>
> >>> If one thread cancels something in another thread, some
> >>> synchronization between the threads is going to be required anyway.
> >>> So we could rephrase your question: Will the burden of the otherwise
> >>> required synchronization between the two threads be significantly
> >>> reduced if the library has the ability to run the callback on
> >>> asynchronous cancel?
> >>>
> >>
> >> Yes.
> >>
> >> Intuitively, it seems convenient that if you hand off a timer to a
> >> different lcore, the timer callback will be called exactly once,
> >> regardless of whether the timer was canceled or expired.
> >>
> >> But, as you indicate, you may still need synchronization to solve the
> >> resource reclamation issue.
> >>
> >>> Is such a feature mostly "Must have" or "Nice to have"?
> >>>
> >>> More thoughts in this area...
> >>>
> >>> If adding an additional callback parameter, it could be an enum, so
> >>> the callback could be expanded to support "timeout (a.k.a. timer
> >>> fired)", "cancel", and more events we have not yet come up with,
> >>> e.g. "early kick".
> >>>
> >>
> >> Yes, or an int.
> >>
> >>> Here's an idea off the top of my head: an additional callback
> >>> parameter has a (small) performance cost incurred with every timer
> >>> fired (which is a very large multiplier). It might not be required.
> >>> As an alternative to a "what happened" parameter to the callback,
> >>> the callback could investigate the state of the object for which the
> >>> timer fired, and draw its own conclusion on how to proceed.
> >>> Obviously, this also has a performance cost, but perhaps the
> >>> callback works on the object's state anyway, making this cost
> >>> insignificant.
> >>>
> >>
> >> It's not obvious to me that you, in the timer callback, can determine
> >> what happened, if the same callback is called both in the cancel and
> >> the expired case.
> >>
> >> It's not so much the cost of an extra integer passed in a register
> >> (or of checking a flag, deciding if the timer callback should be
> >> called at all at cancellation) that is the concern for me; it's the
> >> extra bit of API complexity.
> >
> > Then introduce the library without this feature. More features can be
> added later.
> >
> > The library will be introduced as "experimental", so we are free to
> > improve it and modify the ABI along the way.
> >
> >>
> >>> Here's another alternative to adding a "what happened" parameter to
> >>> the callback:
> >>>
> >>> The rte_htimer could have one more callback pointer, which (if set)
> >>> will be called on cancellation of the timer.
> >>>
> >>
> >> This will grow the timer struct with 16 bytes.
> >
> > If the rte_htimer struct stays within one cache line, it should be
> > acceptable.
> >
>
> Timer structs are often embedded in other structures, and need not
> themselves be cache line aligned (although the "parent" struct may
> need to be, e.g. if it's dynamically allocated).
>
> So smaller is better. Just consider if you want your attosecond-level
> time stamp in a struct:
>
> struct my_timer {
>         uint64_t high_precision_time_high_bits;
>         uint64_t high_precision_time_low_bits;
>         struct rte_htimer timer;
> };
>
> ...and you allocate those structs from a mempool. If rte_htimer is
> small enough, you will fit on one cache line.
Ahh... I somehow assumed they only existed as stand-alone elements inside the HTW.
Then I obviously agree that shorter is better.
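The fits-in-one-cache-line property can also be enforced at compile time. The sizes below are hypothetical, purely for illustration (the real rte_htimer layout is defined by the patch, not here):

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_SIZE 64 /* RTE_CACHE_LINE_SIZE on most targets */

/* Hypothetical timer layout: 48 bytes on an LP64 target, leaving room
 * for two 8-byte high-precision time stamp words in the parent. */
struct hypothetical_timer {
	uint64_t expiry;
	uint64_t opaque[3]; /* stand-in for list linkage, flags, etc. */
	void (*cb)(void *);
	void *cb_arg;
};

struct my_timer {
	uint64_t high_precision_time_high_bits;
	uint64_t high_precision_time_low_bits;
	struct hypothetical_timer timer;
};

/* Fails the build if the embedded timer grows past the budget. */
_Static_assert(sizeof(struct my_timer) <= CACHE_LINE_SIZE,
	       "timer plus time stamps must fit in one cache line");
```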
>
> > On the other hand, this approach is less generic than passing an
> > additional parameter. (E.g. add yet another callback pointer for
> > "early kick"?)
> >
> > BTW, async cancel is a form of inter-thread communication. Does this
> > library really need to provide any inter-thread communication
> > mechanisms? Doesn't an inter-thread communication mechanism belong in
> > a separate library?
> >
>
> Yes, <rte_htimer_mgr.h> needs this because:
> 1) Being able to schedule timers on a remote lcore is a useful feature
> (especially since we don't have much else in terms of deferred work
> mechanisms in DPDK).
Although remote procedures are a useful feature, providing such a feature doesn't necessarily belong in a library that merely uses remote procedures.
> 2) htimer aspires to be a plug-in replacement for <rte_timer.h>
> (albeit an ABI-breaking one).
This is a good argument.
But I would much rather have a highly tuned stand-alone HTW library than a plug-in replacement of the old <rte_timer.h>.
>
> The pure HTW is in rte_htw.[ch].
>
> Plus, with the current design, async operations basically come for
> free (if you don't use them), from a performance perspective. The
> extra overhead boils down to occasionally polling an empty ring, which
> is an inexpensive operation.
OK. Then no worries.
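Why polling an empty ring is cheap can be seen from a minimal single-producer/single-consumer ring sketch. This is a single-threaded toy without the memory-ordering that a real concurrent ring (such as rte_ring) needs; the point is only that the "nothing pending" check is one load and compare.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SLOTS 256

struct spsc_ring {
	uint32_t head; /* next free slot, written by the producer */
	uint32_t tail; /* next unread slot, written by the consumer */
	void *slots[RING_SLOTS];
};

/* The common "no async requests pending" case costs only this
 * load-and-compare per manage call. */
static int ring_empty(const struct spsc_ring *r)
{
	return r->head == r->tail;
}

static void ring_enqueue(struct spsc_ring *r, void *msg)
{
	r->slots[r->head % RING_SLOTS] = msg;
	r->head++; /* a real ring publishes this with release semantics */
}

static void *ring_dequeue(struct spsc_ring *r)
{
	void *msg = r->slots[r->tail % RING_SLOTS];

	r->tail++;
	return msg;
}
```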
>
> >>
> >>>>
> >>>> * Should the rte_htimer be a nested struct, so that the htw parts
> >>>> are separated from the htimer parts?
> >>>>
> >>>> * <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
> >>>> <rte_htw.h> may avoid a dependency on <rte_htimer_mgr.h>. Should it
> >>>> be so?
> >>>>
> >>>> * The rte_htimer struct is only supposed to be used by the
> >>>> application to give an indication of how much memory it needs to
> >>>> allocate, and its members are not supposed to be directly accessed
> >>>> (with the possible exception of the owner_lcore_id field). Should
> >>>> there be a dummy struct, or a #define RTE_HTIMER_MEMSIZE, or a
> >>>> rte_htimer_get_memsize() function instead, serving the same
> >>>> purpose? Better encapsulation, but more inconvenient for
> >>>> applications. Run-time dynamic sizing would force
> >>>> application-level dynamic allocations.
> >>>>
> >>>> * Asynchronous cancellation is a little tricky to use for the
> >>>> application (primarily due to timer memory reclamation/race
> >>>> issues). Should this functionality be removed?
> >>>>
> >>>> * Should rte_htimer_mgr_init() also retrieve the current time? If
> >>>> so, there should be a variant which allows the user to specify the
> >>>> time (to match rte_htimer_mgr_manage_time()). One pitfall with the
> >>>> current proposed API is an application calling
> >>>> rte_htimer_mgr_init() and then immediately adding a timer with a
> >>>> relative timeout, in which case the current absolute time used is
> >>>> 0, which might be a surprise.
> >>>>
> >>>> * Should libdivide (optionally) be used to avoid the div in the
> >>>> TSC -> tick conversion? (Doesn't improve performance on Zen 3, but
> >>>> may on other CPUs.) Consider <rte_reciprocal.h> as well.
> >>>>
> >>>> * Should the TSC-per-tick be rounded up to a power of 2, so shifts
> >>>> can be used for conversion? Very minor performance gains to be
> >>>> found there, at least on Zen 3 cores.
> >>>>
> >>>> * Should it be possible to supply the time in rte_htimer_mgr_add()
> >>>> and/or rte_htimer_mgr_manage_time() functions as ticks, rather than
> >>>> as TSC? Should it be possible to also use nanoseconds?
> >>>> rte_htimer_mgr_manage_time() would need a flags parameter in that
> >>>> case.
> >>>
> >>> Do not use TSC anywhere in this library. Let the application decide
> >>> the meaning of a tick.
> >>>
> >>>>
> >>>> * Would the event timer adapter be best off using <rte_htw.h>
> >>>> directly, or <rte_htimer.h>? In the latter case, there needs to be
> >>>> a way to instantiate more HWTs (similar to the "alt" functions of
> >>>> <rte_timer.h>)?
> >>>>
> >>>> * Should the PERIODICAL flag (and the complexity it brings) be
> >>>> removed? And leave the application with only single-shot timers,
> >>>> and the option to re-add them in the timer callback.
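The re-add-in-the-callback alternative mentioned in this bullet can be sketched as a standalone toy model (the names here are hypothetical, not the proposed <rte_htimer_mgr.h> API):

```c
#include <assert.h>
#include <stdint.h>

struct oneshot {
	uint64_t expiry;
	uint64_t period;
	int armed;
	unsigned int fired;
};

static void timer_add(struct oneshot *t, uint64_t base)
{
	t->expiry = base + t->period;
	t->armed = 1;
}

/* Callback body: do the work, then re-arm relative to the old expiry
 * time (not "now"), so scheduling latency does not accumulate into
 * period drift. */
static void on_expiry(struct oneshot *t)
{
	t->fired++;
	timer_add(t, t->expiry);
}

/* Stand-in for a manage call that expires due timers. */
static void manage(struct oneshot *t, uint64_t now)
{
	while (t->armed && t->expiry <= now)
		on_expiry(t);
}
```

The cost versus a native PERIODICAL flag is one extra add per expiration, which is the cancel+add-style overhead being weighed below.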
> >>>
> >>> First thought: Yes, keep it lean and remove the periodical stuff.
> >>>
> >>> Second thought: This needs a more detailed analysis.
> >>>
> >>> From one angle:
> >>>
> >>> How many PERIODICAL versus ONESHOT timers do we expect?
> >>>
> >>
> >> I suspect you should be prepared for the ratio being anything.
> >
> > In theory, anything is possible. But I'm asking that we consider
> > realistic use cases.
> >
> >>
> >>> Intuitively, I would use this library for ONESHOT timers, and
> >>> perhaps implement my periodical timers by other means.
> >>>
> >>> If the PERIODICAL:ONESHOT ratio is low, we can probably live with
> >>> the extra cost of cancel+add for a few periodical timers.
> >>>
> >>> From another angle:
> >>>
> >>> What is the performance gain with the PERIODICAL flag?
> >>>
> >>
> >> None, pretty much. It's just there for convenience.
> >
> > OK, then I suggest that you remove it, unless you get objections.
> >
> > The library can be expanded with useful features at any time later.
> > Useless features are (nearly) impossible to remove, once they are in
> > there - they are just "technical debt" with associated maintenance
> > costs, added complexity weaving into other features, etc.
> >
> >>
> >>> Without a periodical timer, cancel+add costs 10+28 cycles. How many
> >>> cycles would a "move" function, performing both cancel and add, use?
> >>>
> >>> And then compare that to the cost (in cycles) of repeating a timer
> >>> with PERIODICAL?
> >>>
> >>> Furthermore, not having the PERIODICAL flag probably improves the
> >>> performance for non-periodical timers. How many cycles could we
> >>> gain here?
> >>>
> >>>
> >>> Another, vaguely related, idea:
> >>>
> >>> The callback pointer might not need to be stored per rte_htimer,
> >>> but could instead be common for the rte_htw.
> >>>
> >>
> >> Do you mean rte_htw, or rte_htimer_mgr?
> >>
> >> If you make one common callback, all the different parts of the
> >> application needs to be coordinated (in a big switch-statement, or
> >> something of that sort), or have some convention for using an
> >> application-specific wrapper structure (accessed via container_of()).
> >>
> >> This is a problem if the timer service API consumer is a set of
> >> largely uncoordinated software modules.
> >>
> >> Btw, the eventdev API has the same issue, and the proposed event
> >> dispatcher is one way to help facilitate application-internal
> >> decoupling.
> >>
> >> For a module-private rte_htw instance your suggestion may work, but
> >> not for <rte_htimer_mgr.h>.
> >
> > I was speculating that a common callback pointer might provide a
> > performance benefit for single-purpose HTW instances. (The same
> > concept applies if there are multiple callbacks, e.g. a "Timer
> > Fired", a "Timer Cancelled", and an "Early Kick" callback pointer -
> > i.e. having the callback pointers per HTW instance, instead of per
> > timer.)
> >
> >>
> >>> When a timer fires, the callback probably needs to check/update the
> >>> state of the object for which the timer fired anyway, so why not
> >>> just let the application use that state to determine the
> >>> appropriate action. This might provide some performance benefit.
> >>>
> >>> It might complicate using one HTW for multiple different purposes,
> >>> though. Probably a useless idea, but I wanted to share the idea
> >>> anyway. It might trigger other, better ideas in the community.
> >>>
> >>>>
> >>>> * Should the async result codes and the sync cancel error codes be
> >>>> merged into one set of result codes?
> >>>>
> >>>> * Should the rte_htimer_mgr_async_add() have a flag which allows
> >>>> buffering add request messages until rte_htimer_mgr_process() is
> >>>> called? Or any manage function. Would reduce ring signaling
> >>>> overhead (i.e., burst enqueue operations instead of single-element
> >>>> enqueue). Could also be a rte_htimer_mgr_async_add_burst()
> >>>> function, solving the same "problem" a different way. (The
> >>>> signature of such a function would not be pretty.)
> >>>>
> >>>> * Does the functionality provided by the rte_htimer_mgr_process()
> >>>> function match its use cases? Should there be a clearer separation
> >>>> between expiry processing and asynchronous operation processing?
> >>>>
> >>>> * Should the patchset be split into more commits? If so, how?
> >>>>
> >>>> Thanks to Erik Carrillo for his assistance.
> >>>>
> >>>> Mattias Rönnblom (2):
> >>>> eal: add bitset type
> >>>> eal: add high-performance timer facility
> >
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC 1/2] eal: add bitset type
2023-02-28 18:46 ` Tyler Retzlaff
@ 2023-03-02 6:31 ` Mattias Rönnblom
2023-03-02 20:39 ` Tyler Retzlaff
0 siblings, 1 reply; 31+ messages in thread
From: Mattias Rönnblom @ 2023-03-02 6:31 UTC (permalink / raw)
To: Tyler Retzlaff
Cc: dev, Erik Gabriel Carrillo, David Marchand, Maria Lingemark,
Stefan Sundkvist
On 2023-02-28 19:46, Tyler Retzlaff wrote:
> On Tue, Feb 28, 2023 at 10:39:15AM +0100, Mattias Rönnblom wrote:
>> Introduce a set of functions and macros that operate on sets of bits,
>> kept in arrays of 64-bit elements.
>>
>> RTE bitset is designed for bitsets which are larger than what fits in
>> a single machine word (i.e., 64 bits). For very large bitsets, the
>> <rte_bitmap.h> API may be a more appropriate choice.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>> ---
>
> ...
>
>> diff --git a/lib/eal/include/rte_bitset.h b/lib/eal/include/rte_bitset.h
>> new file mode 100644
>> index 0000000000..e333e527e5
>> --- /dev/null
>> +++ b/lib/eal/include/rte_bitset.h
>> @@ -0,0 +1,878 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2023 Ericsson AB
>> + */
>> +
>> +#ifndef _RTE_BITSET_H_
>> +#define _RTE_BITSET_H_
>> +
>> +/**
>> + * @file
>> + * RTE Bitset
>> + *
>> + * This file provides functions and macros for querying and
>> + * manipulating sets of bits kept in arrays of @c uint64_t-sized
>> + * elements.
>> + *
>> + * The bits in a bitset are numbered from 0 to @c size - 1, with the
>> + * lowest index being the least significant bit.
>> + *
>> + * The bitset array must be properly aligned.
>> + *
>> + * For optimal performance, the @c size parameter, required by
>> + * many of the API's functions, should be a compile-time constant.
>> + *
>> + * For large bitsets, the rte_bitmap.h API may be more appropriate.
>> + *
>> + * @warning
>> + * All functions modifying a bitset may overwrite any unused bits of
>> + * the last word. Such unused bits are ignored by all functions reading
>> + * bits.
>> + *
>> + */
>> +
>> +#include <limits.h>
>> +#include <stdbool.h>
>> +#include <stdint.h>
>> +#include <sys/types.h>
>
> Windows has no sys/types.h. If there is a shim being picked up
> somewhere, portable code shouldn't depend on sys/types.h.
>
That include was a misdirected attempt to get the 'size_t' definition. I
will replace it with <stddef.h>. Thanks.
>> +
>> +#include <rte_branch_prediction.h>
>> +#include <rte_common.h>
>> +#include <rte_debug.h>
>> +#include <rte_memcpy.h>
>> +
>> +#ifdef __cplusplus
>> +extern "C" {
>> +#endif
>> +
>> +/**
>> + * The size (in bytes) of each element in the array used to represent
>> + * a bitset.
>> + */
>> +#define RTE_BITSET_WORD_SIZE (sizeof(uint64_t))
>> +
>> +/**
>> + * The size (in bits) of each element in the array used to represent
>> + * a bitset.
>> + */
>> +#define RTE_BITSET_WORD_BITS (RTE_BITSET_WORD_SIZE * CHAR_BIT)
>> +
>> +/**
>> + * Computes the number of words required to store @c size bits.
>> + */
>> +#define RTE_BITSET_NUM_WORDS(size) \
>> + (((size) + RTE_BITSET_WORD_BITS - 1) / RTE_BITSET_WORD_BITS)
>> +
>> +/**
>> + * Computes the amount of memory (in bytes) required to fit a bitset
>> + * holding @c size bits.
>> + */
>> +#define RTE_BITSET_SIZE(size) \
>> + ((size_t)(RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_SIZE))
>> +
>> +#define __RTE_BITSET_WORD_IDX(bit_num) ((bit_num) / RTE_BITSET_WORD_BITS)
>> +#define __RTE_BITSET_BIT_OFFSET(bit_num) ((bit_num) % RTE_BITSET_WORD_BITS)
>> +#define __RTE_BITSET_UNUSED(size) \
>> + ((RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_BITS) \
>> + - (size))
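To make the sizing arithmetic concrete: the macros above can be restated without the DPDK dependencies (names shortened here; same logic).

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Standalone restatement of the bitset sizing macros. */
#define WORD_SIZE (sizeof(uint64_t))
#define WORD_BITS (WORD_SIZE * CHAR_BIT)

/* Round up to the number of 64-bit words needed for 'size' bits. */
#define NUM_WORDS(size) (((size) + WORD_BITS - 1) / WORD_BITS)

/* Bytes of backing storage for a 'size'-bit bitset. */
#define BITSET_SIZE(size) ((size_t)(NUM_WORDS(size) * WORD_SIZE))

/* Trailing bits of the last word not covered by the bitset; these are
 * the "unused bits" the file-level @warning above refers to. */
#define UNUSED_BITS(size) (NUM_WORDS(size) * WORD_BITS - (size))
```

For example, a 100-bit bitset needs two words (16 bytes), leaving 28 unused bits in the last word.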
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Declare a bitset.
>> + *
>> + * Declare (e.g., as a struct field) or define (e.g., as a stack
>> + * variable) a bitset of the specified size.
>> + *
>> + * @param name
>> + * The field or variable name of the resulting definition.
>> + * @param size
>> + * The number of bits the bitset must be able to represent. Must be
>> + * a compile-time constant.
>> +#define RTE_BITSET_DECLARE(name, size) \
>> + uint64_t name[RTE_BITSET_NUM_WORDS(size)]
>> +
>> +/* XXX: should one include flags here and use to avoid a comparison? */
>> +/* XXX: would this be better off as a function? */
>> +
>> +#define __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, len) \
>> + ((len) - 1 - ((var) >= (start_bit) ? (var) - (start_bit) : \
>> + (size) - (start_bit) + (var)))
>> +
>> +#define __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, flags) \
>> + for ((var) = __rte_bitset_find(bitset, size, start_bit, len, \
>> + flags); \
>> + (var) != -1; \
>> + (var) = __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, \
>> + len) > 0 ? \
>> + __rte_bitset_find(bitset, size, \
>> + ((var) + 1) % (size), \
>> + __RTE_BITSET_FOREACH_LEFT(var, \
>> + size, \
>> + start_bit, \
>> + len), \
>> + flags) : -1)
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Iterate over all bits set.
>> + *
>> + * This macro iterates over all bits set (i.e., all ones) in the
>> + * bitset, in the forward direction (i.e., starting with the least
>> + * significant '1').
>> + *
>> + * @param var
>> + * An iterator variable of type @c ssize_t. For each successive iteration,
>> + * this variable will hold the bit index of a set bit.
>> + * @param bitset
>> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
>> + * @param size
>> + * The size of the bitset (in bits).
>> + */
>> +
>> +#define RTE_BITSET_FOREACH_SET(var, bitset, size) \
>> + __RTE_BITSET_FOREACH(var, bitset, size, 0, size, 0)
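The forward scan this macro performs can be sketched standalone. This is a simplification of the patch's `__rte_bitset_find` (no wrap-around or find-clear handling; unused-bit masking reduced to a bounds check):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first set bit at position >= start, or -1.
 * Whole-word skipping plus count-trailing-zeros makes sparse bitsets
 * cheap to scan. */
static long find_next_set(const uint64_t *words, size_t size, size_t start)
{
	size_t i = start;

	while (i < size) {
		/* shift so bit 0 of 'word' corresponds to position i */
		uint64_t word = words[i / 64] >> (i % 64);

		if (word != 0) {
			size_t pos = i + (size_t)__builtin_ctzll(word);

			/* ignore any set bits past the logical size
			 * (the unused bits of the last word) */
			return pos < size ? (long)pos : -1;
		}
		i = (i / 64 + 1) * 64; /* skip to the next word */
	}
	return -1;
}
```

A FOREACH-style loop is then just repeated calls with `start = pos + 1`.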
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Iterate over all bits cleared.
>> + *
>> + * This macro iterates over all bits cleared in the bitset, in the
>> + * forward direction (i.e., starting with the least significant '0').
>> + *
>> + * @param var
>> + * An iterator variable of type @c ssize_t. For each successive iteration,
>> + * this variable will hold the bit index of a cleared bit.
>> + * @param bitset
>> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
>> + * @param size
>> + * The size of the bitset (in bits).
>> + */
>> +
>> +#define RTE_BITSET_FOREACH_CLEAR(var, bitset, size) \
>> + __RTE_BITSET_FOREACH(var, bitset, size, 0, size, \
>> + __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Iterate over all bits set within a range.
>> + *
>> + * This macro iterates over all bits set (i.e., all ones) in the
>> + * specified range, in the forward direction (i.e., starting with the
>> + * least significant '1').
>> + *
>> + * @param var
>> + * An iterator variable of type @c ssize_t. For each successive iteration,
>> + * this variable will hold the bit index of a set bit.
>> + * @param bitset
>> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
>> + * @param size
>> + * The size of the bitset (in bits).
>> + * @param start_bit
>> + * The index of the first bit to check. Must be less than @c size.
>> + * @param len
>> + * The length (in bits) of the range. @c start_bit + @c len must be less
>> + * than or equal to @c size.
>> + */
>> +
>> +#define RTE_BITSET_FOREACH_SET_RANGE(var, bitset, size, start_bit, \
>> + len) \
>> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, 0)
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Iterate over all cleared bits within a range.
>> + *
>> + * This macro iterates over all bits cleared (i.e., all zeroes) in the
>> + * specified range, in the forward direction (i.e., starting with the
>> + * least significant '0').
>> + *
>> + * @param var
>> + * An iterator variable of type @c ssize_t. For each successive iteration,
>> + * this variable will hold the bit index of a cleared bit.
>> + * @param bitset
>> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
>> + * @param size
>> + * The size of the bitset (in bits).
>> + * @param start_bit
>> + * The index of the first bit to check. Must be less than @c size.
>> + * @param len
>> + * The length (in bits) of the range. @c start_bit + @c len must be less
>> + * than or equal to @c size.
>> + */
>> +
>> +#define RTE_BITSET_FOREACH_CLEAR_RANGE(var, bitset, size, start_bit, \
>> + len) \
>> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
>> + __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
>> +
>> +#define RTE_BITSET_FOREACH_SET_WRAP(var, bitset, size, start_bit, \
>> + len) \
>> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
>> + __RTE_BITSET_FIND_FLAG_WRAP)
>> +
>> +#define RTE_BITSET_FOREACH_CLEAR_WRAP(var, bitset, size, start_bit, \
>> + len) \
>> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
>> + __RTE_BITSET_FIND_FLAG_WRAP | \
>> + __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Initializes a bitset.
>> + *
>> + * All bits are cleared.
>> + *
>> + * @param bitset
>> + * A pointer to the array of bitset 64-bit words.
>> + * @param size
>> + * The size of the bitset (in bits).
>> + */
>> +
>> +__rte_experimental
>> +static inline void
>> +rte_bitset_init(uint64_t *bitset, size_t size)
>> +{
>> + memset(bitset, 0, RTE_BITSET_SIZE(size));
>> +}
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Set a bit in the bitset.
>> + *
>> + * Bits are numbered from 0 to (size - 1) (inclusive).
>> + *
>> + * @param bitset
>> + * A pointer to the array of words making up the bitset.
>> + * @param bit_num
>> + * The index of the bit to be set.
>> + */
>> +
>> +__rte_experimental
>> +static inline void
>> +rte_bitset_set(uint64_t *bitset, size_t bit_num)
>> +{
>> + size_t word;
>> + size_t offset;
>> + uint64_t mask;
>> +
>> + word = __RTE_BITSET_WORD_IDX(bit_num);
>> + offset = __RTE_BITSET_BIT_OFFSET(bit_num);
>> + mask = UINT64_C(1) << offset;
>> +
>> + bitset[word] |= mask;
>> +}
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Clear a bit in the bitset.
>> + *
>> + * Bits are numbered 0 to (size - 1) (inclusive).
>> + *
>> + * @param bitset
>> + * A pointer to the array of words making up the bitset.
>> + * @param bit_num
>> + * The index of the bit to be cleared.
>> + */
>> +
>> +__rte_experimental
>> +static inline void
>> +rte_bitset_clear(uint64_t *bitset, size_t bit_num)
>> +{
>> + size_t word;
>> + size_t offset;
>> + uint64_t mask;
>> +
>> + word = __RTE_BITSET_WORD_IDX(bit_num);
>> + offset = __RTE_BITSET_BIT_OFFSET(bit_num);
>> + mask = ~(UINT64_C(1) << offset);
>> +
>> + bitset[word] &= mask;
>> +}
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Set all bits in the bitset.
>> + *
>> + * @param bitset
>> + * A pointer to the array of words making up the bitset.
>> + * @param size
>> + * The size of the bitset (in bits).
>> + */
>> +
>> +__rte_experimental
>> +static inline void
>> +rte_bitset_set_all(uint64_t *bitset, size_t size)
>> +{
>> + memset(bitset, 0xFF, RTE_BITSET_SIZE(size));
>> +}
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Clear all bits in the bitset.
>> + *
>> + * @param bitset
>> + * A pointer to the array of words making up the bitset.
>> + * @param size
>> + * The size of the bitset (in bits).
>> + */
>> +
>> +__rte_experimental
>> +static inline void
>> +rte_bitset_clear_all(uint64_t *bitset, size_t size)
>> +{
>> + rte_bitset_init(bitset, size);
>> +}
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Count all set bits.
>> + *
>> + * @param bitset
>> + * A pointer to the array of words making up the bitset.
>> + * @param size
>> + * The size of the bitset (in bits).
>> + * @return
>> + * Returns the number of '1' bits in the bitset.
>> + */
>> +
>> +__rte_experimental
>> +static inline size_t
>> +rte_bitset_count_set(const uint64_t *bitset, size_t size)
>> +{
>> + size_t i;
>> + size_t total = 0;
>> + uint64_t unused_mask;
>> +
>> + /*
>> + * Unused bits in a rte_bitset are always '0', and thus are
>> + * not included in this count.
>> + */
>> + for (i = 0; i < RTE_BITSET_NUM_WORDS(size) - 1; i++)
>> + total += __builtin_popcountll(bitset[i]);
>> +
>> + unused_mask = UINT64_MAX >> __RTE_BITSET_UNUSED(size);
>> + total += __builtin_popcountll(bitset[i] & unused_mask);
>> +
>> + return total;
>> +}
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Count all cleared bits.
>> + *
>> + * @param bitset
>> + * A pointer to the array of words making up the bitset.
>> + * @param size
>> + * The size of the bitset (in bits).
>> + * @return
>> + * Returns the number of '0' bits in the bitset.
>> + */
>> +
>> +__rte_experimental
>> +static inline size_t
>> +rte_bitset_count_clear(const uint64_t *bitset, size_t size)
>> +{
>> + return size - rte_bitset_count_set(bitset, size);
>> +}
>> +
>> +/**
>> + * @warning
>> + * @b EXPERIMENTAL: this API may change without prior notice.
>> + *
>> + * Test if a bit is set.
>> + *
>> + * @param bitset
>> + * A pointer to the array of words making up the bitset.
>> + * @param bit_num
>> + * Index of the bit to test. Index 0 is the least significant bit.
>> + * @return
>> + * Returns true if the bit is '1', and false if the bit is '0'.
>> + */
>> +
>> +__rte_experimental
>> +static inline bool
>> +rte_bitset_test(const uint64_t *bitset, size_t bit_num)
>> +{
>> + size_t word;
>> + size_t offset;
>> +
>> + word = __RTE_BITSET_WORD_IDX(bit_num);
>> + offset = __RTE_BITSET_BIT_OFFSET(bit_num);
>> +
>> + return (bitset[word] >> offset) & 1;
>> +}
>> +
>> +#define __RTE_BITSET_FIND_FLAG_FIND_CLEAR (1U << 0)
>> +#define __RTE_BITSET_FIND_FLAG_WRAP (1U << 1)
>> +
>> +__rte_experimental
>> +static inline ssize_t
>> +__rte_bitset_find_nowrap(const uint64_t *bitset, size_t __rte_unused size,
>> + size_t start_bit, size_t len, bool find_clear)
>
> ^^ seems like the intent here is for this to be internal (not part of
> public api) i wonder do we have to mark it __rte_experimental?
>
> or better can we prevent visibility / consumption outside of the
> inline functions in this header?
>
It is supposed to be internal (although you can question the DPDK habit
of using the reserved __ namespace for such symbols).
I don't know of any way to prevent its use. I'm not sure it's needed.
Those who don't heed the "__" prefix warning are on their own.
>> +{
>> + size_t word_idx;
>> + size_t offset;
>> + size_t end_bit = start_bit + len;
>> +
>> + RTE_ASSERT(end_bit <= size);
>> +
>> + if (unlikely(len == 0))
>> + return -1;
>> +
>> + word_idx = __RTE_BITSET_WORD_IDX(start_bit);
>> + offset = __RTE_BITSET_BIT_OFFSET(start_bit);
>> +
>> + while (word_idx <= __RTE_BITSET_WORD_IDX(end_bit - 1)) {
>> + uint64_t word;
>> + int word_ffs;
>> +
>> + word = bitset[word_idx];
>> + if (find_clear)
>> + word = ~word;
>> +
>> + word >>= offset;
>> +
>> + word_ffs = __builtin_ffsll(word);
>> +
>> + if (word_ffs != 0) {
>> + ssize_t ffs = start_bit + word_ffs - 1;
>> +
>> + /*
>> + * Check if the found set bit was among the
>> + * unused bits in the last word.
>> + */
>> + if (unlikely(ffs >= (ssize_t)end_bit))
>> + return -1;
>> +
>> + return ffs;
>> + }
>> +
>> + start_bit += (RTE_BITSET_WORD_BITS - offset);
>> + word_idx++;
>> + offset = 0;
>> + }
>> +
>> + return -1;
>> +
>> +}
>> +
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC 1/2] eal: add bitset type
2023-03-02 6:31 ` Mattias Rönnblom
@ 2023-03-02 20:39 ` Tyler Retzlaff
0 siblings, 0 replies; 31+ messages in thread
From: Tyler Retzlaff @ 2023-03-02 20:39 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, Maria Lingemark,
Stefan Sundkvist
On Thu, Mar 02, 2023 at 06:31:40AM +0000, Mattias Rönnblom wrote:
> On 2023-02-28 19:46, Tyler Retzlaff wrote:
> > On Tue, Feb 28, 2023 at 10:39:15AM +0100, Mattias Rönnblom wrote:
> >> Introduce a set of functions and macros that operate on sets of bits,
> >> kept in arrays of 64-bit elements.
> >>
> >> RTE bitset is designed for bitsets which are larger than what fits in
> >> a single machine word (i.e., 64 bits). For very large bitsets, the
> >> <rte_bitmap.h> API may be a more appropriate choice.
> >>
> >> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> >> ---
> >
> > ...
> >
> >> diff --git a/lib/eal/include/rte_bitset.h b/lib/eal/include/rte_bitset.h
> >> new file mode 100644
> >> index 0000000000..e333e527e5
> >> --- /dev/null
> >> +++ b/lib/eal/include/rte_bitset.h
> >> @@ -0,0 +1,878 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2023 Ericsson AB
> >> + */
> >> +
> >> +#ifndef _RTE_BITSET_H_
> >> +#define _RTE_BITSET_H_
> >> +
> >> +/**
> >> + * @file
> >> + * RTE Bitset
> >> + *
> >> + * This file provides functions and macros for querying and
> >> + * manipulating sets of bits kept in arrays of @c uint64_t-sized
> >> + * elements.
> >> + *
> >> + * The bits in a bitset are numbered from 0 to @c size - 1, with the
> >> + * lowest index being the least significant bit.
> >> + *
> >> + * The bitset array must be properly aligned.
> >> + *
> >> + * For optimal performance, the @c size parameter, required by
> >> + * many of the API's functions, should be a compile-time constant.
> >> + *
> >> + * For large bitsets, the rte_bitmap.h API may be more appropriate.
> >> + *
> >> + * @warning
> >> + * All functions modifying a bitset may overwrite any unused bits of
> >> + * the last word. Such unused bits are ignored by all functions reading
> >> + * bits.
> >> + *
> >> + */
> >> +
> >> +#include <limits.h>
> >> +#include <stdbool.h>
> >> +#include <stdint.h>
> >> +#include <sys/types.h>
> >
> > windows has no sys/types.h if there is a shim being picked up somewhere
> > portable code shouldn't depend on sys/types.h
> >
>
> That include was a misdirected attempt to get the 'size_t' definition. I
> will replace it with <stddef.h>. Thanks.
>
> >> +
> >> +#include <rte_branch_prediction.h>
> >> +#include <rte_common.h>
> >> +#include <rte_debug.h>
> >> +#include <rte_memcpy.h>
> >> +
> >> +#ifdef __cplusplus
> >> +extern "C" {
> >> +#endif
> >> +
> >> +/**
> >> + * The size (in bytes) of each element in the array used to represent
> >> + * a bitset.
> >> + */
> >> +#define RTE_BITSET_WORD_SIZE (sizeof(uint64_t))
> >> +
> >> +/**
> >> + * The size (in bits) of each element in the array used to represent
> >> + * a bitset.
> >> + */
> >> +#define RTE_BITSET_WORD_BITS (RTE_BITSET_WORD_SIZE * CHAR_BIT)
> >> +
> >> +/**
> >> + * Computes the number of words required to store @c size bits.
> >> + */
> >> +#define RTE_BITSET_NUM_WORDS(size) \
> >> + ((size + RTE_BITSET_WORD_BITS - 1) / RTE_BITSET_WORD_BITS)
> >> +
> >> +/**
> >> + * Computes the amount of memory (in bytes) required to fit a bitset
> >> + * holding @c size bits.
> >> + */
> >> +#define RTE_BITSET_SIZE(size) \
> >> + ((size_t)(RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_SIZE))
> >> +
> >> +#define __RTE_BITSET_WORD_IDX(bit_num) ((bit_num) / RTE_BITSET_WORD_BITS)
> >> +#define __RTE_BITSET_BIT_OFFSET(bit_num) ((bit_num) % RTE_BITSET_WORD_BITS)
> >> +#define __RTE_BITSET_UNUSED(size) \
> >> + ((RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_BITS) \
> >> + - (size))
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Declare a bitset.
> >> + *
> >> + * Declare (e.g., as a struct field) or define (e.g., as a stack
> >> + * variable) a bitset of the specified size.
> >> + *
> >> + * @param size
> >> + * The number of bits the bitset must be able to represent. Must be
> >> + * a compile-time constant.
> >> + * @param name
> >> + * The field or variable name of the resulting definition.
> >> + */
> >> +#define RTE_BITSET_DECLARE(name, size) \
> >> + uint64_t name[RTE_BITSET_NUM_WORDS(size)]
> >> +
> >> +/* XXX: should one include flags here and use to avoid a comparison? */
> >> +/* XXX: would this be better off as a function? */
> >> +
> >> +#define __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, len) \
> >> + ((len) - 1 - ((var) >= (start_bit) ? (var) - (start_bit) : \
> >> + (size) - (start_bit) + (var)))
> >> +
> >> +#define __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, flags) \
> >> + for ((var) = __rte_bitset_find(bitset, size, start_bit, len, \
> >> + flags); \
> >> + (var) != -1; \
> >> + (var) = __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, \
> >> + len) > 0 ? \
> >> + __rte_bitset_find(bitset, size, \
> >> + ((var) + 1) % (size), \
> >> + __RTE_BITSET_FOREACH_LEFT(var, \
> >> + size, \
> >> + start_bit, \
> >> + len), \
> >> + flags) : -1)
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Iterate over all bits set.
> >> + *
> >> + * This macro iterates over all bits set (i.e., all ones) in the
> >> + * bitset, in the forward direction (i.e., starting with the least
> >> + * significant '1').
> >> + *
> >> + * @param var
> >> + * An iterator variable of type @c ssize_t. For each successive iteration,
> >> + * this variable will hold the bit index of a set bit.
> >> + * @param bitset
> >> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
> >> + * @param size
> >> + * The size of the bitset (in bits).
> >> + */
> >> +
> >> +#define RTE_BITSET_FOREACH_SET(var, bitset, size) \
> >> + __RTE_BITSET_FOREACH(var, bitset, size, 0, size, 0)
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Iterate over all bits cleared.
> >> + *
> >> + * This macro iterates over all bits cleared in the bitset, in the
> >> + * forward direction (i.e., starting with the lowest-indexed cleared bit).
> >> + *
> >> + * @param var
> >> + * An iterator variable of type @c ssize_t. For each successive iteration,
> >> + * this variable will hold the bit index of a cleared bit.
> >> + * @param bitset
> >> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
> >> + * @param size
> >> + * The size of the bitset (in bits).
> >> + */
> >> +
> >> +#define RTE_BITSET_FOREACH_CLEAR(var, bitset, size) \
> >> + __RTE_BITSET_FOREACH(var, bitset, size, 0, size, \
> >> + __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Iterate over all bits set within a range.
> >> + *
> >> + * This macro iterates over all bits set (i.e., all ones) in the
> >> + * specified range, in the forward direction (i.e., starting with the
> >> + * least significant '1').
> >> + *
> >> + * @param var
> >> + * An iterator variable of type @c ssize_t. For each successive iteration,
> >> + * this variable will hold the bit index of a set bit.
> >> + * @param bitset
> >> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
> >> + * @param size
> >> + * The size of the bitset (in bits).
> >> + * @param start_bit
> >> + * The index of the first bit to check. Must be less than @c size.
> >> + * @param len
> >> + * The length (in bits) of the range. @c start_bit + @c len must be less
> >> + * than or equal to @c size.
> >> + */
> >> +
> >> +#define RTE_BITSET_FOREACH_SET_RANGE(var, bitset, size, start_bit, \
> >> + len) \
> >> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, 0)
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Iterate over all cleared bits within a range.
> >> + *
> >> + * This macro iterates over all bits cleared (i.e., all zeroes) in the
> >> + * specified range, in the forward direction (i.e., starting with the
> >> + * least significant '0').
> >> + *
> >> + * @param var
> >> + * An iterator variable of type @c ssize_t. For each successive iteration,
> >> + * this variable will hold the bit index of a cleared bit.
> >> + * @param bitset
> >> + * A <tt>const uint64_t *</tt> pointer to the bitset array.
> >> + * @param size
> >> + * The size of the bitset (in bits).
> >> + * @param start_bit
> >> + * The index of the first bit to check. Must be less than @c size.
> >> + * @param len
> >> + * The length (in bits) of the range. @c start_bit + @c len must be less
> >> + * than or equal to @c size.
> >> + */
> >> +
> >> +#define RTE_BITSET_FOREACH_CLEAR_RANGE(var, bitset, size, start_bit, \
> >> + len) \
> >> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
> >> + __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
> >> +
> >> +#define RTE_BITSET_FOREACH_SET_WRAP(var, bitset, size, start_bit, \
> >> + len) \
> >> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
> >> + __RTE_BITSET_FIND_FLAG_WRAP)
> >> +
> >> +#define RTE_BITSET_FOREACH_CLEAR_WRAP(var, bitset, size, start_bit, \
> >> + len) \
> >> + __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
> >> + __RTE_BITSET_FIND_FLAG_WRAP | \
> >> + __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Initializes a bitset.
> >> + *
> >> + * All bits are cleared.
> >> + *
> >> + * @param bitset
> >> + * A pointer to the array of bitset 64-bit words.
> >> + * @param size
> >> + * The size of the bitset (in bits).
> >> + */
> >> +
> >> +__rte_experimental
> >> +static inline void
> >> +rte_bitset_init(uint64_t *bitset, size_t size)
> >> +{
> >> + memset(bitset, 0, RTE_BITSET_SIZE(size));
> >> +}
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Set a bit in the bitset.
> >> + *
> >> + * Bits are numbered from 0 to (size - 1) (inclusive).
> >> + *
> >> + * @param bitset
> >> + * A pointer to the array of words making up the bitset.
> >> + * @param bit_num
> >> + * The index of the bit to be set.
> >> + */
> >> +
> >> +__rte_experimental
> >> +static inline void
> >> +rte_bitset_set(uint64_t *bitset, size_t bit_num)
> >> +{
> >> + size_t word;
> >> + size_t offset;
> >> + uint64_t mask;
> >> +
> >> + word = __RTE_BITSET_WORD_IDX(bit_num);
> >> + offset = __RTE_BITSET_BIT_OFFSET(bit_num);
> >> + mask = UINT64_C(1) << offset;
> >> +
> >> + bitset[word] |= mask;
> >> +}
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Clear a bit in the bitset.
> >> + *
> >> + * Bits are numbered 0 to (size - 1) (inclusive).
> >> + *
> >> + * @param bitset
> >> + * A pointer to the array of words making up the bitset.
> >> + * @param bit_num
> >> + * The index of the bit to be cleared.
> >> + */
> >> +
> >> +__rte_experimental
> >> +static inline void
> >> +rte_bitset_clear(uint64_t *bitset, size_t bit_num)
> >> +{
> >> + size_t word;
> >> + size_t offset;
> >> + uint64_t mask;
> >> +
> >> + word = __RTE_BITSET_WORD_IDX(bit_num);
> >> + offset = __RTE_BITSET_BIT_OFFSET(bit_num);
> >> + mask = ~(UINT64_C(1) << offset);
> >> +
> >> + bitset[word] &= mask;
> >> +}
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Set all bits in the bitset.
> >> + *
> >> + * @param bitset
> >> + * A pointer to the array of words making up the bitset.
> >> + * @param size
> >> + * The size of the bitset (in bits).
> >> + */
> >> +
> >> +__rte_experimental
> >> +static inline void
> >> +rte_bitset_set_all(uint64_t *bitset, size_t size)
> >> +{
> >> + memset(bitset, 0xFF, RTE_BITSET_SIZE(size));
> >> +}
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Clear all bits in the bitset.
> >> + *
> >> + * @param bitset
> >> + * A pointer to the array of words making up the bitset.
> >> + * @param size
> >> + * The size of the bitset (in bits).
> >> + */
> >> +
> >> +__rte_experimental
> >> +static inline void
> >> +rte_bitset_clear_all(uint64_t *bitset, size_t size)
> >> +{
> >> + rte_bitset_init(bitset, size);
> >> +}
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Count all set bits.
> >> + *
> >> + * @param bitset
> >> + * A pointer to the array of words making up the bitset.
> >> + * @param size
> >> + * The size of the bitset (in bits).
> >> + * @return
> >> + * Returns the number of '1' bits in the bitset.
> >> + */
> >> +
> >> +__rte_experimental
> >> +static inline size_t
> >> +rte_bitset_count_set(const uint64_t *bitset, size_t size)
> >> +{
> >> + size_t i;
> >> + size_t total = 0;
> >> + uint64_t unused_mask;
> >> +
> >> + /*
> >> + * Unused bits in a rte_bitset are always '0', and thus are
> >> + * not included in this count.
> >> + */
> >> + for (i = 0; i < RTE_BITSET_NUM_WORDS(size) - 1; i++)
> >> + total += __builtin_popcountll(bitset[i]);
> >> +
> >> + unused_mask = UINT64_MAX >> __RTE_BITSET_UNUSED(size);
> >> + total += __builtin_popcountll(bitset[i] & unused_mask);
> >> +
> >> + return total;
> >> +}
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Count all cleared bits.
> >> + *
> >> + * @param bitset
> >> + * A pointer to the array of words making up the bitset.
> >> + * @param size
> >> + * The size of the bitset (in bits).
> >> + * @return
> >> + * Returns the number of '0' bits in the bitset.
> >> + */
> >> +
> >> +__rte_experimental
> >> +static inline size_t
> >> +rte_bitset_count_clear(const uint64_t *bitset, size_t size)
> >> +{
> >> + return size - rte_bitset_count_set(bitset, size);
> >> +}
> >> +
> >> +/**
> >> + * @warning
> >> + * @b EXPERIMENTAL: this API may change without prior notice.
> >> + *
> >> + * Test if a bit is set.
> >> + *
> >> + * @param bitset
> >> + * A pointer to the array of words making up the bitset.
> >> + * @param bit_num
> >> + * Index of the bit to test. Index 0 is the least significant bit.
> >> + * @return
> >> + * Returns true if the bit is '1', and false if the bit is '0'.
> >> + */
> >> +
> >> +__rte_experimental
> >> +static inline bool
> >> +rte_bitset_test(const uint64_t *bitset, size_t bit_num)
> >> +{
> >> + size_t word;
> >> + size_t offset;
> >> +
> >> + word = __RTE_BITSET_WORD_IDX(bit_num);
> >> + offset = __RTE_BITSET_BIT_OFFSET(bit_num);
> >> +
> >> + return (bitset[word] >> offset) & 1;
> >> +}
> >> +
> >> +#define __RTE_BITSET_FIND_FLAG_FIND_CLEAR (1U << 0)
> >> +#define __RTE_BITSET_FIND_FLAG_WRAP (1U << 1)
> >> +
> >> +__rte_experimental
> >> +static inline ssize_t
> >> +__rte_bitset_find_nowrap(const uint64_t *bitset, size_t __rte_unused size,
> >> + size_t start_bit, size_t len, bool find_clear)
> >
> > ^^ seems like the intent here is for this to be internal (not part of
> > public api) i wonder do we have to mark it __rte_experimental?
> >
> > or better can we prevent visibility / consumption outside of the
> > inline functions in this header?
> >
>
> It is supposed to be internal (although you can question the DPDK habit
> of using the reserved __ namespace for such symbols).
yes, i've made occasional comments on this. rightly or wrongly it seems
settled.
>
> I don't know of any way to prevent its use. I'm not sure it's needed.
> Those who don't heed the "__" prefix warning are on their own.
fair enough, i was thinking we could tuck it inside some preprocessor
block and expand it within the scope of the header where used and then
undef.
but if consumers aren't smart enough to avoid __foo they probably
wouldn't hesitate to define FOO to get internal_foo anyway.
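as an illustration, the define/use/#undef pattern being suggested would look
roughly like this (illustrative names only, not the actual rte_bitset code):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Helper macro, intended to be visible only up to the #undef below. */
#define __BITSET_WORD_IDX(bit) ((bit) / 64)

static inline int
bitset_test(const uint64_t *bitset, size_t bit)
{
	return (bitset[__BITSET_WORD_IDX(bit)] >> (bit % 64)) & 1;
}

/* Past this point, the helper can no longer be (accidentally) consumed. */
#undef __BITSET_WORD_IDX
```

one caveat is that a helper needed by inline functions further down the
header (or by other headers) can't be #undef'd early, which limits where
the pattern applies.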
>
> >> +{
> >> + size_t word_idx;
> >> + size_t offset;
> >> + size_t end_bit = start_bit + len;
> >> +
> >> + RTE_ASSERT(end_bit <= size);
> >> +
> >> + if (unlikely(len == 0))
> >> + return -1;
> >> +
> >> + word_idx = __RTE_BITSET_WORD_IDX(start_bit);
> >> + offset = __RTE_BITSET_BIT_OFFSET(start_bit);
> >> +
> >> + while (word_idx <= __RTE_BITSET_WORD_IDX(end_bit - 1)) {
> >> + uint64_t word;
> >> + int word_ffs;
> >> +
> >> + word = bitset[word_idx];
> >> + if (find_clear)
> >> + word = ~word;
> >> +
> >> + word >>= offset;
> >> +
> >> + word_ffs = __builtin_ffsll(word);
> >> +
> >> + if (word_ffs != 0) {
> >> + ssize_t ffs = start_bit + word_ffs - 1;
> >> +
> >> + /*
> >> + * Check if the found set bit was among the
> >> + * unused bits in the last word.
> >> + */
> >> + if (unlikely(ffs >= (ssize_t)end_bit))
> >> + return -1;
> >> +
> >> + return ffs;
> >> + }
> >> +
> >> + start_bit += (RTE_BITSET_WORD_BITS - offset);
> >> + word_idx++;
> >> + offset = 0;
> >> + }
> >> +
> >> + return -1;
> >> +
> >> +}
> >> +
>
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC 2/2] eal: add high-performance timer facility
2023-02-28 9:39 ` [RFC 2/2] eal: add high-performance timer facility Mattias Rönnblom
@ 2023-03-05 17:25 ` Stephen Hemminger
2023-03-09 15:20 ` Mattias Rönnblom
0 siblings, 1 reply; 31+ messages in thread
From: Stephen Hemminger @ 2023-03-05 17:25 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist
On Tue, 28 Feb 2023 10:39:16 +0100
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> The htimer library attempts to provide a timer facility with roughly
> the same functionality as, but less overhead and better scalability
> than, the DPDK timer library.
>
> The htimer library employs per-lcore hierarchical timer wheels and a
> message-based synchronization/MT-safety scheme.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
I like this but:
- need to have one timer infrastructure, having multiple will lead to confusion
and overlap in user applications. I.e can they be mixed, what happens if X and Y...
- best to keep original API available.
- ok to drop the rte_alt_timer since it was always experimental.
- would be good to have API using consistent known time (nanoseconds?) rather
than TSC cycles.
- there could be use cases for REALTIME as well as MONOTONIC types.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC 2/2] eal: add high-performance timer facility
2023-03-05 17:25 ` Stephen Hemminger
@ 2023-03-09 15:20 ` Mattias Rönnblom
0 siblings, 0 replies; 31+ messages in thread
From: Mattias Rönnblom @ 2023-03-09 15:20 UTC (permalink / raw)
To: Stephen Hemminger, Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist
On 2023-03-05 18:25, Stephen Hemminger wrote:
> On Tue, 28 Feb 2023 10:39:16 +0100
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>
>> The htimer library attempts to provide a timer facility with roughly
>> the same functionality as, but less overhead and better scalability
>> than, the DPDK timer library.
>>
>> The htimer library employs per-lcore hierarchical timer wheels and a
>> message-based synchronization/MT-safety scheme.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>
> I like this but:
> - need to have one timer infrastructure, having multiple will lead to confusion
> and overlap in user applications. I.e can they be mixed, what happens if X and Y...
>
> - best to keep original API available.
>
My thoughts on this are that we first converge on the proper future API,
and then see how close it is to <rte_timer.h>, and whether a shim layer
or something of that sort is feasible.
> - ok to drop the rte_alt_timer since it was always experimental.
>
> - would be good to have API using consistent known time (nanoseconds?) rather
> than TSC cycles.
>
I will update the RFC to use nanoseconds, at least as the default time unit.
> - there could be use cases for REALTIME as well as MONOTONIC types.
The current HTW implementation requires monotonic time (i.e., time can't
go backwards). Realtime clocks typically aren't monotonically increasing.
A naive implementation of backward time travel would be to just
reschedule all timers.
Would you use one HTW instance per clock?
This makes me think you want a small <rte_time.h> (!= <rte_timer.h>)
time source API as well. It would, among other things, define the clock
ids to use in the rte_htimer_mgr_add() calls. It would also be a good
place in case you want to cache the result of rdtsc instructions (to
have a more coarse-grained, but more efficient, clock source).
Maybe rte_get_timer_cycles()/hz() (as opposed to rte_get_tsc_cycles())
is a minimalistic attempt in this direction already?
^ permalink raw reply [flat|nested] 31+ messages in thread
* [RFC v2 0/2] Add high-performance timer facility
2023-02-28 9:39 [RFC 0/2] Add high-performance timer facility Mattias Rönnblom
` (2 preceding siblings ...)
2023-02-28 16:01 ` [RFC 0/2] Add " Morten Brørup
@ 2023-03-15 17:03 ` Mattias Rönnblom
2023-03-15 17:03 ` [RFC v2 1/2] eal: add bitset type Mattias Rönnblom
` (2 more replies)
3 siblings, 3 replies; 31+ messages in thread
From: Mattias Rönnblom @ 2023-03-15 17:03 UTC (permalink / raw)
To: dev
Cc: Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Stephen Hemminger, Morten Brørup,
Tyler Retzlaff, Mattias Rönnblom
This patchset is an attempt to introduce a high-performance, highly
scalable timer facility into DPDK.
More specifically, the goals for the htimer library are:
* Efficient handling of a handful up to hundreds of thousands of
concurrent timers.
* Make adding and canceling timers low-overhead, constant-time
operations.
* Provide a service functionally equivalent to that of
<rte_timer.h>. API/ABI backward compatibility is secondary.
In the author's opinion, there are two main shortcomings with the
current DPDK timer library (i.e., rte_timer.[ch]).
One is the synchronization overhead, where heavy-weight full-barrier
type synchronization is used. rte_timer.c uses per-EAL/lcore skip
lists, but any thread may add or cancel (or otherwise access) timers
managed by another lcore (and thus reside in its timer skip list).
The other is an algorithmic shortcoming, with rte_timer.c's reliance
on a skip list, which is less efficient than certain alternatives.
This patchset implements a hierarchical timer wheel (HWT, in
rte_htw.c), as per the Varghese and Lauck paper "Hashed and
Hierarchical Timing Wheels: Data Structures for the Efficient
Implementation of a Timer Facility". A HWT is a data structure
purposely designed for this task, and is used by many operating system
kernel timer facilities.
To further improve the solution described by Varghese and Lauck, a
bitset is placed in front of each of the timer wheels in the HWT,
reducing the overhead of rte_htimer_mgr_manage() (i.e., progressing time
and expiry processing).
Cycle-efficient scanning and manipulation of these bitsets are crucial
for the HWT's performance.
The htimer module keeps a per-lcore (or per-registered EAL thread) HWT
instance, much like rte_timer.c keeps a per-lcore skip list.
To avoid expensive synchronization overhead for thread-local timer
management, the HWTs are accessed only from the "owning" thread. Any
interaction another thread has with a particular lcore's timer
wheel goes over a set of DPDK rings. A side-effect of this design is
that all operations working toward a "remote" HWT must be
asynchronous.
The <rte_htimer.h> API is available only to EAL threads and registered
non-EAL threads.
The htimer API allows the application to supply the current time,
which is useful in case it has already retrieved this for other purposes,
saving the cost of a rdtsc instruction (or its equivalent).
Adding a timer with a relative timeout does not retrieve a new time,
but reuses the current time (as known via, and at the time of, the
manage call), again to shave off some cycles of overhead.
A semantic improvement compared to the <rte_timer.h> API is that the
htimer library can give a definite answer to the question of whether
the timer expiry callback was called, after a timer has been canceled.
The patchset includes a performance test case
'timer_htimer_htw_perf_autotest', which compares rte_timer, rte_htimer
and rte_htw timers in the same scenario.
'timer_htimer_htw_perf_autotest' suggests that rte_htimer is ~3-5x
faster than rte_timer for timer/timeout-heavy applications, in a
scenario where the timer always fires. For a scenario with a mix of
canceled and expired timers, the performance difference is greater.
In scenarios with few timeouts, rte_timer has lower overhead than
htimer, but both variants consume very little CPU time.
In certain scenarios, rte_timer does not suffer from its
non-constant-time add and cancel operations. One such scenario is when
the added timer is always last in the list, in which case htimer is
only ~2-3x faster.
The bitset implementation on which the HWT implementation depends
seemed generic enough, and potentially useful outside the world of
HWTs, to justify being located in the EAL.
This patchset is very much an RFC, and the author is yet to form an
opinion on many important issues.
* If deemed a suitable replacement, should the htimer replace the
current DPDK timer library in some particular (ABI-breaking)
release, or should it live side-by-side with the then-legacy
<rte_timer.h> API? A lot of things in and outside DPDK depend on
<rte_timer.h>, so coexistence may be required to facilitate a smooth
transition.
* Should the htimer and htw-related files be colocated with rte_timer.c
in the timer library?
* Would it be useful for applications using asynchronous cancel to
have the option of having the timer callback run not only in case of
timer expiration, but also cancellation (on the target lcore)? The
timer cb signature would need to include an additional parameter in
that case.
* Should the rte_htimer be a nested struct, so that the htw parts are
separated from the htimer parts?
* <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
<rte_htw.h> may avoid a dependency on <rte_htimer_mgr.h>. Should it
be so?
* The rte_htimer struct is only supposed to be used by the application
to give an indication of how much memory it needs to allocate, and
its members are not supposed to be accessed directly (with the
possible exception of the owner_lcore_id field). Should there be a dummy
struct, or a #define RTE_HTIMER_MEMSIZE or a rte_htimer_get_memsize()
function instead, serving the same purpose? Better encapsulation,
but more inconvenient for applications. Run-time dynamic sizing
would force application-level dynamic allocations.
* Asynchronous cancellation is a little tricky to use for the
application (primarily due to timer memory reclamation/race
issues). Should this functionality be removed?
* Should rte_htimer_mgr_init() also retrieve the current time? If so,
there should be a variant which allows the user to specify the
time (to match rte_htimer_mgr_manage_time()). One pitfall with the
current proposed API is an application calling rte_htimer_mgr_init()
and then immediately adding a timer with a relative timeout, in
which case the current absolute time used is 0, which might be a
surprise.
* Would the event timer adapter be best off using <rte_htw.h>
directly, or <rte_htimer.h>? In the latter case, there needs to be a
way to instantiate more HWTs (similar to the "alt" functions of
<rte_timer.h>).
* Should the PERIODICAL flag (and the complexity it brings) be
removed, leaving the application with only single-shot timers and
the option to re-add them in the timer callback?
* Should the async result codes and the sync cancel error codes be merged
into one set of result codes?
* Should rte_htimer_mgr_async_add() have a flag which allows
buffering add request messages until rte_htimer_mgr_process() is
called? Or any manage function. Would reduce ring signaling overhead
(i.e., burst enqueue operations instead of single-element
enqueue). Could also be a rte_htimer_mgr_async_add_burst() function,
solving the same "problem" a different way. (The signature of such
a function would not be pretty.)
* Does the functionality provided by the rte_htimer_mgr_process()
function match its use cases? Should there be a clearer
separation between expiry processing and asynchronous operation
processing?
* Should the patchset be split into more commits? If so, how?
Thanks to Erik Carrillo for his assistance.
Mattias Rönnblom (2):
eal: add bitset type
eal: add high-performance timer facility
app/test/meson.build | 12 +-
app/test/test_bitset.c | 645 +++++++++++++++++++
app/test/test_htimer_mgr.c | 674 ++++++++++++++++++++
app/test/test_htimer_mgr_perf.c | 322 ++++++++++
app/test/test_htw.c | 478 ++++++++++++++
app/test/test_htw_perf.c | 181 ++++++
app/test/test_timer_htimer_htw_perf.c | 693 ++++++++++++++++++++
doc/api/doxy-api-index.md | 5 +-
doc/api/doxy-api.conf.in | 1 +
lib/eal/common/meson.build | 1 +
lib/eal/common/rte_bitset.c | 29 +
lib/eal/include/meson.build | 1 +
lib/eal/include/rte_bitset.h | 879 ++++++++++++++++++++++++++
lib/eal/version.map | 3 +
lib/htimer/meson.build | 7 +
lib/htimer/rte_htimer.h | 68 ++
lib/htimer/rte_htimer_mgr.c | 547 ++++++++++++++++
lib/htimer/rte_htimer_mgr.h | 516 +++++++++++++++
lib/htimer/rte_htimer_msg.h | 44 ++
lib/htimer/rte_htimer_msg_ring.c | 18 +
lib/htimer/rte_htimer_msg_ring.h | 55 ++
lib/htimer/rte_htw.c | 445 +++++++++++++
lib/htimer/rte_htw.h | 49 ++
lib/htimer/version.map | 17 +
lib/meson.build | 1 +
25 files changed, 5689 insertions(+), 2 deletions(-)
create mode 100644 app/test/test_bitset.c
create mode 100644 app/test/test_htimer_mgr.c
create mode 100644 app/test/test_htimer_mgr_perf.c
create mode 100644 app/test/test_htw.c
create mode 100644 app/test/test_htw_perf.c
create mode 100644 app/test/test_timer_htimer_htw_perf.c
create mode 100644 lib/eal/common/rte_bitset.c
create mode 100644 lib/eal/include/rte_bitset.h
create mode 100644 lib/htimer/meson.build
create mode 100644 lib/htimer/rte_htimer.h
create mode 100644 lib/htimer/rte_htimer_mgr.c
create mode 100644 lib/htimer/rte_htimer_mgr.h
create mode 100644 lib/htimer/rte_htimer_msg.h
create mode 100644 lib/htimer/rte_htimer_msg_ring.c
create mode 100644 lib/htimer/rte_htimer_msg_ring.h
create mode 100644 lib/htimer/rte_htw.c
create mode 100644 lib/htimer/rte_htw.h
create mode 100644 lib/htimer/version.map
--
2.34.1
^ permalink raw reply [flat|nested] 31+ messages in thread
* [RFC v2 1/2] eal: add bitset type
2023-03-15 17:03 ` [RFC v2 " Mattias Rönnblom
@ 2023-03-15 17:03 ` Mattias Rönnblom
2023-03-15 17:20 ` Stephen Hemminger
2023-03-15 17:03 ` [RFC v2 2/2] eal: add high-performance timer facility Mattias Rönnblom
2024-10-03 18:36 ` [RFC v2 0/2] Add " Stephen Hemminger
2 siblings, 1 reply; 31+ messages in thread
From: Mattias Rönnblom @ 2023-03-15 17:03 UTC (permalink / raw)
To: dev
Cc: Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Stephen Hemminger, Morten Brørup,
Tyler Retzlaff, Mattias Rönnblom
Introduce a set of functions and macros that operate on sets of bits,
kept in arrays of 64-bit elements.
RTE bitset is designed for bitsets which are larger than what fits in
a single machine word (i.e., 64 bits). For very large bitsets, the
<rte_bitmap.h> API may be a more appropriate choice.
RFC v2:
* Replaced <sys/types.h> with <stddef.h> include, to properly get
size_t typedef.
* Add <rte_compat.h> to get __rte_experimental in <rte_bitset.h>.
Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
app/test/meson.build | 4 +-
app/test/test_bitset.c | 645 +++++++++++++++++++++++++
lib/eal/common/meson.build | 1 +
lib/eal/common/rte_bitset.c | 29 ++
lib/eal/include/meson.build | 1 +
lib/eal/include/rte_bitset.h | 879 +++++++++++++++++++++++++++++++++++
lib/eal/version.map | 3 +
7 files changed, 1561 insertions(+), 1 deletion(-)
create mode 100644 app/test/test_bitset.c
create mode 100644 lib/eal/common/rte_bitset.c
create mode 100644 lib/eal/include/rte_bitset.h
diff --git a/app/test/meson.build b/app/test/meson.build
index f34d19e3c3..03811ff692 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -13,8 +13,9 @@ test_sources = files(
'test_alarm.c',
'test_atomic.c',
'test_barrier.c',
- 'test_bitops.c',
'test_bitmap.c',
+ 'test_bitset.c',
+ 'test_bitops.c',
'test_bpf.c',
'test_byteorder.c',
'test_cksum.c',
@@ -164,6 +165,7 @@ fast_tests = [
['bpf_autotest', true, true],
['bpf_convert_autotest', true, true],
['bitops_autotest', true, true],
+ ['bitset_autotest', true, true],
['byteorder_autotest', true, true],
['cksum_autotest', true, true],
['cmdline_autotest', true, true],
diff --git a/app/test/test_bitset.c b/app/test/test_bitset.c
new file mode 100644
index 0000000000..8c75b38575
--- /dev/null
+++ b/app/test/test_bitset.c
@@ -0,0 +1,645 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_random.h>
+
+#include <rte_bitset.h>
+
+#include "test.h"
+
+#define MAGIC UINT64_C(0xdeadbeefdeadbeef)
+
+static void
+rand_buf(void *buf, size_t n)
+{
+ size_t i;
+
+ for (i = 0; i < n; i++)
+ ((char *)buf)[i] = (char)rte_rand();
+}
+
+static uint64_t *
+alloc_bitset(size_t size)
+{
+ uint64_t *p;
+
+ p = malloc(RTE_BITSET_SIZE(size) + 2 * sizeof(uint64_t));
+
+ if (p == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ rand_buf(&p[0], RTE_BITSET_SIZE(size));
+
+ p[0] = MAGIC;
+ p[RTE_BITSET_NUM_WORDS(size) + 1] = MAGIC;
+
+ return p + 1;
+}
+
+
+static int
+free_bitset(uint64_t *bitset, size_t size)
+{
+ uint64_t *p;
+
+ p = bitset - 1;
+
+ if (p[0] != MAGIC)
+ return TEST_FAILED;
+
+ if (p[RTE_BITSET_NUM_WORDS(size) + 1] != MAGIC)
+ return TEST_FAILED;
+
+ free(p);
+
+ return TEST_SUCCESS;
+}
+
+static bool
+rand_bool(void)
+{
+ return rte_rand_max(2);
+}
+
+static void
+rand_bool_ary(bool *ary, size_t len)
+{
+ size_t i;
+
+ for (i = 0; i < len; i++)
+ ary[i] = rand_bool();
+}
+
+static int
+test_test_set_size(size_t size)
+{
+ size_t i;
+ bool reference[size];
+ uint64_t *bitset;
+
+ rand_bool_ary(reference, size);
+
+ bitset = alloc_bitset(size);
+
+ if (bitset == NULL)
+ return TEST_FAILED;
+
+ rte_bitset_init(bitset, size);
+
+ for (i = 0; i < size; i++) {
+ if (reference[i])
+ rte_bitset_set(bitset, i);
+ else
+ rte_bitset_clear(bitset, i);
+ }
+
+ for (i = 0; i < size; i++)
+ if (reference[i] != rte_bitset_test(bitset, i))
+ return TEST_FAILED;
+
+ if (free_bitset(bitset, size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+#define RAND_ITERATIONS (10000)
+#define RAND_SET_MAX_SIZE (1000)
+
+static int
+test_test_set(void)
+{
+ size_t i;
+
+ for (i = 0; i < RAND_ITERATIONS; i++) {
+ size_t size = 1 + rte_rand_max(RAND_SET_MAX_SIZE - 1);
+
+ if (test_test_set_size(size) != TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+static ssize_t
+find(const bool *ary, size_t num_bools, size_t start, size_t len, bool set)
+{
+ size_t i;
+
+ for (i = 0; i < len; i++) {
+ ssize_t idx = (start + i) % num_bools;
+
+ if (ary[idx] == set)
+ return idx;
+ }
+
+ return -1;
+}
+
+static ssize_t
+find_set(const bool *ary, size_t num_bools, size_t start, size_t len)
+{
+ return find(ary, num_bools, start, len, true);
+}
+
+static ssize_t
+find_clear(const bool *ary, size_t num_bools, size_t start, size_t len)
+{
+ return find(ary, num_bools, start, len, false);
+}
+
+#define FFS_ITERATIONS (100)
+
+static int
+test_find_size(size_t size, bool set)
+{
+ uint64_t *bitset;
+ bool reference[size];
+ size_t i;
+
+ bitset = alloc_bitset(size);
+
+ if (bitset == NULL)
+ return TEST_FAILED;
+
+ rte_bitset_init(bitset, size);
+
+ for (i = 0; i < size; i++) {
+ bool bit = rand_bool();
+ reference[i] = bit;
+
+ if (bit)
+ rte_bitset_set(bitset, i);
+ else /* redundant, still useful for testing */
+ rte_bitset_clear(bitset, i);
+ }
+
+ for (i = 0; i < FFS_ITERATIONS; i++) {
+ size_t start_bit = rte_rand_max(size);
+ size_t len = rte_rand_max(size + 1);
+ bool full_range = len == size && start_bit == 0;
+ bool wraps = start_bit + len > size;
+ ssize_t rc;
+
+ if (set) {
+ if (full_range && rand_bool())
+ rc = rte_bitset_find_first_set(bitset,
+ size);
+ else if (wraps || rand_bool()) {
+ rc = rte_bitset_find_set_wrap(bitset, size,
+ start_bit, len);
+
+ } else
+ rc = rte_bitset_find_set(bitset, size,
+ start_bit, len);
+
+ if (rc != find_set(reference, size, start_bit,
+ len))
+ return TEST_FAILED;
+ } else {
+ if (full_range && rand_bool())
+ rc = rte_bitset_find_first_clear(bitset,
+ size);
+ else if (wraps || rand_bool())
+ rc = rte_bitset_find_clear_wrap(bitset,
+ size,
+ start_bit, len);
+ else
+ rc = rte_bitset_find_clear(bitset, size,
+ start_bit, len);
+
+ if (rc != find_clear(reference, size, start_bit,
+ len))
+ return TEST_FAILED;
+ }
+
+ }
+
+ if (free_bitset(bitset, size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_find_set_size(size_t size)
+{
+ return test_find_size(size, true);
+}
+
+static int
+test_find_clear_size(size_t size)
+{
+ return test_find_size(size, false);
+}
+
+static int
+test_find(void)
+{
+ size_t i;
+
+ for (i = 0; i < RAND_ITERATIONS; i++) {
+ size_t size = 2 + rte_rand_max(RAND_SET_MAX_SIZE - 2);
+
+ if (test_find_set_size(size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_find_clear_size(size) != TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+static int
+record_match(ssize_t match_idx, size_t size, int *calls)
+{
+ if (match_idx < 0 || (size_t)match_idx >= size)
+ return TEST_FAILED;
+
+ calls[match_idx]++;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_foreach_size(ssize_t size, bool may_wrap, bool set)
+{
+ bool reference[size];
+ int calls[size];
+ uint64_t *bitset;
+ ssize_t i;
+ ssize_t start_bit;
+ ssize_t len;
+ bool full_range;
+ size_t total_calls = 0;
+
+ rand_bool_ary(reference, size);
+
+ bitset = alloc_bitset(size);
+
+ if (bitset == NULL)
+ return TEST_FAILED;
+
+ memset(calls, 0, sizeof(calls));
+
+ start_bit = rte_rand_max(size);
+ len = may_wrap ? rte_rand_max(size + 1) :
+ rte_rand_max(size - start_bit + 1);
+
+ rte_bitset_init(bitset, size);
+
+ /* random data in the unused bits should not matter */
+ rand_buf(bitset, RTE_BITSET_SIZE(size));
+
+ for (i = start_bit; i < start_bit + len; i++) {
+ size_t idx = i % size;
+
+ if (reference[idx])
+ rte_bitset_set(bitset, idx);
+ else
+ rte_bitset_clear(bitset, idx);
+
+ if (rte_bitset_test(bitset, idx) != reference[idx])
+ return TEST_FAILED;
+ }
+
+ full_range = (len == size && start_bit == 0);
+
+ /* XXX: verify iteration order as well */
+ if (set) {
+ if (full_range && rand_bool()) {
+ RTE_BITSET_FOREACH_SET(i, bitset, size) {
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+ } else if (may_wrap) {
+ RTE_BITSET_FOREACH_SET_WRAP(i, bitset, size,
+ start_bit, len) {
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS) {
+ printf("failed\n");
+ return TEST_FAILED;
+ }
+ }
+ } else {
+ RTE_BITSET_FOREACH_SET_RANGE(i, bitset, size,
+ start_bit, len) {
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+ }
+ } else {
+ if (full_range && rand_bool()) {
+ RTE_BITSET_FOREACH_CLEAR(i, bitset, size)
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ } else if (may_wrap) {
+ RTE_BITSET_FOREACH_CLEAR_WRAP(i, bitset, size,
+ start_bit, len) {
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+ } else {
+ RTE_BITSET_FOREACH_CLEAR_RANGE(i, bitset, size,
+ start_bit, len)
+ if (record_match(i, size, calls) !=
+ TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+ }
+
+ for (i = 0; i < len; i++) {
+ size_t idx = (start_bit + i) % size;
+
+ if (reference[idx] == set && calls[idx] != 1) {
+ printf("bit %zd shouldn't have been found %d "
+ "times\n", idx, calls[idx]);
+ return TEST_FAILED;
+ }
+
+ if (reference[idx] != set && calls[idx] != 0) {
+ puts("bar");
+ return TEST_FAILED;
+ }
+
+ total_calls += calls[idx];
+ }
+
+ if (full_range) {
+ size_t count;
+
+ count = set ? rte_bitset_count_set(bitset, size) :
+ rte_bitset_count_clear(bitset, size);
+
+ if (count != total_calls)
+ return TEST_FAILED;
+ }
+
+ if (free_bitset(bitset, size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_foreach(void)
+{
+ size_t i;
+
+ for (i = 0; i < RAND_ITERATIONS; i++) {
+ size_t size = 1 + rte_rand_max(RAND_SET_MAX_SIZE - 1);
+
+ if (test_foreach_size(size, false, true) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_foreach_size(size, false, false) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_foreach_size(size, true, true) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_foreach_size(size, true, false) != TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_count_size(size_t size)
+{
+ uint64_t *bitset;
+
+ bitset = alloc_bitset(size);
+
+ if (bitset == NULL)
+ return TEST_FAILED;
+
+ rte_bitset_init(bitset, size);
+
+ if (rte_bitset_count_set(bitset, size) != 0)
+ return TEST_FAILED;
+
+ if (rte_bitset_count_clear(bitset, size) != size)
+ return TEST_FAILED;
+
+ rte_bitset_set_all(bitset, size);
+
+ if (rte_bitset_count_set(bitset, size) != size)
+ return TEST_FAILED;
+
+ if (rte_bitset_count_clear(bitset, size) != 0)
+ return TEST_FAILED;
+
+ rte_bitset_clear_all(bitset, size);
+
+ if (rte_bitset_count_set(bitset, size) != 0)
+ return TEST_FAILED;
+
+ if (rte_bitset_count_clear(bitset, size) != size)
+ return TEST_FAILED;
+
+ rte_bitset_set(bitset, rte_rand_max(size));
+
+ if (rte_bitset_count_set(bitset, size) != 1)
+ return TEST_FAILED;
+
+ if (rte_bitset_count_clear(bitset, size) != (size - 1))
+ return TEST_FAILED;
+
+ rte_bitset_clear_all(bitset, size);
+ if (rte_bitset_count_set(bitset, size) != 0)
+ return TEST_FAILED;
+ if (rte_bitset_count_clear(bitset, size) != size)
+ return TEST_FAILED;
+
+ rte_bitset_set_all(bitset, size);
+ if (rte_bitset_count_set(bitset, size) != size)
+ return TEST_FAILED;
+ if (rte_bitset_count_clear(bitset, size) != 0)
+ return TEST_FAILED;
+
+ if (free_bitset(bitset, size) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_count(void)
+{
+ size_t i;
+
+ if (test_count_size(128) != TEST_SUCCESS)
+ return TEST_FAILED;
+ if (test_count_size(1) != TEST_SUCCESS)
+ return TEST_FAILED;
+ if (test_count_size(63) != TEST_SUCCESS)
+ return TEST_FAILED;
+ if (test_count_size(64) != TEST_SUCCESS)
+ return TEST_FAILED;
+ if (test_count_size(65) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ for (i = 0; i < RAND_ITERATIONS; i++) {
+ size_t size = 1 + rte_rand_max(RAND_SET_MAX_SIZE - 1);
+
+ if (test_count_size(size) != TEST_SUCCESS)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+#define GEN_DECLARE(size) \
+ { \
+ RTE_BITSET_DECLARE(bitset, size); \
+ size_t idx; \
+ \
+ idx = rte_rand_max(size); \
+ rte_bitset_init(bitset, size); \
+ \
+ rte_bitset_set(bitset, idx); \
+ if (!rte_bitset_test(bitset, idx)) \
+ return TEST_FAILED; \
+ if (rte_bitset_count_set(bitset, size) != 1) \
+ return TEST_FAILED; \
+ return TEST_SUCCESS; \
+ }
+
+static int
+test_define(void)
+{
+ GEN_DECLARE(1);
+ GEN_DECLARE(64);
+ GEN_DECLARE(65);
+ GEN_DECLARE(4097);
+}
+
+static int
+test_equal(void)
+{
+ const size_t size = 100;
+ RTE_BITSET_DECLARE(bitset_a, size);
+ RTE_BITSET_DECLARE(bitset_b, size);
+
+ rand_buf(bitset_a, RTE_BITSET_SIZE(size));
+ rand_buf(bitset_b, RTE_BITSET_SIZE(size));
+
+ rte_bitset_init(bitset_a, size);
+ rte_bitset_init(bitset_b, size);
+
+ rte_bitset_set(bitset_a, 9);
+ rte_bitset_set(bitset_b, 9);
+ rte_bitset_set(bitset_a, 90);
+ rte_bitset_set(bitset_b, 90);
+
+ if (!rte_bitset_equal(bitset_a, bitset_b, size))
+ return TEST_FAILED;
+
+ /* set unused bit, which should be ignored */
+ rte_bitset_set(&bitset_a[1], 60);
+
+ if (!rte_bitset_equal(bitset_a, bitset_b, size))
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_copy(void)
+{
+ const size_t size = 100;
+ RTE_BITSET_DECLARE(bitset_a, size);
+ RTE_BITSET_DECLARE(bitset_b, size);
+
+ rand_buf(bitset_a, RTE_BITSET_SIZE(size));
+ rand_buf(bitset_b, RTE_BITSET_SIZE(size));
+
+ if (rte_bitset_equal(bitset_a, bitset_b, size))
+ return TEST_FAILED;
+
+ rte_bitset_copy(bitset_a, bitset_b, size);
+
+ if (!rte_bitset_equal(bitset_a, bitset_b, size))
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_to_str(void)
+{
+ char buf[1024];
+ RTE_BITSET_DECLARE(bitset, 128);
+
+ rte_bitset_init(bitset, 128);
+ rte_bitset_set(bitset, 1);
+
+ if (rte_bitset_to_str(bitset, 2, buf, 3) != 3)
+ return TEST_FAILED;
+ if (strcmp(buf, "10") != 0)
+ return TEST_FAILED;
+
+ rte_bitset_set(bitset, 0);
+
+ if (rte_bitset_to_str(bitset, 1, buf, sizeof(buf)) != 2)
+ return TEST_FAILED;
+ if (strcmp(buf, "1") != 0)
+ return TEST_FAILED;
+
+ rte_bitset_init(bitset, 99);
+ rte_bitset_set(bitset, 98);
+
+ if (rte_bitset_to_str(bitset, 99, buf, sizeof(buf)) != 100)
+ return TEST_FAILED;
+
+ if (buf[0] != '1' || strchr(&buf[1], '1') != NULL)
+ return TEST_FAILED;
+
+ if (rte_bitset_to_str(bitset, 128, buf, 64) != -EINVAL)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_bitset(void)
+{
+ if (test_test_set() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_find() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_foreach() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_count() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_define() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_equal() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_copy() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_to_str() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(bitset_autotest, test_bitset);
diff --git a/lib/eal/common/meson.build b/lib/eal/common/meson.build
index 917758cc65..687ae51d87 100644
--- a/lib/eal/common/meson.build
+++ b/lib/eal/common/meson.build
@@ -32,6 +32,7 @@ sources += files(
'eal_common_uuid.c',
'malloc_elem.c',
'malloc_heap.c',
+ 'rte_bitset.c',
'rte_malloc.c',
'rte_random.c',
'rte_reciprocal.c',
diff --git a/lib/eal/common/rte_bitset.c b/lib/eal/common/rte_bitset.c
new file mode 100644
index 0000000000..35e55a64db
--- /dev/null
+++ b/lib/eal/common/rte_bitset.c
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include <errno.h>
+
+#include "rte_bitset.h"
+
+ssize_t
+rte_bitset_to_str(const uint64_t *bitset, size_t num_bits, char *buf,
+ size_t capacity)
+{
+ size_t i;
+
+ if (capacity < (num_bits + 1))
+ return -EINVAL;
+
+ for (i = 0; i < num_bits; i++) {
+ bool value;
+
+ value = rte_bitset_test(bitset, num_bits - 1 - i);
+
+ buf[i] = value ? '1' : '0';
+ }
+
+ buf[num_bits] = '\0';
+
+ return num_bits + 1;
+}
diff --git a/lib/eal/include/meson.build b/lib/eal/include/meson.build
index b0db9b3b3a..fa3cb884e9 100644
--- a/lib/eal/include/meson.build
+++ b/lib/eal/include/meson.build
@@ -5,6 +5,7 @@ includes += include_directories('.')
headers += files(
'rte_alarm.h',
+ 'rte_bitset.h',
'rte_bitmap.h',
'rte_bitops.h',
'rte_branch_prediction.h',
diff --git a/lib/eal/include/rte_bitset.h b/lib/eal/include/rte_bitset.h
new file mode 100644
index 0000000000..4e59053b2d
--- /dev/null
+++ b/lib/eal/include/rte_bitset.h
@@ -0,0 +1,879 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_BITSET_H_
+#define _RTE_BITSET_H_
+
+/**
+ * @file
+ * RTE Bitset
+ *
+ * This file provides functions and macros for querying and
+ * manipulating sets of bits kept in arrays of @c uint64_t-sized
+ * elements.
+ *
+ * The bits in a bitset are numbered from 0 to @c size - 1, with the
+ * lowest index being the least significant bit.
+ *
+ * The bitset array must be properly aligned.
+ *
+ * For optimal performance, the @c size parameter, required by
+ * many of the API's functions, should be a compile-time constant.
+ *
+ * For large bitsets, the rte_bitmap.h API may be more appropriate.
+ *
+ * @warning
+ * All functions modifying a bitset may overwrite any unused bits of
+ * the last word. Such unused bits are ignored by all functions reading
+ * bits.
+ *
+ */
+
+#include <limits.h>
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_debug.h>
+#include <rte_memcpy.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+/**
+ * The size (in bytes) of each element in the array used to represent
+ * a bitset.
+ */
+#define RTE_BITSET_WORD_SIZE (sizeof(uint64_t))
+
+/**
+ * The size (in bits) of each element in the array used to represent
+ * a bitset.
+ */
+#define RTE_BITSET_WORD_BITS (RTE_BITSET_WORD_SIZE * CHAR_BIT)
+
+/**
+ * Computes the number of words required to store @c size bits.
+ */
+#define RTE_BITSET_NUM_WORDS(size) \
+ ((size + RTE_BITSET_WORD_BITS - 1) / RTE_BITSET_WORD_BITS)
+
+/**
+ * Computes the amount of memory (in bytes) required to fit a bitset
+ * holding @c size bits.
+ */
+#define RTE_BITSET_SIZE(size) \
+ ((size_t)(RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_SIZE))
+
+#define __RTE_BITSET_WORD_IDX(bit_num) ((bit_num) / RTE_BITSET_WORD_BITS)
+#define __RTE_BITSET_BIT_OFFSET(bit_num) ((bit_num) % RTE_BITSET_WORD_BITS)
+#define __RTE_BITSET_UNUSED(size) \
+ ((RTE_BITSET_NUM_WORDS(size) * RTE_BITSET_WORD_BITS) \
+ - (size))
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Declare a bitset.
+ *
+ * Declare (e.g., as a struct field) or define (e.g., as a stack
+ * variable) a bitset of the specified size.
+ *
+ * @param name
+ * The field or variable name of the resulting definition.
+ * @param size
+ * The number of bits the bitset must be able to represent. Must be
+ * a compile-time constant.
+ */
+#define RTE_BITSET_DECLARE(name, size) \
+ uint64_t name[RTE_BITSET_NUM_WORDS(size)]
+
+/* XXX: should one include flags here and use to avoid a comparison? */
+/* XXX: would this be better off as a function? */
+
+#define __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, len) \
+ ((len) - 1 - ((var) >= (start_bit) ? (var) - (start_bit) : \
+ (size) - (start_bit) + (var)))
+
+#define __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, flags) \
+ for ((var) = __rte_bitset_find(bitset, size, start_bit, len, \
+ flags); \
+ (var) != -1; \
+ (var) = __RTE_BITSET_FOREACH_LEFT(var, size, start_bit, \
+ len) > 0 ? \
+ __rte_bitset_find(bitset, size, \
+ ((var) + 1) % (size), \
+ __RTE_BITSET_FOREACH_LEFT(var, \
+ size, \
+ start_bit, \
+ len), \
+ flags) : -1)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Iterate over all bits set.
+ *
+ * This macro iterates over all bits set (i.e., all ones) in the
+ * bitset, in the forward direction (i.e., starting with the least
+ * significant '1').
+ *
+ * @param var
+ * An iterator variable of type @c ssize_t. For each successive
+ * iteration, this variable will hold the bit index of a set bit.
+ * @param bitset
+ * A <tt>const uint64_t *</tt> pointer to the bitset array.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+#define RTE_BITSET_FOREACH_SET(var, bitset, size) \
+ __RTE_BITSET_FOREACH(var, bitset, size, 0, size, 0)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Iterate over all bits cleared.
+ *
+ * This macro iterates over all bits cleared in the bitset, in the
+ * forward direction (i.e., starting with the lowest-indexed cleared bit).
+ *
+ * @param var
+ * An iterator variable of type @c ssize_t. For each successive iteration,
+ * this variable will hold the bit index of a cleared bit.
+ * @param bitset
+ * A <tt>const uint64_t *</tt> pointer to the bitset array.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+#define RTE_BITSET_FOREACH_CLEAR(var, bitset, size) \
+ __RTE_BITSET_FOREACH(var, bitset, size, 0, size, \
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Iterate over all bits set within a range.
+ *
+ * This macro iterates over all bits set (i.e., all ones) in the
+ * specified range, in the forward direction (i.e., starting with the
+ * least significant '1').
+ *
+ * @param var
+ * An iterator variable of type @c ssize_t. For each successive iteration,
+ * this variable will hold the bit index of a set bit.
+ * @param bitset
+ * A <tt>const uint64_t *</tt> pointer to the bitset array.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The length (in bits) of the range. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ */
+
+#define RTE_BITSET_FOREACH_SET_RANGE(var, bitset, size, start_bit, \
+ len) \
+ __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, 0)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Iterate over all cleared bits within a range.
+ *
+ * This macro iterates over all bits cleared (i.e., all zeroes) in the
+ * specified range, in the forward direction (i.e., starting with the
+ * least significant '0').
+ *
+ * @param var
+ * An iterator variable of type @c ssize_t. For each successive iteration,
+ * this variable will hold the bit index of a cleared bit.
+ * @param bitset
+ * A <tt>const uint64_t *</tt> pointer to the bitset array.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The length (in bits) of the range. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ */
+
+#define RTE_BITSET_FOREACH_CLEAR_RANGE(var, bitset, size, start_bit, \
+ len) \
+ __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
+
+#define RTE_BITSET_FOREACH_SET_WRAP(var, bitset, size, start_bit, \
+ len) \
+ __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
+ __RTE_BITSET_FIND_FLAG_WRAP)
+
+#define RTE_BITSET_FOREACH_CLEAR_WRAP(var, bitset, size, start_bit, \
+ len) \
+ __RTE_BITSET_FOREACH(var, bitset, size, start_bit, len, \
+ __RTE_BITSET_FIND_FLAG_WRAP | \
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR)
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Initializes a bitset.
+ *
+ * All bits are cleared.
+ *
+ * @param bitset
+ * A pointer to the array of bitset 64-bit words.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_init(uint64_t *bitset, size_t size)
+{
+ memset(bitset, 0, RTE_BITSET_SIZE(size));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set a bit in the bitset.
+ *
+ * Bits are numbered from 0 to (size - 1) (inclusive).
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param bit_num
+ * The index of the bit to be set.
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_set(uint64_t *bitset, size_t bit_num)
+{
+ size_t word;
+ size_t offset;
+ uint64_t mask;
+
+ word = __RTE_BITSET_WORD_IDX(bit_num);
+ offset = __RTE_BITSET_BIT_OFFSET(bit_num);
+ mask = UINT64_C(1) << offset;
+
+ bitset[word] |= mask;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Clear a bit in the bitset.
+ *
+ * Bits are numbered from 0 to (size - 1) (inclusive).
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param bit_num
+ * The index of the bit to be cleared.
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_clear(uint64_t *bitset, size_t bit_num)
+{
+ size_t word;
+ size_t offset;
+ uint64_t mask;
+
+ word = __RTE_BITSET_WORD_IDX(bit_num);
+ offset = __RTE_BITSET_BIT_OFFSET(bit_num);
+ mask = ~(UINT64_C(1) << offset);
+
+ bitset[word] &= mask;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Set all bits in the bitset.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_set_all(uint64_t *bitset, size_t size)
+{
+ memset(bitset, 0xFF, RTE_BITSET_SIZE(size));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Clear all bits in the bitset.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_clear_all(uint64_t *bitset, size_t size)
+{
+ rte_bitset_init(bitset, size);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Count all set bits.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @return
+ * Returns the number of '1' bits in the bitset.
+ */
+
+__rte_experimental
+static inline size_t
+rte_bitset_count_set(const uint64_t *bitset, size_t size)
+{
+ size_t i;
+ size_t total = 0;
+ uint64_t unused_mask;
+
+ /*
+ * Mask off any bits in the (partially used) last word which
+ * are beyond the end of the bitset, so they are not included
+ * in the count.
+ */
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size) - 1; i++)
+ total += __builtin_popcountll(bitset[i]);
+
+ unused_mask = UINT64_MAX >> __RTE_BITSET_UNUSED(size);
+ total += __builtin_popcountll(bitset[i] & unused_mask);
+
+ return total;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Count all cleared bits.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @return
+ * Returns the number of '0' bits in the bitset.
+ */
+
+__rte_experimental
+static inline size_t
+rte_bitset_count_clear(const uint64_t *bitset, size_t size)
+{
+ return size - rte_bitset_count_set(bitset, size);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Test if a bit is set.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param bit_num
+ * Index of the bit to test. Index 0 is the least significant bit.
+ * @return
+ * Returns true if the bit is '1', and false if the bit is '0'.
+ */
+
+__rte_experimental
+static inline bool
+rte_bitset_test(const uint64_t *bitset, size_t bit_num)
+{
+ size_t word;
+ size_t offset;
+
+ word = __RTE_BITSET_WORD_IDX(bit_num);
+ offset = __RTE_BITSET_BIT_OFFSET(bit_num);
+
+ return (bitset[word] >> offset) & 1;
+}
+
+#define __RTE_BITSET_FIND_FLAG_FIND_CLEAR (1U << 0)
+#define __RTE_BITSET_FIND_FLAG_WRAP (1U << 1)
+
+__rte_experimental
+static inline ssize_t
+__rte_bitset_find_nowrap(const uint64_t *bitset, size_t size __rte_unused,
+ size_t start_bit, size_t len, bool find_clear)
+{
+ size_t word_idx;
+ size_t offset;
+ size_t end_bit = start_bit + len;
+
+ RTE_ASSERT(end_bit <= size);
+
+ if (unlikely(len == 0))
+ return -1;
+
+ word_idx = __RTE_BITSET_WORD_IDX(start_bit);
+ offset = __RTE_BITSET_BIT_OFFSET(start_bit);
+
+ while (word_idx <= __RTE_BITSET_WORD_IDX(end_bit - 1)) {
+ uint64_t word;
+ int word_ffs;
+
+ word = bitset[word_idx];
+ if (find_clear)
+ word = ~word;
+
+ word >>= offset;
+
+ word_ffs = __builtin_ffsll(word);
+
+ if (word_ffs != 0) {
+ ssize_t ffs = start_bit + word_ffs - 1;
+
+ /*
+ * Check if the bit found lies beyond the end of
+ * the requested range, in the last word scanned.
+ */
+ if (unlikely(ffs >= (ssize_t)end_bit))
+ return -1;
+
+ return ffs;
+ }
+
+ start_bit += (RTE_BITSET_WORD_BITS - offset);
+ word_idx++;
+ offset = 0;
+ }
+
+ return -1;
+}
+
+__rte_experimental
+static inline ssize_t
+__rte_bitset_find(const uint64_t *bitset, size_t size, size_t start_bit,
+ size_t len, unsigned int flags)
+{
+ bool find_clear = flags & __RTE_BITSET_FIND_FLAG_FIND_CLEAR;
+ bool may_wrap = flags & __RTE_BITSET_FIND_FLAG_WRAP;
+ bool does_wrap = (start_bit + len) > size;
+ ssize_t rc;
+
+ RTE_ASSERT(len <= size);
+ if (!may_wrap)
+ RTE_ASSERT(!does_wrap);
+
+ if (may_wrap && does_wrap) {
+ size_t len0 = size - start_bit;
+ size_t len1 = len - len0;
+
+ rc = __rte_bitset_find_nowrap(bitset, size, start_bit, len0,
+ find_clear);
+ if (rc < 0)
+ rc = __rte_bitset_find_nowrap(bitset, size,
+ 0, len1, find_clear);
+ } else
+ rc = __rte_bitset_find_nowrap(bitset, size, start_bit,
+ len, find_clear);
+
+ return rc;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first bit set.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), and returns the index of the first '1'.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @return
+ * Returns the index of the least significant '1', or -1 if all
+ * bits are '0'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_first_set(const uint64_t *bitset, size_t size)
+{
+ return __rte_bitset_find(bitset, size, 0, size, 0);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first bit set at offset.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), starting at an offset @c start_bit into the
+ * bitset, and returns the index of the first '1' encountered.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The number of bits to scan. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ * @return
+ * Returns the index of the least significant '1', or -1 if all
+ * bits are '0'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_set(const uint64_t *bitset, size_t size,
+ size_t start_bit, size_t len)
+{
+ return __rte_bitset_find(bitset, size, start_bit, len, 0);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first bit set at offset, with wrap-around.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), starting at an offset @c start_bit into the
+ * bitset. If no '1' is encountered before the end of the bitset, the search
+ * will continue at index 0.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The number of bits to scan. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ * @return
+ * Returns the index of the least significant '1', or -1 if all
+ * bits are '0'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_set_wrap(const uint64_t *bitset, size_t size,
+ size_t start_bit, size_t len)
+{
+ return __rte_bitset_find(bitset, size, start_bit, len,
+ __RTE_BITSET_FIND_FLAG_WRAP);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first cleared bit.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), and returns the index of the first '0'.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @return
+ * Returns the index of the least significant '0', or -1 if all
+ * bits are '1'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_first_clear(const uint64_t *bitset, size_t size)
+{
+ return __rte_bitset_find(bitset, size, 0, size,
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first cleared bit at offset.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), starting at an offset @c start_bit into the
+ * bitset, and returns the index of the first '0' encountered.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The number of bits to scan. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ * @return
+ * Returns the index of the least significant '0', or -1 if all
+ * bits are '1'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_clear(const uint64_t *bitset, size_t size,
+ size_t start_bit, size_t len)
+{
+ return __rte_bitset_find(bitset, size, start_bit, len,
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Find first cleared bit at offset, with wrap-around.
+ *
+ * Scans the bitset in the forward direction (i.e., starting at the
+ * least significant bit), starting at an offset @c start_bit into the
+ * bitset. If no '0' is encountered before the end of the bitset, the
+ * search will continue at index 0.
+ *
+ * @param bitset
+ * A pointer to the array of words making up the bitset.
+ * @param size
+ * The size of the bitset (in bits).
+ * @param start_bit
+ * The index of the first bit to check. Must be less than @c size.
+ * @param len
+ * The number of bits to scan. @c start_bit + @c len must be less
+ * than or equal to @c size.
+ * @return
+ * Returns the index of the least significant '0', or -1 if all
+ * bits are '1'.
+ */
+
+__rte_experimental
+static inline ssize_t
+rte_bitset_find_clear_wrap(const uint64_t *bitset, size_t size,
+ size_t start_bit, size_t len)
+{
+ return __rte_bitset_find(bitset, size, start_bit, len,
+ __RTE_BITSET_FIND_FLAG_FIND_CLEAR |
+ __RTE_BITSET_FIND_FLAG_WRAP);
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Copy bitset.
+ *
+ * Copy the bits of the @c src_bitset to the @c dst_bitset.
+ *
+ * The bitsets may not overlap and must be of equal size.
+ *
+ * @param dst_bitset
+ * A pointer to the destination bitset.
+ * @param src_bitset
+ * A pointer to the source bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_copy(uint64_t *__rte_restrict dst_bitset,
+ const uint64_t *__rte_restrict src_bitset,
+ size_t size)
+{
+ rte_memcpy(dst_bitset, src_bitset, RTE_BITSET_SIZE(size));
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Bitwise or two bitsets.
+ *
+ * Perform a bitwise OR operation on all bits in the two equal-size
+ * bitsets @c dst_bitset and @c src_bitset, and store the results in
+ * @c dst_bitset.
+ *
+ * @param dst_bitset
+ * A pointer to the destination bitset.
+ * @param src_bitset
+ * A pointer to the source bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_or(uint64_t *dst_bitset, const uint64_t *src_bitset, size_t size)
+{
+ size_t i;
+
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size); i++)
+ dst_bitset[i] |= src_bitset[i];
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Bitwise and two bitsets.
+ *
+ * Perform a bitwise AND operation on all bits in the two equal-size
+ * bitsets @c dst_bitset and @c src_bitset, and store the results in
+ * @c dst_bitset.
+ *
+ * @param dst_bitset
+ * A pointer to the destination bitset.
+ * @param src_bitset
+ * A pointer to the source bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_and(uint64_t *dst_bitset, const uint64_t *src_bitset, size_t size)
+{
+ size_t i;
+
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size); i++)
+ dst_bitset[i] &= src_bitset[i];
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Bitwise xor two bitsets.
+ *
+ * Perform a bitwise XOR operation on all bits in the two equal-size
+ * bitsets @c dst_bitset and @c src_bitset, and store the results in
+ * @c dst_bitset.
+ *
+ * @param dst_bitset
+ * A pointer to the destination bitset.
+ * @param src_bitset
+ * A pointer to the source bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ */
+
+__rte_experimental
+static inline void
+rte_bitset_xor(uint64_t *__rte_restrict dst_bitset,
+ const uint64_t *__rte_restrict src_bitset, size_t size)
+{
+ size_t i;
+
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size); i++)
+ dst_bitset[i] ^= src_bitset[i];
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Compare two bitsets.
+ *
+ * Compare two bitsets for equality.
+ *
+ * @param bitset_a
+ * A pointer to the first bitset.
+ * @param bitset_b
+ * A pointer to the second bitset.
+ * @param size
+ * The size of the bitsets (in bits).
+ * @return
+ * Returns true if the bitsets are equal, and false otherwise.
+ */
+
+__rte_experimental
+static inline bool
+rte_bitset_equal(const uint64_t *bitset_a, const uint64_t *bitset_b,
+ size_t size)
+{
+ size_t i;
+ uint64_t last_a, last_b;
+
+ for (i = 0; i < RTE_BITSET_NUM_WORDS(size) - 1; i++)
+ if (bitset_a[i] != bitset_b[i])
+ return false;
+
+ last_a = bitset_a[i] << __RTE_BITSET_UNUSED(size);
+ last_b = bitset_b[i] << __RTE_BITSET_UNUSED(size);
+
+ return last_a == last_b;
+}
+
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice.
+ *
+ * Convert a bitset to a string.
+ *
+ * This function writes a string representation of the bitset to
+ * the supplied buffer.
+ *
+ * Each bit is represented either by '0' or '1' in the output. The
+ * resulting string is NUL terminated.
+ *
+ * @param bitset
+ * A pointer to the array of bitset 64-bit words.
+ * @param size
+ * The number of bits the bitset represents.
+ * @param buf
+ * A buffer to hold the output.
+ * @param capacity
+ * The size of the buffer. Must be @c size + 1 or larger.
+ * @return
+ * Returns the number of bytes written (i.e., @c size + 1), or -EINVAL
+ * in case the buffer capacity was too small.
+ */
+
+__rte_experimental
+ssize_t
+rte_bitset_to_str(const uint64_t *bitset, size_t size, char *buf,
+ size_t capacity);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_BITSET_H_ */
diff --git a/lib/eal/version.map b/lib/eal/version.map
index 6d6978f108..9136a71c73 100644
--- a/lib/eal/version.map
+++ b/lib/eal/version.map
@@ -430,6 +430,9 @@ EXPERIMENTAL {
rte_thread_create_control;
rte_thread_set_name;
__rte_eal_trace_generic_blob;
+
+ # added in X.Y
+ rte_bitset_to_str;
};
INTERNAL {
--
2.34.1
* [RFC v2 2/2] eal: add high-performance timer facility
2023-03-15 17:03 ` [RFC v2 " Mattias Rönnblom
2023-03-15 17:03 ` [RFC v2 1/2] eal: add bitset type Mattias Rönnblom
@ 2023-03-15 17:03 ` Mattias Rönnblom
2023-03-16 3:55 ` Tyler Retzlaff
` (4 more replies)
2024-10-03 18:36 ` [RFC v2 0/2] Add " Stephen Hemminger
2 siblings, 5 replies; 31+ messages in thread
From: Mattias Rönnblom @ 2023-03-15 17:03 UTC (permalink / raw)
To: dev
Cc: Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Stephen Hemminger, Morten Brørup,
Tyler Retzlaff, Mattias Rönnblom
The htimer library attempts to provide a timer facility with roughly
the same functionality as, but less overhead and better scalability
than, the DPDK timer library.
The htimer library employs per-lcore hierarchical timer wheels and a
message-based synchronization/MT-safety scheme.
RFC v2:
* Fix spelling.
* Fix signed/unsigned comparisons and discontinue the use of name-less
function parameters, both of which may result in compiler warnings.
* Undo the accidental removal of the bitset tests from the 'fast_tests'.
* Add a number of missing include files, causing build failures
(e.g., on AArch64 builds).
* Add perf test attempting to compare rte_timer, rte_htimer and rte_htw.
* Use nanoseconds (instead of TSC) as the default time unit.
* add() and manage() have flags which allow the caller to specify the
time unit (nanoseconds, TSC, or ticks) for the times provided.
Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
---
app/test/meson.build | 8 +
app/test/test_htimer_mgr.c | 674 +++++++++++++++++++++++++
app/test/test_htimer_mgr_perf.c | 322 ++++++++++++
app/test/test_htw.c | 478 ++++++++++++++++++
app/test/test_htw_perf.c | 181 +++++++
app/test/test_timer_htimer_htw_perf.c | 693 ++++++++++++++++++++++++++
doc/api/doxy-api-index.md | 5 +-
doc/api/doxy-api.conf.in | 1 +
lib/htimer/meson.build | 7 +
lib/htimer/rte_htimer.h | 68 +++
lib/htimer/rte_htimer_mgr.c | 547 ++++++++++++++++++++
lib/htimer/rte_htimer_mgr.h | 516 +++++++++++++++++++
lib/htimer/rte_htimer_msg.h | 44 ++
lib/htimer/rte_htimer_msg_ring.c | 18 +
lib/htimer/rte_htimer_msg_ring.h | 55 ++
lib/htimer/rte_htw.c | 445 +++++++++++++++++
lib/htimer/rte_htw.h | 49 ++
lib/htimer/version.map | 17 +
lib/meson.build | 1 +
19 files changed, 4128 insertions(+), 1 deletion(-)
create mode 100644 app/test/test_htimer_mgr.c
create mode 100644 app/test/test_htimer_mgr_perf.c
create mode 100644 app/test/test_htw.c
create mode 100644 app/test/test_htw_perf.c
create mode 100644 app/test/test_timer_htimer_htw_perf.c
create mode 100644 lib/htimer/meson.build
create mode 100644 lib/htimer/rte_htimer.h
create mode 100644 lib/htimer/rte_htimer_mgr.c
create mode 100644 lib/htimer/rte_htimer_mgr.h
create mode 100644 lib/htimer/rte_htimer_msg.h
create mode 100644 lib/htimer/rte_htimer_msg_ring.c
create mode 100644 lib/htimer/rte_htimer_msg_ring.h
create mode 100644 lib/htimer/rte_htw.c
create mode 100644 lib/htimer/rte_htw.h
create mode 100644 lib/htimer/version.map
diff --git a/app/test/meson.build b/app/test/meson.build
index 03811ff692..d0308ac09d 100644
--- a/app/test/meson.build
+++ b/app/test/meson.build
@@ -140,9 +140,14 @@ test_sources = files(
'test_thash_perf.c',
'test_threads.c',
'test_timer.c',
+ 'test_timer_htimer_htw_perf.c',
'test_timer_perf.c',
'test_timer_racecond.c',
'test_timer_secondary.c',
+ 'test_htimer_mgr.c',
+ 'test_htimer_mgr_perf.c',
+ 'test_htw.c',
+ 'test_htw_perf.c',
'test_ticketlock.c',
'test_trace.c',
'test_trace_register.c',
@@ -193,6 +198,7 @@ fast_tests = [
['fib6_autotest', true, true],
['func_reentrancy_autotest', false, true],
['hash_autotest', true, true],
+ ['htimer_mgr_autotest', true, true],
['interrupt_autotest', true, true],
['ipfrag_autotest', false, true],
['lcores_autotest', true, true],
@@ -265,6 +271,8 @@ perf_test_names = [
'memcpy_perf_autotest',
'hash_perf_autotest',
'timer_perf_autotest',
+ 'htimer_mgr_perf_autotest',
+ 'htw_perf_autotest',
'reciprocal_division',
'reciprocal_division_perf',
'lpm_perf_autotest',
diff --git a/app/test/test_htimer_mgr.c b/app/test/test_htimer_mgr.c
new file mode 100644
index 0000000000..9e46dec53e
--- /dev/null
+++ b/app/test/test_htimer_mgr.c
@@ -0,0 +1,674 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include "test.h"
+
+#include <sys/queue.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_htimer_mgr.h>
+#include <rte_launch.h>
+#include <rte_lcore.h>
+#include <rte_random.h>
+
+static int
+timer_lcore(void *arg)
+{
+ bool *stop = arg;
+
+ while (!__atomic_load_n(stop, __ATOMIC_RELAXED))
+ rte_htimer_mgr_manage();
+
+ return 0;
+}
+
+static void
+count_timer_cb(struct rte_htimer *timer __rte_unused, void *arg)
+{
+ unsigned int *count = arg;
+
+ __atomic_fetch_add(count, 1, __ATOMIC_RELAXED);
+}
+
+static void
+count_async_cb(struct rte_htimer *timer __rte_unused, int result,
+ void *cb_arg)
+{
+ unsigned int *count = cb_arg;
+
+ if (result == RTE_HTIMER_MGR_ASYNC_RESULT_ADDED)
+ __atomic_fetch_add(count, 1, __ATOMIC_RELAXED);
+}
+
+static uint64_t
+s_to_tsc(double s)
+{
+ return s * rte_get_tsc_hz();
+}
+
+#define ASYNC_ADD_TEST_EXPIRATION_TIME (250*1000) /* ns */
+#define ASYNC_TEST_TICK (10*1000) /* ns */
+
+static int
+test_htimer_mgr_async_add(unsigned int num_timers_per_lcore)
+{
+ struct rte_htimer *timers;
+ unsigned int timer_idx;
+ unsigned int lcore_id;
+ bool stop = false;
+ unsigned int timeout_count = 0;
+ unsigned int async_count = 0;
+ unsigned int num_workers = 0;
+ uint64_t expiration_time;
+ unsigned int num_total_timers;
+
+ rte_htimer_mgr_init(ASYNC_TEST_TICK);
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ if (rte_eal_remote_launch(timer_lcore, &stop, lcore_id) != 0)
+ rte_panic("Unable to launch timer lcore\n");
+ num_workers++;
+ }
+
+ num_total_timers = num_workers * num_timers_per_lcore;
+
+ timers = malloc(num_total_timers * sizeof(struct rte_htimer));
+ timer_idx = 0;
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate heap memory\n");
+
+ expiration_time = ASYNC_ADD_TEST_EXPIRATION_TIME;
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ unsigned int i;
+
+ for (i = 0; i < num_timers_per_lcore; i++) {
+ struct rte_htimer *timer = &timers[timer_idx++];
+
+ for (;;) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_add(timer, lcore_id,
+ expiration_time,
+ RTE_HTIMER_FLAG_TIME_TSC,
+ count_timer_cb,
+ &timeout_count, 0,
+ count_async_cb,
+ &async_count);
+ if (unlikely(rc == -EBUSY))
+ rte_htimer_mgr_process();
+ else
+ break;
+ }
+ }
+ }
+
+ while (__atomic_load_n(&async_count, __ATOMIC_RELAXED) !=
+ num_total_timers ||
+ __atomic_load_n(&timeout_count, __ATOMIC_RELAXED) !=
+ num_total_timers)
+ rte_htimer_mgr_manage();
+
+ __atomic_store_n(&stop, true, __ATOMIC_RELAXED);
+
+ rte_eal_mp_wait_lcore();
+
+ rte_htimer_mgr_deinit();
+
+ free(timers);
+
+ return TEST_SUCCESS;
+}
+
+struct async_recorder_state {
+ bool timer_cb_run;
+ bool async_add_cb_run;
+ bool async_cancel_cb_run;
+ bool failed;
+};
+
+static void
+record_async_add_cb(struct rte_htimer *timer __rte_unused,
+ int result, void *cb_arg)
+{
+ struct async_recorder_state *state = cb_arg;
+
+ if (state->failed)
+ return;
+
+ if (state->async_add_cb_run ||
+ result != RTE_HTIMER_MGR_ASYNC_RESULT_ADDED) {
+ puts("async add callback already run");
+ state->failed = true;
+ }
+
+ state->async_add_cb_run = true;
+}
+
+static void
+record_async_cancel_cb(struct rte_htimer *timer __rte_unused,
+ int result, void *cb_arg)
+{
+ struct async_recorder_state *state = cb_arg;
+
+ if (state->failed)
+ return;
+
+ if (state->async_cancel_cb_run) {
+ state->failed = true;
+ return;
+ }
+
+ switch (result) {
+ case RTE_HTIMER_MGR_ASYNC_RESULT_EXPIRED:
+ if (!state->timer_cb_run)
+ state->failed = true;
+ break;
+ case RTE_HTIMER_MGR_ASYNC_RESULT_CANCELED:
+ if (state->timer_cb_run)
+ state->failed = true;
+ break;
+ case RTE_HTIMER_MGR_ASYNC_RESULT_ALREADY_CANCELED:
+ state->failed = true;
+ }
+
+ state->async_cancel_cb_run = true;
+}
+
+static int
+record_check_consistency(struct async_recorder_state *state)
+{
+ if (state->failed)
+ return -1;
+
+ return state->async_cancel_cb_run ? 1 : 0;
+}
+
+static int
+records_check_consistency(struct async_recorder_state *states,
+ unsigned int num_states)
+{
+ unsigned int i;
+ int canceled = 0;
+
+ for (i = 0; i < num_states; i++) {
+ int rc;
+
+ rc = record_check_consistency(&states[i]);
+
+ if (rc < 0)
+ return -1;
+ canceled += rc;
+ }
+
+ return canceled;
+}
+
+static void
+log_timer_expiry_cb(struct rte_htimer *timer __rte_unused,
+ void *arg)
+{
+ bool *timer_run = arg;
+
+ *timer_run = true;
+}
+
+
+#define ASYNC_ADD_CANCEL_TEST_EXPIRATION_TIME_MAX 10e-3 /* s */
+
+static int
+test_htimer_mgr_async_add_cancel(unsigned int num_timers_per_lcore)
+{
+ struct rte_htimer *timers;
+ struct async_recorder_state *recorder_states;
+ unsigned int timer_idx = 0;
+ unsigned int lcore_id;
+ uint64_t now;
+ unsigned int num_workers = 0;
+ bool stop = false;
+ uint64_t max_expiration_time =
+ s_to_tsc(ASYNC_ADD_CANCEL_TEST_EXPIRATION_TIME_MAX);
+ unsigned int num_total_timers;
+ int canceled = 0;
+
+ rte_htimer_mgr_init(ASYNC_TEST_TICK);
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ if (rte_eal_remote_launch(timer_lcore, &stop, lcore_id) != 0)
+ rte_panic("Unable to launch timer lcore\n");
+ num_workers++;
+ }
+
+ num_total_timers = num_workers * num_timers_per_lcore;
+
+ timers = malloc(num_total_timers * sizeof(struct rte_htimer));
+ recorder_states =
+ malloc(num_total_timers * sizeof(struct async_recorder_state));
+
+ if (timers == NULL || recorder_states == NULL)
+ rte_panic("Unable to allocate heap memory\n");
+
+ now = rte_get_tsc_cycles();
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ unsigned int i;
+
+ for (i = 0; i < num_timers_per_lcore; i++) {
+ struct rte_htimer *timer = &timers[timer_idx];
+ struct async_recorder_state *state =
+ &recorder_states[timer_idx];
+
+ timer_idx++;
+
+ *state = (struct async_recorder_state) {};
+
+ uint64_t expiration_time =
+ now + rte_rand_max(max_expiration_time);
+
+ for (;;) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_add(timer, lcore_id,
+ expiration_time,
+ 0,
+ log_timer_expiry_cb,
+ &state->timer_cb_run,
+ 0,
+ record_async_add_cb,
+ state);
+
+ if (unlikely(rc == -EBUSY))
+ rte_htimer_mgr_process();
+ else
+ break;
+ }
+ }
+ }
+
+ timer_idx = 0;
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ unsigned int i;
+
+ for (i = 0; i < num_timers_per_lcore; i++) {
+ struct rte_htimer *timer = &timers[timer_idx];
+ struct async_recorder_state *state =
+ &recorder_states[timer_idx];
+
+ timer_idx++;
+
+ /* cancel roughly half of the timers */
+ if (rte_rand_max(2) == 0)
+ continue;
+
+ for (;;) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_cancel(timer,
+ record_async_cancel_cb,
+ state);
+
+ if (unlikely(rc == -EBUSY)) {
+ puts("busy");
+ rte_htimer_mgr_process();
+ } else
+ break;
+ }
+
+ canceled++;
+ }
+ }
+
+ for (;;) {
+ int cancel_completed;
+
+ cancel_completed = records_check_consistency(recorder_states,
+ num_total_timers);
+
+ if (cancel_completed < 0) {
+ puts("Inconsistency found");
+ return TEST_FAILED;
+ }
+
+ if (cancel_completed == canceled)
+ break;
+
+ rte_htimer_mgr_process();
+ }
+
+ __atomic_store_n(&stop, true, __ATOMIC_RELAXED);
+
+ rte_eal_mp_wait_lcore();
+
+ rte_htimer_mgr_deinit();
+
+ free(timers);
+ free(recorder_states);
+
+ return TEST_SUCCESS;
+}
+
+/*
+ * This is a test case where one thread asynchronously adds two timers,
+ * with the same expiration time; one on the local lcore and one on a
+ * remote lcore. This creates a tricky situation for the timer
+ * manager, and for the application as well, if the htimer struct is
+ * dynamically allocated.
+ */
+
+struct test_timer {
+ uint32_t ref_cnt;
+ uint64_t expiration_time; /* in TSC, not tick */
+ uint32_t *timeout_count;
+ bool *failure_occurred;
+ struct rte_htimer htimer;
+};
+
+
+static struct test_timer *
+test_timer_create(uint64_t expiration_time, uint32_t *timeout_count,
+ bool *failure_occurred)
+{
+ struct test_timer *timer;
+
+ timer = malloc(sizeof(struct test_timer));
+
+ if (timer == NULL)
+ rte_panic("Unable to allocate timer memory\n");
+
+ timer->ref_cnt = 1;
+ timer->expiration_time = expiration_time;
+ timer->timeout_count = timeout_count;
+ timer->failure_occurred = failure_occurred;
+
+ return timer;
+}
+
+static void
+test_timer_inc_ref_cnt(struct test_timer *timer)
+{
+ __atomic_add_fetch(&timer->ref_cnt, 1, __ATOMIC_RELEASE);
+}
+
+static void
+test_timer_dec_ref_cnt(struct test_timer *timer)
+{
+ if (timer != NULL) {
+ uint32_t cnt = __atomic_sub_fetch(&timer->ref_cnt, 1,
+ __ATOMIC_RELEASE);
+ if (cnt == 0)
+ free(timer);
+ }
+}
+
+static void
+test_timer_cb(struct rte_htimer *timer, void *arg __rte_unused)
+{
+ struct test_timer *test_timer =
+ container_of(timer, struct test_timer, htimer);
+ uint64_t now = rte_get_tsc_cycles();
+
+ if (now < test_timer->expiration_time)
+ *(test_timer->failure_occurred) = true;
+
+ __atomic_fetch_add(test_timer->timeout_count, 1, __ATOMIC_RELAXED);
+
+ test_timer_dec_ref_cnt(test_timer);
+}
+
+static int
+worker_lcore(void *arg)
+{
+ bool *stop = arg;
+
+ while (!__atomic_load_n(stop, __ATOMIC_RELAXED))
+ rte_htimer_mgr_manage();
+
+ return 0;
+}
+
+struct cancel_timer {
+ bool cancel;
+ struct rte_htimer *target_timer;
+ uint32_t *cancel_count;
+ uint32_t *expired_count;
+ bool *failure_occurred;
+ struct rte_htimer htimer;
+};
+
+static struct cancel_timer *
+cancel_timer_create(bool cancel, struct rte_htimer *target_timer,
+ uint32_t *cancel_count, uint32_t *expired_count,
+ bool *failure_occurred)
+{
+ struct cancel_timer *timer;
+
+ timer = malloc(sizeof(struct cancel_timer));
+
+ if (timer == NULL)
+ rte_panic("Unable to allocate timer memory\n");
+
+ timer->cancel = cancel;
+ timer->target_timer = target_timer;
+ timer->cancel_count = cancel_count;
+ timer->expired_count = expired_count;
+ timer->failure_occurred = failure_occurred;
+
+ return timer;
+}
+
+static void
+async_cancel_cb(struct rte_htimer *timer, int result, void *cb_arg)
+{
+ struct test_timer *test_timer =
+ container_of(timer, struct test_timer, htimer);
+ struct cancel_timer *cancel_timer = cb_arg;
+ bool *failure_occurred = cancel_timer->failure_occurred;
+
+ if (!cancel_timer->cancel || cancel_timer->target_timer != timer)
+ *failure_occurred = true;
+
+ if (result == RTE_HTIMER_MGR_ASYNC_RESULT_CANCELED) {
+ uint32_t *cancel_count = cancel_timer->cancel_count;
+
+ /* decrease target lcore's ref count */
+ test_timer_dec_ref_cnt(test_timer);
+ (*cancel_count)++;
+ } else if (result == RTE_HTIMER_MGR_ASYNC_RESULT_EXPIRED) {
+ uint32_t *expired_count = cancel_timer->expired_count;
+
+ (*expired_count)++;
+ } else
+ *failure_occurred = true;
+
+ /* source lcore's ref count */
+ test_timer_dec_ref_cnt(test_timer);
+
+ free(cancel_timer);
+}
+
+static void
+cancel_timer_cb(struct rte_htimer *timer, void *arg __rte_unused)
+{
+ struct cancel_timer *cancel_timer =
+ container_of(timer, struct cancel_timer, htimer);
+
+ if (cancel_timer->cancel) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_cancel(cancel_timer->target_timer,
+ async_cancel_cb, cancel_timer);
+
+ if (rc == -EBUSY)
+ rte_htimer_mgr_add(timer, 0, 0, cancel_timer_cb,
+ NULL, 0);
+ } else
+ free(cancel_timer);
+}
+
+#define REF_CNT_TEST_TICK 10 /* ns */
+#define REF_CNT_AVG_EXPIRATION_TIME (50 * 1000) /* ns */
+#define REF_CNT_MAX_EXPIRATION_TIME (2 * REF_CNT_AVG_EXPIRATION_TIME)
+#define REF_CNT_CANCEL_FUZZ(expiration_time) \
+ ((uint64_t)((expiration_time) * (rte_drand()/10 + 0.95)))
+
+static int
+test_htimer_mgr_ref_cnt_timers(unsigned int num_timers_per_lcore)
+{
+ unsigned int lcore_id;
+ bool stop = false;
+ unsigned int num_workers = 0;
+ struct test_timer **timers;
+ struct cancel_timer **cancel_timers;
+ unsigned int num_timers;
+ uint32_t timeout_count = 0;
+ uint32_t cancel_count = 0;
+ uint32_t expired_count = 0;
+ bool failure_occurred = false;
+ unsigned int timer_idx;
+ unsigned int expected_cancel_attempts;
+ uint64_t deadline;
+ uint64_t now;
+
+ rte_htimer_mgr_init(REF_CNT_TEST_TICK);
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ if (rte_eal_remote_launch(worker_lcore, &stop, lcore_id) != 0)
+ rte_panic("Unable to launch timer lcore\n");
+ num_workers++;
+ }
+
+ /* give the workers a chance to get going */
+ rte_delay_us_block(10*1000);
+
+ num_timers = num_timers_per_lcore * num_workers;
+
+ timers = malloc(sizeof(struct test_timer *) * num_timers);
+ cancel_timers = malloc(sizeof(struct cancel_timer *) * num_timers);
+
+ if (timers == NULL || cancel_timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ timer_idx = 0;
+ expected_cancel_attempts = 0;
+
+ RTE_LCORE_FOREACH_WORKER(lcore_id) {
+ unsigned int i;
+
+ for (i = 0; i < num_timers_per_lcore; i++) {
+ uint64_t expiration_time;
+ struct test_timer *timer;
+ struct rte_htimer *htimer;
+ bool cancel;
+ struct cancel_timer *cancel_timer;
+ uint64_t cancel_expiration_time;
+
+ expiration_time =
+ REF_CNT_MAX_EXPIRATION_TIME * rte_drand();
+
+ timer = test_timer_create(expiration_time,
+ &timeout_count,
+ &failure_occurred);
+ htimer = &timer->htimer;
+
+ timers[timer_idx++] = timer;
+
+			/* for the target lcore's usage of this timer */
+ test_timer_inc_ref_cnt(timer);
+
+ for (;;) {
+ int rc;
+
+ rc = rte_htimer_mgr_async_add(htimer, lcore_id,
+ expiration_time,
+ 0, test_timer_cb,
+ NULL, 0, NULL,
+ NULL);
+ if (unlikely(rc == -EBUSY))
+ rte_htimer_mgr_process();
+ else
+ break;
+ }
+
+ cancel = rte_rand_max(2);
+
+ cancel_timer =
+ cancel_timer_create(cancel, &timer->htimer,
+ &cancel_count,
+ &expired_count,
+ &failure_occurred);
+
+ cancel_expiration_time =
+ REF_CNT_CANCEL_FUZZ(expiration_time);
+
+ rte_htimer_mgr_add(&cancel_timer->htimer,
+ cancel_expiration_time, 0,
+ cancel_timer_cb, NULL, 0);
+
+ if (cancel)
+ expected_cancel_attempts++;
+ }
+ }
+
+ deadline = rte_get_tsc_cycles() + REF_CNT_MAX_EXPIRATION_TIME +
+ s_to_tsc(0.25);
+
+ do {
+ now = rte_get_tsc_cycles();
+
+ rte_htimer_mgr_manage_time(now, RTE_HTIMER_FLAG_TIME_TSC);
+
+ } while (now < deadline);
+
+ __atomic_store_n(&stop, true, __ATOMIC_RELAXED);
+
+ rte_eal_mp_wait_lcore();
+
+ if (failure_occurred)
+ return TEST_FAILED;
+
+ if ((cancel_count + expired_count) != expected_cancel_attempts)
+ return TEST_FAILED;
+
+ if (timeout_count != (num_timers - cancel_count))
+ return TEST_FAILED;
+
+ rte_htimer_mgr_deinit();
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_htimer_mgr(void)
+{
+ int rc;
+
+ rc = test_htimer_mgr_async_add(1);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ rc = test_htimer_mgr_async_add(100000);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ rc = test_htimer_mgr_async_add_cancel(100);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ rc = test_htimer_mgr_ref_cnt_timers(10);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ rc = test_htimer_mgr_ref_cnt_timers(10000);
+ if (rc != TEST_SUCCESS)
+ return rc;
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(htimer_mgr_autotest, test_htimer_mgr);
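As an aside for reviewers: the ref-count test above coordinates its worker lcores through a plain bool polled and set with relaxed atomics (__atomic_load_n/__atomic_store_n). The same stop-flag pattern, reduced to portable C11 atomics and pthreads instead of the EAL launch API, can be sketched as follows (all names here are illustrative, not part of the patch):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool stop_flag;
static unsigned long loops;

/* the worker polls the flag with relaxed ordering, mirroring how the
 * test's worker lcores poll their stop bool between manage calls */
static void *worker(void *arg)
{
	(void)arg;

	while (!atomic_load_explicit(&stop_flag, memory_order_relaxed))
		loops++; /* stand-in for the per-iteration timer work */

	return NULL;
}

/* launch the worker, signal stop, and join; returns 0 on success */
static int run_stop_flag_demo(void)
{
	pthread_t t;

	if (pthread_create(&t, NULL, worker, NULL) != 0)
		return -1;

	atomic_store_explicit(&stop_flag, true, memory_order_relaxed);

	if (pthread_join(t, NULL) != 0)
		return -1;

	return 0;
}
```

Relaxed ordering is sufficient here because the flag carries no data dependency; the join provides the final synchronization, just as rte_eal_mp_wait_lcore() does in the test.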
diff --git a/app/test/test_htimer_mgr_perf.c b/app/test/test_htimer_mgr_perf.c
new file mode 100644
index 0000000000..cdc513228f
--- /dev/null
+++ b/app/test/test_htimer_mgr_perf.c
@@ -0,0 +1,322 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include "test.h"
+
+#include <sys/queue.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_cycles.h>
+#include <rte_htimer_mgr.h>
+#include <rte_launch.h>
+#include <rte_lcore.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+static void
+nop_cb(struct rte_htimer *timer __rte_unused, void *cb_arg __rte_unused)
+{
+}
+
+static uint64_t
+add_rand_timers(struct rte_htimer *timers, uint64_t num,
+ uint64_t timeout_start, uint64_t max_timeout)
+{
+ uint64_t i;
+ uint64_t expiration_times[num];
+ uint64_t start_ts;
+ uint64_t end_ts;
+
+ for (i = 0; i < num; i++)
+ expiration_times[i] =
+ 1 + timeout_start + rte_rand_max(max_timeout - 1);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (i = 0; i < num; i++)
+ rte_htimer_mgr_add(&timers[i], expiration_times[i], 0, nop_cb,
+ NULL, RTE_HTIMER_FLAG_ABSOLUTE_TIME);
+
+ /* make sure the timers are actually scheduled in the wheel */
+ rte_htimer_mgr_process();
+
+ end_ts = rte_get_tsc_cycles();
+
+ return end_ts - start_ts;
+}
+
+#define TIME_STEP 16
+
+static void
+test_add_manage_perf(const char *scenario_name, uint64_t num_timers,
+ uint64_t timespan)
+{
+ uint64_t manage_calls;
+ struct rte_htimer *timers;
+ uint64_t start;
+ uint64_t now;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ uint64_t add_latency;
+ uint64_t manage_latency;
+
+ rte_htimer_mgr_init(1);
+
+ manage_calls = timespan / TIME_STEP;
+
+ printf("Scenario: %s\n", scenario_name);
+ printf(" Configuration:\n");
+ printf(" Timers: %"PRIu64"\n", num_timers);
+ printf(" Max timeout: %"PRIu64" ticks\n", timespan);
+ printf(" Average timeouts/manage call: %.3f\n",
+ num_timers / (double)manage_calls);
+ printf(" Time advance per manage call: %d\n", TIME_STEP);
+
+ printf(" Results:\n");
+
+ timers = rte_malloc(NULL, sizeof(struct rte_htimer) * num_timers, 0);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ start = 1 + rte_rand_max(UINT64_MAX / 2);
+
+ rte_htimer_mgr_manage_time(start - 1, 0);
+
+ add_latency = add_rand_timers(timers, num_timers, start, timespan);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (now = start; now < (start + timespan); now += TIME_STEP)
+ rte_htimer_mgr_manage_time(now, 0);
+
+ end_ts = rte_get_tsc_cycles();
+
+ manage_latency = end_ts - start_ts;
+
+ printf(" %.0f TSC cycles / add op\n",
+ (double)add_latency / num_timers);
+ printf(" %.0f TSC cycles / manage call\n",
+ (double)manage_latency / manage_calls);
+ printf(" %.1f TSC cycles / tick\n",
+ (double)manage_latency / timespan);
+
+ rte_htimer_mgr_deinit();
+
+ rte_free(timers);
+}
+
+static uint64_t
+s_to_tsc(double s)
+{
+ return s * rte_get_tsc_hz();
+}
+
+static double
+tsc_to_s(uint64_t tsc)
+{
+ return (double)tsc / (double)rte_get_tsc_hz();
+}
+
+#define ITERATIONS 500
+
+static int
+test_del_perf(uint64_t num_timers, uint64_t timespan)
+{
+ struct rte_htimer *timers;
+ uint64_t start;
+ uint64_t i, j;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ uint64_t latency = 0;
+
+ rte_htimer_mgr_init(1);
+
+ timers =
+ rte_malloc(NULL, sizeof(struct rte_htimer) * num_timers, 0);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ start = 1 + rte_rand_max(UINT64_MAX / 2);
+
+ for (i = 0; i < ITERATIONS; i++) {
+ rte_htimer_mgr_manage_time(start - 1, 0);
+
+ add_rand_timers(timers, num_timers, start, timespan);
+
+ /* A manage (or process) call is required to get all
+ * timers scheduled, which may in turn make them a
+ * little more expensive to remove.
+ */
+ rte_htimer_mgr_manage_time(start, 0);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (j = 0; j < num_timers; j++)
+ if (rte_htimer_mgr_cancel(&timers[j]) < 0)
+ return TEST_FAILED;
+
+ end_ts = rte_get_tsc_cycles();
+
+ latency += (end_ts - start_ts);
+
+ start += (timespan + 1);
+ }
+
+ printf("Timer delete: %.0f TSC cycles / call\n",
+ (double)latency / (double)ITERATIONS / (double)num_timers);
+
+ rte_htimer_mgr_deinit();
+
+ rte_free(timers);
+
+ return TEST_SUCCESS;
+}
+
+static int
+target_lcore(void *arg)
+{
+ bool *stop = arg;
+
+ while (!__atomic_load_n(stop, __ATOMIC_RELAXED))
+ rte_htimer_mgr_manage();
+
+ return 0;
+}
+
+static void
+count_async_cb(struct rte_htimer *timer __rte_unused, int result,
+ void *cb_arg)
+{
+ unsigned int *count = cb_arg;
+
+ if (result == RTE_HTIMER_MGR_ASYNC_RESULT_ADDED)
+ (*count)++;
+}
+
+#define ASYNC_ADD_TEST_TICK s_to_tsc(500e-9)
+/*
+ * The number of test timers must be kept smaller than the size of the
+ * htimer-internal message ring for this test case to work.
+ */
+#define ASYNC_ADD_TEST_NUM_TIMERS 1000
+#define ASYNC_ADD_TEST_MIN_TIMEOUT (ASYNC_ADD_TEST_NUM_TIMERS * s_to_tsc(1e-6))
+#define ASYNC_ADD_TEST_MAX_TIMEOUT (2 * ASYNC_ADD_TEST_MIN_TIMEOUT)
+
+static void
+test_async_add_perf(void)
+{
+ uint64_t max_timeout = ASYNC_ADD_TEST_MAX_TIMEOUT;
+ uint64_t min_timeout = ASYNC_ADD_TEST_MIN_TIMEOUT;
+ unsigned int num_timers = ASYNC_ADD_TEST_NUM_TIMERS;
+ struct rte_htimer *timers;
+ bool *stop;
+ unsigned int lcore_id = rte_lcore_id();
+ unsigned int target_lcore_id =
+ rte_get_next_lcore(lcore_id, true, true);
+ uint64_t now;
+ uint64_t request_latency = 0;
+ uint64_t response_latency = 0;
+ unsigned int i;
+
+ rte_htimer_mgr_init(ASYNC_ADD_TEST_TICK);
+
+ timers = rte_malloc(NULL, sizeof(struct rte_htimer) * num_timers,
+ RTE_CACHE_LINE_SIZE);
+ stop = rte_malloc(NULL, sizeof(bool), RTE_CACHE_LINE_SIZE);
+
+ if (timers == NULL || stop == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ *stop = false;
+
+ if (rte_eal_remote_launch(target_lcore, stop, target_lcore_id) != 0)
+ rte_panic("Unable to launch worker lcore\n");
+
+ /* wait for launch to complete */
+ rte_delay_us_block(100);
+
+ for (i = 0; i < ITERATIONS; i++) {
+ uint64_t expiration_times[num_timers];
+ unsigned int j;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ unsigned int count = 0;
+
+ now = rte_get_tsc_cycles();
+
+ for (j = 0; j < num_timers; j++)
+ expiration_times[j] = now + min_timeout +
+ rte_rand_max(max_timeout - min_timeout);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (j = 0; j < num_timers; j++)
+ rte_htimer_mgr_async_add(&timers[j], target_lcore_id,
+ expiration_times[j], 0,
+ nop_cb, NULL,
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME,
+ count_async_cb, &count);
+
+ end_ts = rte_get_tsc_cycles();
+
+ request_latency += (end_ts - start_ts);
+
+		/* wait long enough for the target lcore to answer */
+ rte_delay_us_block(1 * num_timers);
+
+ start_ts = rte_get_tsc_cycles();
+
+ while (count != num_timers)
+ rte_htimer_mgr_process();
+
+ end_ts = rte_get_tsc_cycles();
+
+ response_latency += (end_ts - start_ts);
+
+ /* wait until all timeouts have fired */
+ rte_delay_us_block(tsc_to_s(max_timeout) * 1e6);
+ }
+
+ __atomic_store_n(stop, true, __ATOMIC_RELAXED);
+
+ rte_eal_mp_wait_lcore();
+
+ rte_free(timers);
+
+ rte_htimer_mgr_deinit();
+
+ printf("Timer async add:\n");
+ printf(" Configuration:\n");
+ printf(" Timers: %d\n", ASYNC_ADD_TEST_NUM_TIMERS);
+ printf(" Results:\n");
+ printf(" Source lcore cost: %.0f TSC cycles / add request\n",
+ (double)request_latency / (double)ITERATIONS / num_timers);
+ printf(" %.0f TSC cycles / add response\n",
+ (double)response_latency / (double)ITERATIONS / num_timers);
+}
+
+static int
+test_htimer_mgr_perf(void)
+{
+ /* warm up */
+ rte_delay_us_block(10000);
+
+ test_add_manage_perf("Sparse", 100000, 10000000);
+
+ test_add_manage_perf("Dense", 100000, 200000);
+
+ test_add_manage_perf("Idle", 10, 100000);
+
+ if (test_del_perf(100000, 100000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ test_async_add_perf();
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(htimer_mgr_perf_autotest, test_htimer_mgr_perf);
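The perf test converts between wall-clock seconds and TSC cycles through its s_to_tsc()/tsc_to_s() helpers. Isolated from DPDK, with an assumed fixed counter frequency standing in for rte_get_tsc_hz(), the conversion is simply (demo names, not the test's own):

```c
#include <stdint.h>

/* an assumed, fixed counter frequency stands in for rte_get_tsc_hz() */
static const uint64_t tsc_hz = 2000000000ull; /* 2 GHz */

/* seconds -> cycles, as in the test's s_to_tsc() */
static uint64_t demo_s_to_tsc(double s)
{
	return (uint64_t)(s * (double)tsc_hz);
}

/* cycles -> seconds, as in the test's tsc_to_s() */
static double demo_tsc_to_s(uint64_t tsc)
{
	return (double)tsc / (double)tsc_hz;
}
```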
diff --git a/app/test/test_htw.c b/app/test/test_htw.c
new file mode 100644
index 0000000000..3cddfaed7f
--- /dev/null
+++ b/app/test/test_htw.c
@@ -0,0 +1,478 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include "test.h"
+
+#include <sys/queue.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_cycles.h>
+#include <rte_htw.h>
+#include <rte_random.h>
+
+struct recorder {
+ struct rte_htimer_list timeout_list;
+ uint64_t num_timeouts;
+};
+
+static void
+recorder_init(struct recorder *recorder)
+{
+ recorder->num_timeouts = 0;
+ LIST_INIT(&recorder->timeout_list);
+}
+
+static void
+recorder_cb(struct rte_htimer *timer, void *arg)
+{
+ struct recorder *recorder = arg;
+
+ recorder->num_timeouts++;
+
+ LIST_INSERT_HEAD(&recorder->timeout_list, timer, entry);
+}
+
+static int
+recorder_verify(struct recorder *recorder, uint64_t min_expiry,
+ uint64_t max_expiry)
+{
+ struct rte_htimer *timer;
+
+ LIST_FOREACH(timer, &recorder->timeout_list, entry) {
+ if (timer->expiration_time > max_expiry)
+ return TEST_FAILED;
+
+ if (timer->expiration_time < min_expiry)
+ return TEST_FAILED;
+ }
+
+ return TEST_SUCCESS;
+}
+
+static void
+add_rand_timers(struct rte_htw *htw, struct rte_htimer *timers,
+ uint64_t num, uint64_t timeout_start, uint64_t max_timeout,
+ rte_htimer_cb_t cb, void *cb_arg)
+{
+ uint64_t i;
+
+ for (i = 0; i < num; i++) {
+ struct rte_htimer *timer = &timers[i];
+ bool use_absolute = rte_rand() & 1;
+ unsigned int flags = 0;
+ uint64_t expiration_time;
+
+ expiration_time = timeout_start + rte_rand_max(max_timeout);
+
+ if (use_absolute)
+ flags |= RTE_HTIMER_FLAG_ABSOLUTE_TIME;
+ else {
+ uint64_t htw_current_time;
+
+ htw_current_time = rte_htw_current_time(htw);
+
+ if (expiration_time < htw_current_time)
+ expiration_time = 0;
+ else
+ expiration_time -= htw_current_time;
+ }
+
+ rte_htw_add(htw, timer, expiration_time, 0, cb, cb_arg, flags);
+ }
+}
+
+#define ADVANCE_TIME_MAX_STEP 16
+
+static int
+test_rand_timers(uint64_t in_flight_timers, uint64_t max_timeout,
+ uint64_t runtime)
+{
+ struct recorder recorder;
+ struct rte_htimer *timers;
+ uint64_t fired = 0;
+ uint64_t start;
+ uint64_t now;
+ struct rte_htw *htw;
+ uint64_t added;
+
+ recorder_init(&recorder);
+
+ timers = malloc(sizeof(struct rte_htimer) * in_flight_timers);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate heap memory\n");
+
+ start = rte_rand_max(UINT64_MAX - max_timeout);
+
+ htw = rte_htw_create();
+
+ if (htw == NULL)
+ return TEST_FAILED;
+
+ added = in_flight_timers;
+ add_rand_timers(htw, timers, added, start + 1, max_timeout,
+ recorder_cb, &recorder);
+
+ for (now = start; now < (start + runtime); ) {
+ uint64_t advance;
+
+ advance = rte_rand_max(ADVANCE_TIME_MAX_STEP);
+
+ now += advance;
+
+ rte_htw_manage(htw, now);
+
+ if (recorder.num_timeouts > 0) {
+ struct rte_htimer *timer;
+
+ if (advance == 0)
+ return TEST_FAILED;
+
+ if (recorder_verify(&recorder, now - advance + 1, now)
+ != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ while ((timer = LIST_FIRST(&recorder.timeout_list))
+ != NULL) {
+ LIST_REMOVE(timer, entry);
+
+ add_rand_timers(htw, timer, 1,
+ now + 1, max_timeout,
+ recorder_cb, &recorder);
+ added++;
+ fired++;
+ }
+
+ recorder.num_timeouts = 0;
+ }
+ }
+
+ /* finish the remaining timeouts */
+
+ rte_htw_manage(htw, now + max_timeout);
+
+ if (recorder_verify(&recorder, now, now + max_timeout) != TEST_SUCCESS)
+ return TEST_FAILED;
+ fired += recorder.num_timeouts;
+
+ if (fired != added)
+ return TEST_FAILED;
+
+ rte_htw_destroy(htw);
+
+ free(timers);
+
+ return TEST_SUCCESS;
+}
+
+struct counter_state {
+ int calls;
+ struct rte_htw *htw;
+ bool cancel;
+};
+
+static void
+count_timeouts_cb(struct rte_htimer *timer __rte_unused, void *arg)
+{
+ struct counter_state *state = arg;
+
+ state->calls++;
+
+ if (state->cancel)
+ rte_htw_cancel(state->htw, timer);
+}
+
+static int
+test_single_timeout_type(uint64_t now, uint64_t distance, bool use_absolute)
+{
+ struct rte_htw *htw;
+ struct counter_state cstate = {};
+ struct rte_htimer timer;
+ uint64_t expiration_time;
+ unsigned int flags = 0;
+
+ htw = rte_htw_create();
+
+ rte_htw_manage(htw, now);
+
+ if (use_absolute) {
+ expiration_time = now + distance;
+ flags |= RTE_HTIMER_FLAG_ABSOLUTE_TIME;
+ } else
+ expiration_time = distance;
+
+ rte_htw_add(htw, &timer, expiration_time, 0, count_timeouts_cb,
+ &cstate, flags);
+
+ rte_htw_manage(htw, now);
+
+ if (cstate.calls != 0)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + distance - 1);
+
+ if (cstate.calls != 0)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + distance);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + distance);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + distance + 1);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_destroy(htw);
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_single_timeout(uint64_t now, uint64_t distance)
+{
+ int rc;
+
+ rc = test_single_timeout_type(now, distance, true);
+ if (rc < 0)
+ return rc;
+
+ rc = test_single_timeout_type(now, distance, false);
+ if (rc < 0)
+ return rc;
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_periodical_timer(uint64_t now, uint64_t start, uint64_t period)
+{
+ struct rte_htw *htw;
+ struct counter_state cstate;
+ struct rte_htimer timer;
+
+ htw = rte_htw_create();
+
+ cstate = (struct counter_state) {
+ .htw = htw
+ };
+
+ rte_htw_manage(htw, now);
+
+ rte_htw_add(htw, &timer, start, period, count_timeouts_cb,
+ &cstate, RTE_HTIMER_FLAG_PERIODICAL);
+
+ rte_htw_manage(htw, now);
+
+ if (cstate.calls != 0)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start - 1);
+
+ if (cstate.calls != 0)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start + 1);
+
+ if (cstate.calls != 1)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start + period);
+
+ if (cstate.calls != 2)
+ return TEST_FAILED;
+
+ cstate.cancel = true;
+
+ rte_htw_manage(htw, now + start + 2 * period);
+
+ if (cstate.calls != 3)
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now + start + 3 * period);
+
+ if (cstate.calls != 3)
+ return TEST_FAILED;
+
+ rte_htw_destroy(htw);
+
+ return TEST_SUCCESS;
+}
+
+#define CANCEL_ITERATIONS 1000
+#define CANCEL_NUM_TIMERS 1000
+#define CANCEL_MAX_DISTANCE 10000
+
+static int
+test_cancel_timer(void)
+{
+ uint64_t now;
+ struct rte_htw *htw;
+ int i;
+ struct rte_htimer timers[CANCEL_NUM_TIMERS];
+ struct counter_state timeouts[CANCEL_NUM_TIMERS];
+
+ now = rte_rand_max(UINT64_MAX / 2);
+
+ htw = rte_htw_create();
+
+ for (i = 0; i < CANCEL_ITERATIONS; i++) {
+ int j;
+ int target;
+
+ for (j = 0; j < CANCEL_NUM_TIMERS; j++) {
+ struct rte_htimer *timer = &timers[j];
+ uint64_t expiration_time;
+
+ timeouts[j] = (struct counter_state) {};
+
+ expiration_time = now + 1 +
+ rte_rand_max(CANCEL_MAX_DISTANCE);
+
+ rte_htw_add(htw, timer, expiration_time, 0,
+ count_timeouts_cb, &timeouts[j],
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME);
+ }
+
+ target = rte_rand_max(CANCEL_NUM_TIMERS);
+
+ rte_htw_cancel(htw, &timers[target]);
+
+ now += CANCEL_MAX_DISTANCE;
+
+ rte_htw_manage(htw, now);
+
+ for (j = 0; j < CANCEL_NUM_TIMERS; j++) {
+ if (j != target) {
+ if (timeouts[j].calls != 1)
+ return TEST_FAILED;
+ } else {
+ if (timeouts[j].calls > 0)
+ return TEST_FAILED;
+ }
+ }
+ }
+
+ rte_htw_destroy(htw);
+
+ return TEST_SUCCESS;
+}
+
+static void
+nop_cb(struct rte_htimer *timer __rte_unused, void *arg __rte_unused)
+{
+}
+
+#define NEXT_NUM_TIMERS 1000
+#define NEXT_MAX_DISTANCE 10000
+
+static int
+test_next_timeout(void)
+{
+ uint64_t now;
+ struct rte_htw *htw;
+ int i;
+ struct rte_htimer timers[NEXT_NUM_TIMERS];
+ uint64_t last_expiration;
+
+ now = rte_rand_max(NEXT_MAX_DISTANCE);
+
+ htw = rte_htw_create();
+
+ if (rte_htw_next_timeout(htw, UINT64_MAX) != UINT64_MAX)
+ return TEST_FAILED;
+ if (rte_htw_next_timeout(htw, now + 1) != (now + 1))
+ return TEST_FAILED;
+
+ rte_htw_manage(htw, now);
+
+ last_expiration = now + NEXT_MAX_DISTANCE * NEXT_NUM_TIMERS;
+
+ for (i = 0; i < NEXT_NUM_TIMERS; i++) {
+ struct rte_htimer *timer = &timers[i];
+ uint64_t expiration;
+ uint64_t upper_bound;
+
+ /* add timers, each new one closer than the last */
+
+ expiration = last_expiration - rte_rand_max(NEXT_MAX_DISTANCE);
+
+ rte_htw_add(htw, timer, expiration, 0, nop_cb, NULL,
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME);
+
+ if (rte_htw_next_timeout(htw, UINT64_MAX) != expiration)
+ return TEST_FAILED;
+
+ upper_bound = expiration + rte_rand_max(100000);
+
+ if (rte_htw_next_timeout(htw, upper_bound) != expiration)
+ return TEST_FAILED;
+
+ upper_bound = expiration - rte_rand_max(expiration);
+
+ if (rte_htw_next_timeout(htw, upper_bound) != upper_bound)
+ return TEST_FAILED;
+
+ last_expiration = expiration;
+ }
+
+ rte_htw_destroy(htw);
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_htw(void)
+{
+ if (test_single_timeout(0, 10) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_single_timeout(0, 254) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_single_timeout(0, 255) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_single_timeout(255, 1) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_single_timeout(254, 2) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_periodical_timer(10000, 500, 2) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_periodical_timer(1234567, 12345, 100000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_cancel_timer() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_rand_timers(1000, 100000, 100000000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_rand_timers(100000, 100000, 1000000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_next_timeout() != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(htw_autotest, test_htw);
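add_rand_timers() in the test above flips a coin between absolute and relative expiration times, and when converting absolute to relative it clamps times already in the past to zero. That conversion can be isolated as follows (the function name is illustrative):

```c
#include <stdint.h>

/* convert an absolute expiration time to one relative to the wheel's
 * current time, clamping already-past times to "expire immediately" */
static uint64_t abs_to_rel_expiry(uint64_t expiration_time,
				  uint64_t current_time)
{
	if (expiration_time < current_time)
		return 0;

	return expiration_time - current_time;
}
```

The explicit comparison avoids the unsigned underflow that a bare subtraction would produce for past deadlines.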
diff --git a/app/test/test_htw_perf.c b/app/test/test_htw_perf.c
new file mode 100644
index 0000000000..65901f0874
--- /dev/null
+++ b/app/test/test_htw_perf.c
@@ -0,0 +1,181 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include "test.h"
+
+#include <sys/queue.h>
+#include <stdlib.h>
+#include <inttypes.h>
+
+#include <rte_cycles.h>
+#include <rte_htw.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+
+static void
+nop_cb(struct rte_htimer *timer __rte_unused, void *arg __rte_unused)
+{
+}
+
+static void
+add_rand_timers(struct rte_htw *htw, struct rte_htimer *timers,
+ uint64_t num, uint64_t timeout_start, uint64_t max_timeout)
+{
+ uint64_t i;
+ uint64_t expiration_times[num];
+ uint64_t start_ts;
+ uint64_t end_ts;
+
+ for (i = 0; i < num; i++)
+ expiration_times[i] = timeout_start + rte_rand_max(max_timeout);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (i = 0; i < num; i++) {
+ struct rte_htimer *timer = &timers[i];
+
+ rte_htw_add(htw, timer, expiration_times[i], 0, nop_cb, NULL,
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME);
+ }
+
+ /* actually install the timers */
+ rte_htw_process(htw);
+
+ end_ts = rte_get_tsc_cycles();
+
+ printf(" %.0f TSC cycles / add op\n",
+ (double)(end_ts - start_ts) / num);
+}
+
+#define TIME_STEP 16
+
+static int
+test_add_manage_perf(const char *scenario_name, uint64_t num_timers,
+ uint64_t timespan)
+{
+ uint64_t manage_calls;
+ struct rte_htimer *timers;
+ uint64_t start;
+ uint64_t now;
+ struct rte_htw *htw;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ double latency;
+
+ manage_calls = timespan / TIME_STEP;
+
+ printf("Scenario: %s\n", scenario_name);
+ printf(" Configuration:\n");
+ printf(" Timers: %"PRIu64"\n", num_timers);
+ printf(" Max timeout: %"PRIu64" ticks\n", timespan);
+ printf(" Average timeouts/manage call: %.3f\n",
+ num_timers / (double)manage_calls);
+ printf(" Time advance per manage call: %d\n", TIME_STEP);
+
+ printf(" Results:\n");
+
+ timers = rte_malloc(NULL, sizeof(struct rte_htimer) *
+ num_timers, 0);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ htw = rte_htw_create();
+
+ if (htw == NULL)
+ return TEST_FAILED;
+
+ start = 1 + rte_rand_max(UINT64_MAX / 2);
+
+ rte_htw_manage(htw, start - 1);
+
+ add_rand_timers(htw, timers, num_timers, start, timespan);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (now = start; now < (start + timespan); now += TIME_STEP)
+ rte_htw_manage(htw, now);
+
+ end_ts = rte_get_tsc_cycles();
+
+ latency = end_ts - start_ts;
+
+ printf(" %.0f TSC cycles / manage call\n",
+ latency / manage_calls);
+ printf(" %.1f TSC cycles / tick\n", latency / timespan);
+
+ rte_htw_destroy(htw);
+
+ rte_free(timers);
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_cancel_perf(uint64_t num_timers, uint64_t timespan)
+{
+ struct rte_htimer *timers;
+ uint64_t start;
+ struct rte_htw *htw;
+ uint64_t i;
+ uint64_t start_ts;
+ uint64_t end_ts;
+ double latency;
+
+ timers = rte_malloc(NULL, sizeof(struct rte_htimer) * num_timers, 0);
+
+ if (timers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ htw = rte_htw_create();
+
+ if (htw == NULL)
+ return TEST_FAILED;
+
+ start = 1 + rte_rand_max(UINT64_MAX / 2);
+
+ rte_htw_manage(htw, start - 1);
+
+ add_rand_timers(htw, timers, num_timers, start, timespan);
+
+ start_ts = rte_get_tsc_cycles();
+
+ for (i = 0; i < num_timers; i++)
+ rte_htw_cancel(htw, &timers[i]);
+
+ end_ts = rte_get_tsc_cycles();
+
+ latency = end_ts - start_ts;
+
+ printf("Timer delete: %.0f TSC cycles / call\n",
+ latency / num_timers);
+
+ rte_htw_destroy(htw);
+
+ rte_free(timers);
+
+ return TEST_SUCCESS;
+}
+
+static int
+test_htw_perf(void)
+{
+ rte_delay_us_block(100);
+
+ if (test_add_manage_perf("Sparse", 100000, 10000000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_add_manage_perf("Dense", 100000, 200000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_add_manage_perf("Idle", 10, 100000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ if (test_cancel_perf(100000, 100000) != TEST_SUCCESS)
+ return TEST_FAILED;
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(htw_perf_autotest, test_htw_perf);
diff --git a/app/test/test_timer_htimer_htw_perf.c b/app/test/test_timer_htimer_htw_perf.c
new file mode 100644
index 0000000000..e51fc7282f
--- /dev/null
+++ b/app/test/test_timer_htimer_htw_perf.c
@@ -0,0 +1,693 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include "test.h"
+
+#include <inttypes.h>
+#include <stdlib.h>
+#include <string.h>
+
+#include <rte_cycles.h>
+#include <rte_htimer_mgr.h>
+#include <rte_htw.h>
+#include <rte_lcore.h>
+#include <rte_malloc.h>
+#include <rte_random.h>
+#include <rte_timer.h>
+
+static uint64_t
+s_to_tsc(double s)
+{
+ return s * rte_get_tsc_hz();
+}
+
+static double
+tsc_to_s(uint64_t tsc)
+{
+ return (double)tsc / (double)rte_get_tsc_hz();
+}
+
+struct timer_conf {
+ uint64_t start;
+ uint64_t interval;
+};
+
+static void
+get_timer_confs(double aggregate_expiration_rate,
+ struct timer_conf *timer_confs,
+ size_t num_timers)
+{
+ double avg_expiration_rate;
+ size_t i;
+
+ avg_expiration_rate = aggregate_expiration_rate / num_timers;
+
+ for (i = 0; i < num_timers; i++) {
+ struct timer_conf *conf = &timer_confs[i];
+ double expiration_rate;
+
+ expiration_rate = avg_expiration_rate * (rte_drand() + 0.5);
+
+ conf->interval = rte_get_tsc_hz() / expiration_rate;
+ conf->start = rte_rand_max(conf->interval);
+ }
+}
+
+struct timer_lib_ops {
+ const char *name;
+
+ void *(*create)(const struct timer_conf *timer_confs,
+ size_t num_timers, bool cancel, uint64_t *fired);
+ void (*manage_time)(void *data, uint64_t current_time);
+ void (*manage)(void *data);
+ void (*destroy)(void *data);
+};
+
+static void *
+nop_create(const struct timer_conf *timer_confs __rte_unused,
+ size_t num_timers __rte_unused, bool cancel __rte_unused,
+ uint64_t *fired __rte_unused)
+{
+ return NULL;
+}
+
+static __rte_noinline void
+nop_manage(void *data __rte_unused)
+{
+}
+
+static __rte_noinline void
+nop_manage_time(void *data __rte_unused, uint64_t current_time __rte_unused)
+{
+}
+
+static void
+nop_destroy(void *data __rte_unused)
+{
+}
+
+static struct timer_lib_ops nop_ops = {
+ .name = "nop",
+ .create = nop_create,
+ .manage = nop_manage,
+ .manage_time = nop_manage_time,
+ .destroy = nop_destroy
+};
+
+struct ctimer {
+ uint64_t interval;
+ struct rte_timer timer;
+ uint64_t cancel_offset;
+ struct rte_timer canceled_timer;
+};
+
+static void
+crash_cb(struct rte_timer *timer __rte_unused, void *cb_arg __rte_unused)
+{
+ abort();
+}
+
+#define CANCELED_OFFSET (0.5) /* s */
+
+static void
+test_cb(struct rte_timer *timer, void *cb_arg)
+{
+ struct ctimer *ctimer =
+ container_of(timer, struct ctimer, timer);
+ uint64_t *fired = cb_arg;
+
+ rte_timer_reset(timer, ctimer->interval, SINGLE,
+ rte_lcore_id(), test_cb, cb_arg);
+
+ if (ctimer->cancel_offset > 0)
+ rte_timer_reset(&ctimer->canceled_timer,
+ ctimer->interval + ctimer->cancel_offset,
+ SINGLE, rte_lcore_id(), crash_cb, NULL);
+
+ (*fired)++;
+}
+
+static void *
+timer_create1(const struct timer_conf *timer_confs, size_t num_timers,
+ bool cancel, uint64_t *fired)
+{
+ struct ctimer *ctimers;
+ unsigned int i;
+
+ ctimers = rte_malloc(NULL, sizeof(struct ctimer) * num_timers, 0);
+
+ if (num_timers > 0 && ctimers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ rte_timer_subsystem_init();
+
+ for (i = 0; i < num_timers; i++) {
+ const struct timer_conf *timer_conf = &timer_confs[i];
+ struct ctimer *ctimer = &ctimers[i];
+ struct rte_timer *timer = &ctimer->timer;
+
+ rte_timer_init(timer);
+
+ ctimer->interval = timer_conf->interval;
+
+ rte_timer_reset(timer, timer_conf->start, SINGLE,
+ rte_lcore_id(), test_cb, fired);
+
+ if (cancel) {
+ ctimer->cancel_offset = s_to_tsc(CANCELED_OFFSET);
+
+ rte_timer_reset(&ctimer->canceled_timer,
+ timer_conf->start + ctimer->cancel_offset,
+ SINGLE, rte_lcore_id(),
+ crash_cb, NULL);
+ } else
+ ctimer->cancel_offset = 0;
+ }
+
+ return ctimers;
+}
+
+static void
+timer_manage(void *data __rte_unused)
+{
+ rte_timer_manage();
+}
+
+static void
+timer_manage_time(void *data __rte_unused, uint64_t current_time __rte_unused)
+{
+ rte_timer_manage();
+}
+
+static void
+timer_destroy(void *data)
+{
+ rte_free(data);
+
+ rte_timer_subsystem_finalize();
+}
+
+static struct timer_lib_ops timer_ops = {
+ .name = "timer",
+ .create = timer_create1,
+ .manage = timer_manage,
+ .manage_time = timer_manage_time,
+ .destroy = timer_destroy
+};
+
+struct chtimer {
+ uint64_t interval;
+ struct rte_htimer htimer;
+ uint64_t cancel_offset;
+ struct rte_htimer canceled_htimer;
+};
+
+static void
+hcrash_cb(struct rte_htimer *timer __rte_unused, void *cb_arg __rte_unused)
+{
+ abort();
+}
+
+static void
+htest_cb(struct rte_htimer *timer, void *cb_arg)
+{
+ struct chtimer *chtimer =
+ container_of(timer, struct chtimer, htimer);
+ uint64_t *fired = cb_arg;
+
+ rte_htimer_mgr_add(timer, chtimer->interval, 0, htest_cb, cb_arg,
+ RTE_HTIMER_FLAG_TIME_TSC);
+
+ if (chtimer->cancel_offset > 0) {
+ struct rte_htimer *canceled_htimer =
+ &chtimer->canceled_htimer;
+ uint64_t cancel_expiration_time = chtimer->interval +
+ chtimer->cancel_offset;
+
+ rte_htimer_mgr_cancel(canceled_htimer);
+
+ rte_htimer_mgr_add(canceled_htimer, cancel_expiration_time, 0,
+ hcrash_cb, NULL, RTE_HTIMER_FLAG_TIME_TSC);
+ }
+
+ (*fired)++;
+}
+
+#define TICK_LENGTH (1e-6)
+
+static void *
+htimer_create(const struct timer_conf *timer_confs, size_t num_timers,
+ bool cancel, uint64_t *fired)
+{
+ struct chtimer *chtimers;
+ unsigned int i;
+
+ chtimers = rte_malloc(NULL, sizeof(struct chtimer) * num_timers, 0);
+
+ if (num_timers > 0 && chtimers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ rte_htimer_mgr_init(TICK_LENGTH * NS_PER_S);
+
+ rte_htimer_mgr_manage();
+
+ for (i = 0; i < num_timers; i++) {
+ const struct timer_conf *timer_conf = &timer_confs[i];
+ struct chtimer *chtimer = &chtimers[i];
+
+ chtimer->interval = timer_conf->interval;
+
+ rte_htimer_mgr_add(&chtimer->htimer, timer_conf->start, 0,
+ htest_cb, fired, RTE_HTIMER_FLAG_TIME_TSC);
+
+ if (cancel) {
+ uint64_t cancel_start;
+
+ chtimer->cancel_offset = s_to_tsc(CANCELED_OFFSET);
+
+ cancel_start =
+ timer_conf->start + chtimer->cancel_offset;
+
+ rte_htimer_mgr_add(&chtimer->canceled_htimer,
+ cancel_start, 0,
+ hcrash_cb, NULL,
+ RTE_HTIMER_FLAG_TIME_TSC);
+ } else
+ chtimer->cancel_offset = 0;
+ }
+
+ rte_htimer_mgr_process();
+
+ return chtimers;
+}
+
+static void
+htimer_manage(void *data __rte_unused)
+{
+ rte_htimer_mgr_manage();
+}
+
+static void
+htimer_manage_time(void *data __rte_unused, uint64_t current_time)
+{
+ rte_htimer_mgr_manage_time(current_time, RTE_HTIMER_FLAG_TIME_TSC);
+}
+
+static void
+htimer_destroy(void *data)
+{
+ rte_free(data);
+
+ rte_htimer_mgr_deinit();
+}
+
+static struct timer_lib_ops htimer_ops = {
+ .name = "htimer",
+ .create = htimer_create,
+ .manage = htimer_manage,
+ .manage_time = htimer_manage_time,
+ .destroy = htimer_destroy,
+};
+
+struct htw {
+ struct rte_htw *htw;
+ struct chtimer *chtimers;
+ uint64_t tsc_per_tick;
+ uint64_t *fired;
+};
+
+static void
+htw_manage_time(void *timer_data, uint64_t current_time)
+{
+ struct htw *htw = timer_data;
+ uint64_t tick;
+
+ tick = current_time / htw->tsc_per_tick;
+
+ rte_htw_manage(htw->htw, tick);
+}
+
+static void
+htw_manage(void *timer_data)
+{
+ uint64_t now;
+
+ now = rte_get_tsc_cycles();
+
+ htw_manage_time(timer_data, now);
+}
+
+static void
+htwcrash_cb(struct rte_htimer *timer __rte_unused, void *cb_arg __rte_unused)
+{
+ abort();
+}
+
+static void
+htwtest_cb(struct rte_htimer *timer, void *cb_arg)
+{
+ struct chtimer *chtimer =
+ container_of(timer, struct chtimer, htimer);
+ struct htw *htw = cb_arg;
+
+ rte_htw_add(htw->htw, timer, chtimer->interval, 0, htwtest_cb,
+ cb_arg, 0);
+
+ if (chtimer->cancel_offset > 0) {
+ struct rte_htimer *canceled_htimer =
+ &chtimer->canceled_htimer;
+ uint64_t cancel_expiration_time = chtimer->interval +
+ chtimer->cancel_offset;
+
+ rte_htw_cancel(htw->htw, canceled_htimer);
+
+ rte_htw_add(htw->htw, canceled_htimer,
+ cancel_expiration_time, 0,
+ htwcrash_cb, cb_arg, 0);
+ }
+
+ (*htw->fired)++;
+}
+
+static void *
+htw_create(const struct timer_conf *timer_confs, size_t num_timers,
+ bool cancel, uint64_t *fired)
+{
+ unsigned int i;
+ struct htw *htw;
+
+ htw = rte_malloc(NULL, sizeof(struct htw), 0);
+ if (htw == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ htw->htw = rte_htw_create();
+	if (htw->htw == NULL)
+ rte_panic("Unable to create HTW\n");
+
+ htw->chtimers =
+ rte_malloc(NULL, sizeof(struct chtimer) * num_timers, 0);
+ if (num_timers > 0 && htw->chtimers == NULL)
+ rte_panic("Unable to allocate memory\n");
+
+ htw->tsc_per_tick = s_to_tsc(TICK_LENGTH);
+
+ htw->fired = fired;
+
+ htw_manage(htw);
+
+ for (i = 0; i < num_timers; i++) {
+ const struct timer_conf *timer_conf = &timer_confs[i];
+ struct chtimer *chtimer = &htw->chtimers[i];
+ uint64_t start;
+
+ chtimer->interval = timer_conf->interval / htw->tsc_per_tick;
+
+ start = timer_conf->start / htw->tsc_per_tick;
+
+ rte_htw_add(htw->htw, &chtimer->htimer,
+ start, 0, htwtest_cb, htw, 0);
+
+ if (cancel) {
+ uint64_t cancel_start;
+
+ chtimer->cancel_offset =
+ s_to_tsc(CANCELED_OFFSET) / htw->tsc_per_tick;
+
+ cancel_start = start + chtimer->cancel_offset;
+
+ rte_htw_add(htw->htw, &chtimer->canceled_htimer,
+ cancel_start, 0, htwcrash_cb, NULL, 0);
+ } else
+ chtimer->cancel_offset = 0;
+ }
+
+ rte_htw_process(htw->htw);
+
+ return htw;
+}
+
+static void
+htw_destroy(void *data)
+{
+ struct htw *htw = data;
+
+ rte_htw_destroy(htw->htw);
+
+ rte_free(htw->chtimers);
+
+ rte_free(htw);
+}
+
+static struct timer_lib_ops htw_ops = {
+ .name = "htw",
+ .create = htw_create,
+ .manage = htw_manage,
+ .manage_time = htw_manage_time,
+ .destroy = htw_destroy,
+};
+
+static const struct timer_lib_ops *lib_ops[] = {
+ &timer_ops, &htimer_ops, &htw_ops
+};
+
+#define DUMMY_TASK_SIZE (2500)
+
+static __rte_noinline uint64_t
+do_dummy_task(void)
+{
+ uint64_t result = 0;
+ unsigned int i;
+
+ for (i = 0; i < DUMMY_TASK_SIZE; i++)
+ result += rte_rand();
+
+ return result;
+}
+
+struct work_log {
+ uint64_t tasks_completed;
+ uint64_t runtime;
+};
+
+#define TARGET_RUNTIME (4.0) /* s */
+
+struct run_result {
+ uint64_t tasks_completed;
+ uint64_t timer_fired;
+ uint64_t latency;
+};
+
+static void
+run_with_lib(const struct timer_lib_ops *timer_ops,
+ const struct timer_conf *timer_confs, size_t num_timers,
+ bool cancel, struct run_result *result)
+{
+ void *timer_data;
+ uint64_t deadline;
+ uint64_t start;
+ uint64_t now;
+ volatile uint64_t sum = 0;
+
+ result->tasks_completed = 0;
+ result->timer_fired = 0;
+
+ timer_data = timer_ops->create(timer_confs, num_timers, cancel,
+ &result->timer_fired);
+
+ start = rte_get_tsc_cycles();
+
+ deadline = start + s_to_tsc(TARGET_RUNTIME);
+
+ do {
+ sum += do_dummy_task();
+
+ result->tasks_completed++;
+
+ now = rte_get_tsc_cycles();
+
+ timer_ops->manage_time(timer_data, now);
+ } while (now < deadline);
+
+ RTE_VERIFY(sum != 0);
+
+ result->latency = rte_get_tsc_cycles() - start;
+
+ timer_ops->destroy(timer_data);
+}
+
+static void
+benchmark_timer_libs(double aggregate_expiration_rate, uint64_t num_timers,
+ bool cancel)
+{
+ struct timer_conf timer_confs[num_timers];
+ struct run_result nop_result;
+ double nop_per_task_latency;
+ struct run_result lib_results[RTE_DIM(lib_ops)];
+ uint64_t lib_overhead[RTE_DIM(lib_ops)];
+
+ unsigned int i;
+
+ printf("Configuration:\n");
+ printf(" Aggregate timer expiration rate: %.3e Hz\n",
+ aggregate_expiration_rate);
+ if (cancel)
+ printf(" Aggregate timer cancellation rate: %.3e Hz\n",
+ aggregate_expiration_rate);
+	printf("  Concurrent timers: %"PRIu64"\n", num_timers);
+ printf(" Tick length: %.1e s\n", TICK_LENGTH);
+
+ rte_srand(4711);
+
+ get_timer_confs(aggregate_expiration_rate, timer_confs, num_timers);
+
+ run_with_lib(&nop_ops, NULL, 0, false, &nop_result);
+ nop_per_task_latency =
+ (double)nop_result.latency / nop_result.tasks_completed;
+
+ for (i = 0; i < RTE_DIM(lib_ops); i++) {
+ struct run_result *lib_result = &lib_results[i];
+ double per_task_latency;
+
+ run_with_lib(lib_ops[i], timer_confs, num_timers, cancel,
+ lib_result);
+
+ per_task_latency = (double)lib_result->latency /
+ lib_result->tasks_completed;
+
+ if (per_task_latency > nop_per_task_latency)
+ lib_overhead[i] =
+ (per_task_latency - nop_per_task_latency) *
+ lib_result->tasks_completed;
+ else
+ lib_overhead[i] = 0;
+ }
+
+ printf("Results:\n");
+
+ printf(" Work between manage calls: %.0f TSC cycles\n",
+ (double)nop_result.latency / nop_result.tasks_completed);
+
+ printf("\n");
+ printf("%-24s", "");
+ for (i = 0; i < RTE_DIM(lib_ops); i++)
+ printf("%12s", lib_ops[i]->name);
+ printf("\n");
+
+ printf("%-24s", " Runtime [s]");
+ for (i = 0; i < RTE_DIM(lib_ops); i++)
+ printf("%12.3e", tsc_to_s(lib_results[i].latency));
+ printf("\n");
+
+ printf("%-24s", " Expiration rate [Hz]");
+ for (i = 0; i < RTE_DIM(lib_ops); i++)
+ printf("%12.3e", lib_results[i].timer_fired /
+ tsc_to_s(lib_results[i].latency));
+ printf("\n");
+
+	printf("%-24s", "  Overhead [%]");
+ for (i = 0; i < RTE_DIM(lib_ops); i++)
+ printf("%12.3f", 100 * (double)lib_overhead[i] /
+ (double)lib_results[i].latency);
+ printf("\n");
+
+ printf("%-24s", " Per expiration [TSC]");
+ for (i = 0; i < RTE_DIM(lib_ops); i++)
+ printf("%12"PRIu64, lib_overhead[i] /
+ lib_results[i].timer_fired);
+ printf("\n");
+
+ printf("%-24s", " Per manage() [TSC]");
+ for (i = 0; i < RTE_DIM(lib_ops); i++)
+ printf("%12"PRIu64, lib_overhead[i] /
+ lib_results[i].tasks_completed);
+ printf("\n");
+}
+
+static void
+benchmark_timer_libs_mode(double aggregate_expiration_rate, bool cancel)
+{
+ benchmark_timer_libs(aggregate_expiration_rate, 100, cancel);
+ benchmark_timer_libs(aggregate_expiration_rate, 100000, cancel);
+}
+
+static void
+benchmark_timer_libs_rate(double aggregate_expiration_rate)
+{
+ benchmark_timer_libs_mode(aggregate_expiration_rate, false);
+ benchmark_timer_libs_mode(aggregate_expiration_rate, true);
+}
+
+#define MANAGE_ITERATIONS (10000000)
+
+static uint64_t
+run_manage(const struct timer_lib_ops *timer_ops, bool user_provided_time)
+{
+ uint64_t start;
+ uint64_t latency;
+ void *timer_data;
+
+	timer_data = timer_ops->create(NULL, 0, false, NULL);
+
+ start = rte_get_tsc_cycles();
+
+ unsigned int i;
+ for (i = 0; i < MANAGE_ITERATIONS; i++)
+ if (user_provided_time && timer_ops->manage_time != NULL) {
+ uint64_t now;
+
+ now = rte_get_tsc_cycles();
+
+ timer_ops->manage_time(timer_data, now);
+ } else
+ timer_ops->manage(timer_data);
+
+ latency = rte_get_tsc_cycles() - start;
+
+ timer_ops->destroy(timer_data);
+
+ return latency / MANAGE_ITERATIONS;
+}
+
+static void
+benchmark_timer_libs_timeless_manage(bool user_provided_time)
+{
+ unsigned int i;
+ uint64_t nop_latency;
+
+ nop_latency = run_manage(&nop_ops, user_provided_time);
+
+ printf("Zero-timers manage() overhead%s:\n", user_provided_time ?
+ " (w/ user-provided time)" : "");
+
+ for (i = 0; i < RTE_DIM(lib_ops); i++) {
+ const struct timer_lib_ops *ops = lib_ops[i];
+ uint64_t latency;
+
+ latency = run_manage(ops, user_provided_time);
+
+ if (latency > nop_latency)
+ latency -= nop_latency;
+ else
+ latency = 0;
+
+ printf(" %s: %"PRIu64" TSC cycles\n", ops->name, latency);
+ }
+}
+
+static int
+test_timer_htimer_htw_perf(void)
+{
+ /* warm up */
+ rte_delay_us_block(10000);
+
+ benchmark_timer_libs_rate(1e6);
+
+ benchmark_timer_libs_timeless_manage(false);
+ benchmark_timer_libs_timeless_manage(true);
+
+ return TEST_SUCCESS;
+}
+
+REGISTER_TEST_COMMAND(timer_htimer_htw_perf_autotest,
+ test_timer_htimer_htw_perf);
diff --git a/doc/api/doxy-api-index.md b/doc/api/doxy-api-index.md
index 2deec7ea19..5ea1dfa262 100644
--- a/doc/api/doxy-api-index.md
+++ b/doc/api/doxy-api-index.md
@@ -67,6 +67,8 @@ The public API headers are grouped by topics:
- **timers**:
[cycles](@ref rte_cycles.h),
[timer](@ref rte_timer.h),
+ [htimer_mgr](@ref rte_htimer_mgr.h),
+ [htimer](@ref rte_htimer.h),
[alarm](@ref rte_alarm.h)
- **locks**:
@@ -163,7 +165,8 @@ The public API headers are grouped by topics:
[ring](@ref rte_ring.h),
[stack](@ref rte_stack.h),
[tailq](@ref rte_tailq.h),
- [bitmap](@ref rte_bitmap.h)
+ [bitmap](@ref rte_bitmap.h),
+ [bitset](@ref rte_bitset.h)
- **packet framework**:
* [port](@ref rte_port.h):
diff --git a/doc/api/doxy-api.conf.in b/doc/api/doxy-api.conf.in
index e859426099..c0cd64db34 100644
--- a/doc/api/doxy-api.conf.in
+++ b/doc/api/doxy-api.conf.in
@@ -45,6 +45,7 @@ INPUT = @TOPDIR@/doc/api/doxy-api-index.md \
@TOPDIR@/lib/gro \
@TOPDIR@/lib/gso \
@TOPDIR@/lib/hash \
+ @TOPDIR@/lib/htimer \
@TOPDIR@/lib/ip_frag \
@TOPDIR@/lib/ipsec \
@TOPDIR@/lib/jobstats \
diff --git a/lib/htimer/meson.build b/lib/htimer/meson.build
new file mode 100644
index 0000000000..2dd5d6a24b
--- /dev/null
+++ b/lib/htimer/meson.build
@@ -0,0 +1,7 @@
+# SPDX-License-Identifier: BSD-3-Clause
+# Copyright(c) 2023 Ericsson AB
+
+sources = files('rte_htw.c', 'rte_htimer_msg_ring.c', 'rte_htimer_mgr.c')
+headers = files('rte_htimer_mgr.h', 'rte_htimer.h')
+
+deps += ['ring']
diff --git a/lib/htimer/rte_htimer.h b/lib/htimer/rte_htimer.h
new file mode 100644
index 0000000000..6ac86292b5
--- /dev/null
+++ b/lib/htimer/rte_htimer.h
@@ -0,0 +1,68 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTIMER_H_
+#define _RTE_HTIMER_H_
+
+#include <stdbool.h>
+#include <stdint.h>
+#include <sys/queue.h>
+
+#include <rte_bitops.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+struct rte_htimer;
+
+typedef void (*rte_htimer_cb_t)(struct rte_htimer *, void *);
+
+struct rte_htimer {
+ /**
+ * Absolute timer expiration time (in ticks).
+ */
+ uint64_t expiration_time;
+ /**
+ * Time between expirations (in ticks). Zero for one-shot timers.
+ */
+ uint64_t period;
+ /**
+ * Owning lcore. May safely be read from any thread.
+ */
+ uint32_t owner_lcore_id;
+ /**
+ * The current state of the timer.
+ */
+ uint32_t state:4;
+ /**
+ * Flags set on this timer.
+ */
+ uint32_t flags:28;
+ /**
+ * User-specified callback function pointer.
+ */
+ rte_htimer_cb_t cb;
+ /**
+ * Argument for user callback.
+ */
+ void *cb_arg;
+ /**
+ * Pointers used to add timer to various internal lists.
+ */
+ LIST_ENTRY(rte_htimer) entry;
+};
+
+#define RTE_HTIMER_FLAG_ABSOLUTE_TIME RTE_BIT32(0)
+#define RTE_HTIMER_FLAG_PERIODICAL RTE_BIT32(1)
+#define RTE_HTIMER_FLAG_TIME_TICK RTE_BIT32(2)
+#define RTE_HTIMER_FLAG_TIME_TSC RTE_BIT32(3)
+
+#define RTE_HTIMER_STATE_PENDING 1
+#define RTE_HTIMER_STATE_EXPIRED 2
+#define RTE_HTIMER_STATE_CANCELED 3
+
+LIST_HEAD(rte_htimer_list, rte_htimer);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_HTIMER_H_ */
diff --git a/lib/htimer/rte_htimer_mgr.c b/lib/htimer/rte_htimer_mgr.c
new file mode 100644
index 0000000000..efdfcf0985
--- /dev/null
+++ b/lib/htimer/rte_htimer_mgr.c
@@ -0,0 +1,547 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#include <inttypes.h>
+#include <math.h>
+#include <stdbool.h>
+#include <sys/queue.h>
+#include <unistd.h>
+
+#include <rte_branch_prediction.h>
+#include <rte_common.h>
+#include <rte_cycles.h>
+#include <rte_errno.h>
+#include <rte_htw.h>
+#include <rte_prefetch.h>
+#include <rte_ring_elem.h>
+
+#include "rte_htimer_mgr.h"
+#include "rte_htimer_msg.h"
+#include "rte_htimer_msg_ring.h"
+
+#define MAX_MSG_BATCH_SIZE 16
+
+struct htimer_mgr {
+ struct rte_htimer_msg_ring *msg_ring;
+ struct rte_htw *htw;
+
+ unsigned int async_msgs_idx __rte_cache_aligned;
+ unsigned int num_async_msgs;
+ struct rte_htimer_msg async_msgs[MAX_MSG_BATCH_SIZE];
+} __rte_cache_aligned;
+
+static uint64_t ns_per_tick;
+static double tsc_per_tick;
+
+static struct htimer_mgr mgrs[RTE_MAX_LCORE + 1];
+
+#define MAX_ASYNC_TRANSACTIONS 1024
+#define MSG_RING_SIZE MAX_ASYNC_TRANSACTIONS
+
+static inline uint64_t
+tsc_to_tick(uint64_t tsc)
+{
+ return tsc / tsc_per_tick;
+}
+
+static inline uint64_t
+tsc_to_tick_round_up(uint64_t tsc)
+{
+ uint64_t tick;
+
+	tick = ceil(tsc / tsc_per_tick);
+
+ return tick;
+}
+
+static inline uint64_t
+ns_to_tick(uint64_t ns)
+{
+ return ns / ns_per_tick;
+}
+
+static inline uint64_t
+ns_to_tick_round_up(uint64_t ns)
+{
+ uint64_t tick;
+
+	tick = (ns + ns_per_tick - 1) / ns_per_tick;
+
+ return tick;
+}
+
+static inline uint64_t
+tick_to_ns(uint64_t tick)
+{
+ return tick * ns_per_tick;
+}
+
+static struct htimer_mgr *
+mgr_get(unsigned int lcore_id)
+{
+ return &mgrs[lcore_id];
+}
+
+static int
+mgr_init(unsigned int lcore_id)
+{
+ char ring_name[RTE_RING_NAMESIZE];
+ unsigned int socket_id;
+ struct htimer_mgr *mgr = &mgrs[lcore_id];
+
+ socket_id = rte_lcore_to_socket_id(lcore_id);
+
+ snprintf(ring_name, sizeof(ring_name), "htimer_%d", lcore_id);
+
+ mgr->msg_ring =
+ rte_htimer_msg_ring_create(ring_name, MSG_RING_SIZE, socket_id,
+ RING_F_SC_DEQ);
+
+ if (mgr->msg_ring == NULL)
+ goto err;
+
+ mgr->htw = rte_htw_create();
+
+ if (mgr->htw == NULL)
+ goto err_free_ring;
+
+ mgr->async_msgs_idx = 0;
+ mgr->num_async_msgs = 0;
+
+ return 0;
+
+err_free_ring:
+ rte_htimer_msg_ring_free(mgr->msg_ring);
+err:
+ return -ENOMEM;
+}
+
+static void
+mgr_deinit(unsigned int lcore_id)
+{
+ struct htimer_mgr *mgr = &mgrs[lcore_id];
+
+ rte_htw_destroy(mgr->htw);
+
+ rte_htimer_msg_ring_free(mgr->msg_ring);
+}
+
+static volatile bool initialized;
+
+static void
+assure_initialized(void)
+{
+ RTE_ASSERT(initialized);
+}
+
+int
+rte_htimer_mgr_init(uint64_t _ns_per_tick)
+{
+ unsigned int lcore_id;
+
+ RTE_VERIFY(!initialized);
+
+ ns_per_tick = _ns_per_tick;
+
+ tsc_per_tick = (ns_per_tick / 1e9) * rte_get_tsc_hz();
+
+ for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
+ int rc;
+
+ rc = mgr_init(lcore_id);
+
+ if (rc < 0) {
+ unsigned int deinit_lcore_id;
+
+ for (deinit_lcore_id = 0; deinit_lcore_id < lcore_id;
+ deinit_lcore_id++)
+ mgr_deinit(deinit_lcore_id);
+
+ return rc;
+ }
+ }
+
+ initialized = true;
+
+ return 0;
+}
+
+void
+rte_htimer_mgr_deinit(void)
+{
+ unsigned int lcore_id;
+
+ assure_initialized();
+
+ for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
+ mgr_deinit(lcore_id);
+
+ initialized = false;
+}
+
+static void
+assure_valid_time_conversion_flags(uint32_t flags __rte_unused)
+{
+ RTE_ASSERT(!((flags & RTE_HTIMER_FLAG_TIME_TSC) &&
+ (flags & RTE_HTIMER_FLAG_TIME_TICK)));
+}
+
+static void
+assure_valid_add_flags(uint32_t flags)
+{
+ assure_valid_time_conversion_flags(flags);
+
+ RTE_ASSERT(!(flags & ~(RTE_HTIMER_FLAG_PERIODICAL |
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME |
+ RTE_HTIMER_FLAG_TIME_TSC |
+ RTE_HTIMER_FLAG_TIME_TICK)));
+}
+
+static uint64_t
+convert_time(uint64_t t, uint32_t flags)
+{
+ if (flags & RTE_HTIMER_FLAG_TIME_TSC)
+ return tsc_to_tick(t);
+ else if (flags & RTE_HTIMER_FLAG_TIME_TICK)
+ return t;
+ else
+ return ns_to_tick(t);
+}
+
+void
+rte_htimer_mgr_add(struct rte_htimer *timer, uint64_t expiration_time,
+ uint64_t period, rte_htimer_cb_t timer_cb,
+ void *timer_cb_arg, uint32_t flags)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+ uint64_t expiration_time_tick;
+ uint64_t period_tick;
+
+ assure_initialized();
+
+ assure_valid_add_flags(flags);
+
+ expiration_time_tick = convert_time(expiration_time, flags);
+
+ period_tick = convert_time(period, flags);
+
+ rte_htw_add(mgr->htw, timer, expiration_time_tick, period_tick,
+ timer_cb, timer_cb_arg, flags);
+
+ timer->owner_lcore_id = lcore_id;
+}
+
+int
+rte_htimer_mgr_cancel(struct rte_htimer *timer)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+
+ assure_initialized();
+
+ RTE_ASSERT(timer->owner_lcore_id == lcore_id);
+
+ switch (timer->state) {
+ case RTE_HTIMER_STATE_PENDING:
+ rte_htw_cancel(mgr->htw, timer);
+ return 0;
+ case RTE_HTIMER_STATE_EXPIRED:
+ return -ETIME;
+ default:
+ RTE_ASSERT(timer->state == RTE_HTIMER_STATE_CANCELED);
+ return -ENOENT;
+ }
+}
+
+static int
+send_msg(unsigned int receiver_lcore_id, enum rte_htimer_msg_type msg_type,
+ struct rte_htimer *timer, rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg, const struct rte_htimer_msg_request *request,
+ const struct rte_htimer_msg_response *response)
+{
+ struct htimer_mgr *receiver_mgr;
+ struct rte_htimer_msg_ring *receiver_ring;
+ struct rte_htimer_msg msg = (struct rte_htimer_msg) {
+ .msg_type = msg_type,
+ .timer = timer,
+ .async_cb = async_cb,
+ .async_cb_arg = async_cb_arg
+ };
+ int rc;
+
+ if (request != NULL)
+ msg.request = *request;
+ else
+ msg.response = *response;
+
+ receiver_mgr = mgr_get(receiver_lcore_id);
+
+ receiver_ring = receiver_mgr->msg_ring;
+
+ rc = rte_htimer_msg_ring_enqueue(receiver_ring, &msg);
+
+ return rc;
+}
+
+static int
+send_request(unsigned int receiver_lcore_id, enum rte_htimer_msg_type msg_type,
+ struct rte_htimer *timer,
+ rte_htimer_mgr_async_op_cb_t async_cb, void *async_cb_arg)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct rte_htimer_msg_request request = {
+ .source_lcore_id = lcore_id
+ };
+
+ return send_msg(receiver_lcore_id, msg_type, timer, async_cb,
+ async_cb_arg, &request, NULL);
+}
+
+static int
+send_response(unsigned int receiver_lcore_id, enum rte_htimer_msg_type msg_type,
+ struct rte_htimer *timer,
+ rte_htimer_mgr_async_op_cb_t async_cb, void *async_cb_arg,
+ int result)
+{
+ struct rte_htimer_msg_response response = {
+ .result = result
+ };
+
+ return send_msg(receiver_lcore_id, msg_type, timer, async_cb,
+ async_cb_arg, NULL, &response);
+}
+
+int
+rte_htimer_mgr_async_add(struct rte_htimer *timer,
+ unsigned int target_lcore_id,
+ uint64_t expiration_time, uint64_t period,
+ rte_htimer_cb_t timer_cb, void *timer_cb_arg,
+ uint32_t flags,
+ rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg)
+{
+	assure_initialized();
+
+	*timer = (struct rte_htimer) {
+		.expiration_time = expiration_time,
+		.period = period,
+		.owner_lcore_id = target_lcore_id,
+		.flags = flags,
+		.cb = timer_cb,
+		.cb_arg = timer_cb_arg
+	};
+
+ if (send_request(target_lcore_id, rte_htimer_msg_type_add_request,
+ timer, async_cb, async_cb_arg) < 0)
+ return -EBUSY;
+
+ return 0;
+}
+
+int
+rte_htimer_mgr_async_cancel(struct rte_htimer *timer,
+ rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg)
+{
+ if (send_request(timer->owner_lcore_id,
+ rte_htimer_msg_type_cancel_request,
+ timer, async_cb, async_cb_arg) < 0)
+ return -EBUSY;
+
+ return 0;
+}
+
+static int
+process_add_request(struct rte_htimer_msg *request)
+{
+ struct rte_htimer *timer = request->timer;
+
+ if (request->async_cb != NULL &&
+ send_response(request->request.source_lcore_id,
+ rte_htimer_msg_type_add_response, timer,
+ request->async_cb, request->async_cb_arg,
+ RTE_HTIMER_MGR_ASYNC_RESULT_ADDED) < 0)
+ return -EBUSY;
+
+ rte_htimer_mgr_add(timer, timer->expiration_time, timer->period,
+ timer->cb, timer->cb_arg, timer->flags);
+
+ return 0;
+}
+
+static int
+process_cancel_request(struct rte_htimer_msg *request)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+ struct rte_htimer *timer = request->timer;
+ int result;
+
+ switch (timer->state) {
+ case RTE_HTIMER_STATE_PENDING:
+ result = RTE_HTIMER_MGR_ASYNC_RESULT_CANCELED;
+ break;
+ case RTE_HTIMER_STATE_CANCELED:
+ result = RTE_HTIMER_MGR_ASYNC_RESULT_ALREADY_CANCELED;
+ break;
+ case RTE_HTIMER_STATE_EXPIRED:
+ result = RTE_HTIMER_MGR_ASYNC_RESULT_EXPIRED;
+ break;
+ default:
+ RTE_ASSERT(0);
+ result = -1;
+ }
+
+ if (request->async_cb != NULL &&
+ send_response(request->request.source_lcore_id,
+ rte_htimer_msg_type_cancel_response, timer,
+ request->async_cb, request->async_cb_arg,
+ result) < 0)
+ return -EBUSY;
+
+ if (timer->state == RTE_HTIMER_STATE_PENDING)
+ rte_htw_cancel(mgr->htw, timer);
+
+ return 0;
+}
+
+static int
+process_response(struct rte_htimer_msg *msg)
+{
+ struct rte_htimer_msg_response *response = &msg->response;
+
+ if (msg->async_cb != NULL)
+ msg->async_cb(msg->timer, response->result, msg->async_cb_arg);
+
+ return 0;
+}
+
+static int
+process_msg(struct rte_htimer_msg *msg)
+{
+ switch (msg->msg_type) {
+ case rte_htimer_msg_type_add_request:
+ return process_add_request(msg);
+ case rte_htimer_msg_type_cancel_request:
+ return process_cancel_request(msg);
+ case rte_htimer_msg_type_add_response:
+ case rte_htimer_msg_type_cancel_response:
+ return process_response(msg);
+ default:
+ RTE_ASSERT(0);
+ return -EBUSY;
+ }
+}
+
+static void
+dequeue_async_msgs(struct htimer_mgr *mgr)
+{
+ unsigned int i;
+
+ if (likely(rte_htimer_msg_ring_empty(mgr->msg_ring)))
+ return;
+
+ if (unlikely(mgr->num_async_msgs > 0))
+ return;
+
+ mgr->async_msgs_idx = 0;
+
+ mgr->num_async_msgs =
+ rte_htimer_msg_ring_dequeue_burst(mgr->msg_ring,
+ mgr->async_msgs,
+ MAX_MSG_BATCH_SIZE);
+
+ for (i = 0; i < mgr->num_async_msgs; i++)
+ rte_prefetch1(mgr->async_msgs[i].timer);
+}
+
+static void
+process_async(struct htimer_mgr *mgr)
+{
+ for (;;) {
+ struct rte_htimer_msg *msg;
+
+ dequeue_async_msgs(mgr);
+
+ if (mgr->num_async_msgs == 0)
+ break;
+
+ msg = &mgr->async_msgs[mgr->async_msgs_idx];
+
+ if (process_msg(msg) < 0)
+ break;
+
+ mgr->num_async_msgs--;
+ mgr->async_msgs_idx++;
+ }
+}
+
+static __rte_always_inline void
+htimer_mgr_manage_time(uint64_t current_time, uint32_t flags)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+ uint64_t current_tick;
+
+ assure_initialized();
+
+ assure_valid_time_conversion_flags(flags);
+
+ process_async(mgr);
+
+ current_tick = convert_time(current_time, flags);
+
+ rte_htw_manage(mgr->htw, current_tick);
+}
+
+void
+rte_htimer_mgr_manage_time(uint64_t current_time, uint32_t flags)
+{
+ htimer_mgr_manage_time(current_time, flags);
+}
+
+void
+rte_htimer_mgr_manage(void)
+{
+ uint64_t current_time;
+
+ current_time = rte_get_tsc_cycles();
+
+ htimer_mgr_manage_time(current_time, RTE_HTIMER_FLAG_TIME_TSC);
+}
+
+void
+rte_htimer_mgr_process(void)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+
+	assure_initialized();
+
+	process_async(mgr);
+
+ rte_htw_process(mgr->htw);
+}
+
+uint64_t
+rte_htimer_mgr_current_time(void)
+{
+ uint64_t current_tick;
+
+ current_tick = rte_htimer_mgr_current_tick();
+
+ return tick_to_ns(current_tick);
+}
+
+uint64_t
+rte_htimer_mgr_current_tick(void)
+{
+ unsigned int lcore_id = rte_lcore_id();
+ struct htimer_mgr *mgr = mgr_get(lcore_id);
+ uint64_t current_tick;
+
+ current_tick = rte_htw_current_time(mgr->htw);
+
+ return current_tick;
+}
diff --git a/lib/htimer/rte_htimer_mgr.h b/lib/htimer/rte_htimer_mgr.h
new file mode 100644
index 0000000000..173a95f9c0
--- /dev/null
+++ b/lib/htimer/rte_htimer_mgr.h
@@ -0,0 +1,516 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTIMER_MGR_H_
+#define _RTE_HTIMER_MGR_H_
+
+/**
+ * @file
+ *
+ * RTE High-performance Timer Manager
+ *
+ * The high-performance timer manager (htimer_mgr) API provides access
+ * to a low-overhead, scalable timer service.
+ *
+ * The functionality offered is similar to that of <rte_timer.h>, but
+ * the internals differ significantly, and there are slight
+ * differences in the programming interface as well.
+ *
+ * Core timer management is implemented by means of a hierarchical
+ * timer wheel (HTW), as per the Varghese and Lauck paper <em>Hashed
+ * and Hierarchical Timing Wheels: Data Structures for the Efficient
+ * Implementation of a Timer Facility</em>.
+ *
+ * Varghese et al's approach is further enhanced by the placement of a
+ * bitset in front of each wheel's slots. Each slot has a
+ * corresponding bit in the bitset. If a bit is clear, there are no
+ * pending timers scheduled for that slot. A set bit means there
+ * potentially are timers scheduled for that slot. This scheme reduces
+ * the overhead of the rte_htimer_mgr_manage() function, where slots
+ * of one or more of the wheels of the thread's HTW are scanned if
+ * time has progressed since last call. This improves performance is
+ * all cases, except for very densely populated timer wheels.
+ *
+ * One such HTW is instantiated for each lcore (EAL thread), and
+ * instances are also available for registered non-EAL threads.
+ *
+ * The <rte_htimer_mgr.h> API may not be called from unregistered
+ * non-EAL threads.
+ *
+ * The per-lcore-id HTW instance is private to that thread.
+ *
+ * The htimer API supports scheduling timers to a different thread
+ * (and thus, a different HTW) than the caller's. It is also possible
+ * to cancel timers managed by a "remote" timer wheel.
+ *
+ * All interaction with a remote HTW (i.e., adding timers to or
+ * removing timers from it) is done by sending a request, in the
+ * form of a message on a DPDK ring, to that instance. Such
+ * requests are processed and, if
+ * required, acknowledged when the remote (target) thread calls
+ * rte_htimer_mgr_manage(), rte_htimer_mgr_manage_time() or
+ * rte_htimer_mgr_process().
+ *
+ * This message-based interaction avoids comparatively heavy-weight
+ * synchronization primitives such as spinlocks. Only release-acquire
+ * type synchronization on the rings is needed.
+ *
+ * Timer memory management is the responsibility of the
+ * application. After library-level initialization has completed, no
+ * more dynamic memory is allocated by the htimer library. When
+ * installing timers on remote lcores, care must be taken by the
+ * application to avoid race conditions, in particular use-after-free
+ * (or use-after-recycle) issues of the rte_htimer structure. A timer
+ * struct may only be deallocated and/or recycled if the application
+ * can guarantee that there are no cancel requests in flight.
+ *
+ * The htimer library is able to give a definitive answer to the
+ * question of whether a remote timer had expired or not at the time
+ * of cancellation.
+ *
+ * The htimer library uses TSC as the default time source. A different
+ * time source may be used, in which case the application must
+ * explicitly provide the time using rte_htimer_mgr_manage_time().
+ * This function may also be used even if TSC is the time source, in
+ * cases where the application for some other purpose already is in
+ * possession of the current TSC time, to avoid the overhead of
+ * htimer's `rdtsc` instruction (or its equivalent on non-x86 ISAs).
+ *
+ * The htimer library supports both periodic and single-shot timers.
+ *
+ * The timer tick defines a quantum of time in the htimer library. The
+ * length of a tick (quantified in nanoseconds) is left to the
+ * application to specify. The core HTW implementation allows for all
+ * 64 bits to be used.
+ *
+ * Very fine-grained ticks increase the HTW overhead (since more slots
+ * need to be scanned). Long ticks only allow for very coarse-grained
+ * timers, and in timer-heavy applications may cause load spikes when
+ * time advances into a new tick.
+ *
+ * A reasonable timer tick length typically lies somewhere between
+ * 100 ns and 100 us (or maybe up to as high as 1 ms), depending on
+ * the application.
+ */
+
+#include <stdint.h>
+
+#include <rte_common.h>
+#include <rte_compat.h>
+#include <rte_htimer.h>
+
+/**
+ * The timer has been added to the timer manager on the target lcore.
+ */
+#define RTE_HTIMER_MGR_ASYNC_RESULT_ADDED 1
+
+/**
+ * The timer cancellation request has completed, before the timer expired
+ * on the target lcore.
+ */
+#define RTE_HTIMER_MGR_ASYNC_RESULT_CANCELED 2
+
+/**
+ * The timer cancellation request was denied, since the timer was
+ * already marked as canceled.
+ */
+#define RTE_HTIMER_MGR_ASYNC_RESULT_ALREADY_CANCELED 3
+
+/**
+ * At the time the cancellation request was processed on the target
+ * lcore, the timer had already expired.
+ */
+#define RTE_HTIMER_MGR_ASYNC_RESULT_EXPIRED 4
+
+typedef void (*rte_htimer_mgr_async_op_cb_t)(struct rte_htimer *timer,
+ int result, void *cb_arg);
+
+/**
+ * Initialize the htimer library.
+ *
+ * Instantiates per-lcore (or per-registered non-EAL thread) timer
+ * wheels and other htimer library data structures, for all current
+ * and future threads.
+ *
+ * This function must be called prior to any other <rte_htimer.h> API
+ * call.
+ *
+ * This function may not be called if the htimer library is already
+ * initialized, but may be called multiple times, provided the library
+ * is deinitialized in between rte_htimer_mgr_init() calls.
+ *
+ * For applications not using TSC as the time source, the \c ns_per_tick
+ * parameter will denote the number of such application time-source-units
+ * per tick.
+ *
+ * This function is not multi-thread safe.
+ *
+ * @param ns_per_tick
+ * The length (in nanoseconds) of a timer wheel tick.
+ *
+ * @return
+ * - 0: Success
+ * - -ENOMEM: Unable to allocate memory needed to initialize timer
+ * subsystem
+ *
+ * @see rte_htimer_mgr_deinit()
+ * @see rte_get_tsc_hz()
+ */
+
+__rte_experimental
+int
+rte_htimer_mgr_init(uint64_t ns_per_tick);
+
+/**
+ * Deinitialize the htimer library.
+ *
+ * This function deallocates all dynamic memory used by the library,
+ * including HTW instances used by other threads than the caller.
+ *
+ * After this call has been made, no <rte_htimer.h> API call may be
+ * made, except rte_htimer_mgr_init().
+ *
+ * This function may not be called if the htimer library has never
+ * been initialized, or has been deinitialized but not yet
+ * initialized again.
+ *
+ * This function is not multi-thread safe. In particular, no thread
+ * may call any <rte_htimer.h> functions (e.g., rte_htimer_mgr_manage())
+ * during (or after) the htimer library is deinitialized, except if it
+ * is initialized again.
+ *
+ * @see rte_htimer_mgr_init()
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_deinit(void);
+
+/**
+ * Adds a timer to the calling thread's timer wheel.
+ *
+ * This function schedules a timer on the calling thread's HTW.
+ *
+ * The \c timer_cb callback is called at a point when this thread
+ * calls rte_htimer_mgr_process(), rte_htimer_mgr_manage(), or
+ * rte_htimer_mgr_manage_time() and the expiration time has passed the
+ * current time (either as retrieved by rte_htimer_mgr_manage() or
+ * specified by the application in rte_htimer_mgr_manage_time()).
+ *
+ * The HTW tracks time in units of \c ticks, which are likely more
+ * coarse-grained than nanosecond or TSC resolution.
+ *
+ * By default, the \c expiration_time is interpreted as the number of
+ * nanoseconds into the future the timer should expire, relative
+ * to the last known current time, rounded up to the nearest tick.
+ * Thus, a timer with a certain expiration time may not expire even
+ * though this time was supplied in rte_htimer_mgr_manage_time(). The
+ * maximum error is the length of one tick (plus any delays caused by
+ * infrequent manage calls).
+ *
+ * If the \c RTE_HTIMER_FLAG_ABSOLUTE_TIME is set in \c flags, the
+ * expiration time is relative to time zero.
+ *
+ * If the \c RTE_HTIMER_FLAG_PERIODICAL flag is set, the timer is
+ * periodical, and will first expire at the time specified by
+ * the \c expiration_time, and thereafter at an interval as specified
+ * by the \c period parameter.
+ *
+ * An added timer may be canceled using rte_htimer_mgr_cancel() or
+ * rte_htimer_mgr_async_cancel().
+ *
+ * rte_htimer_mgr_add() is multi-thread safe, and may only be called
+ * from an EAL thread or a registered non-EAL thread.
+ *
+ * @param timer
+ * The chunk of memory used for managing this timer. This memory
+ * must not be read or written (or free'd) by the application until
+ * this timer has expired, or any cancellation attempts have
+ * completed.
+ * @param expiration_time
+ * The expiration time (in nanoseconds by default). For periodical
+ *   timers, this time represents the first expiration time.
+ * @param period
+ * The time in between periodic timer expirations (in nanoseconds by
+ *   default). Must be set to zero unless the
+ *   \c RTE_HTIMER_FLAG_PERIODICAL flag is set, in which case it must
+ *   be a positive integer.
+ * @param timer_cb
+ * The timer callback to be called upon timer expiration.
+ * @param timer_cb_arg
+ * A pointer which will be supplied back to the application in the
+ * timer callback call.
+ * @param flags
+ * A bitmask which may contain these flags:
+ * * \c RTE_HTIMER_FLAG_PERIODICAL
+ * * \c RTE_HTIMER_FLAG_ABSOLUTE_TIME
+ * * Either \c RTE_HTIMER_FLAG_TIME_TICK or \c RTE_HTIMER_FLAG_TIME_TSC
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_add(struct rte_htimer *timer, uint64_t expiration_time,
+ uint64_t period, rte_htimer_cb_t timer_cb,
+ void *timer_cb_arg, uint32_t flags);
+
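+To illustrate the intended use (a sketch only, not part of the patch;
+the callback name and the 10 ms timeout are made up), arming a one-shot
+timer on the calling thread's HTW could look like:

```c
#include <stddef.h>

#include <rte_htimer_mgr.h>

static void
timeout_cb(struct rte_htimer *timer, void *cb_arg)
{
	/* runs on this lcore, from within a manage/process call */
}

static void
arm_one_shot(struct rte_htimer *timer)
{
	/* expire ~10 ms from the last known time, rounded up to a
	 * tick; the default time unit (flags == 0) is nanoseconds */
	rte_htimer_mgr_add(timer, 10 * 1000 * 1000, 0, timeout_cb,
			   NULL, 0);
}
```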
+/**
+ * Cancel a timer scheduled in the calling thread's timer wheel.
+ *
+ * This function cancels a timer scheduled on the calling thread's HTW.
+ *
+ * rte_htimer_mgr_cancel() may be called on a timer which has already
+ * (synchronously or asynchronously) been canceled, or may have expired.
+ * However, the \c rte_htimer struct pointed to by \c timer may not
+ * have been freed or recycled since.
+ *
+ * rte_htimer_mgr_cancel() may not be called for a timer that was
+ * never (or, not yet) added.
+ *
+ * A timer added using rte_htimer_mgr_async_add() may not be canceled
+ * using this function until after the add operation has completed
+ * (i.e., the completion callback has been run).
+ *
+ * rte_htimer_mgr_cancel() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ *
+ * @param timer
+ * The timer to be canceled.
+ * @return
+ * - 0: Success
+ * - -ETIME: Timer has expired, and thus could not be canceled.
+ * - -ENOENT: Timer was already canceled.
+ */
+
+__rte_experimental
+int
+rte_htimer_mgr_cancel(struct rte_htimer *timer);
+
+/**
+ * Asynchronously add a timer to the specified lcore's timer wheel.
+ *
+ * This function is the equivalent of rte_htimer_mgr_add(), but allows
+ * the calling ("source") thread to schedule a timer in an HTW other
+ * than its own. The operation is asynchronous.
+ *
+ * The timer works the same as a timer added locally. Thus, the \c
+ * timer_cb callback is called by the target thread, and it may be
+ * canceled using rte_htimer_mgr_cancel().
+ *
+ * The source thread may be the same as the target thread.
+ *
+ * Only EAL threads or registered non-EAL thread may be targeted.
+ *
+ * A successful rte_htimer_mgr_async_add() call guarantees that the
+ * timer will be scheduled on the target lcore at some future time,
+ * provided the target thread calls rte_htimer_mgr_process(),
+ * rte_htimer_mgr_manage(), or rte_htimer_mgr_manage_time().
+ *
+ * The \c async_cb callback is called on the source thread as a part
+ * of its rte_htimer_mgr_process(), rte_htimer_mgr_manage(), or
+ * rte_htimer_mgr_manage_time() call, when the asynchronous add
+ * operation has completed (i.e., the timer is scheduled in the target
+ * HTW).
+ *
+ * \c async_cb may be NULL, in which case no notification is given.
+ *
+ * An asynchronously added timer may be asynchronously canceled (i.e.,
+ * using rte_htimer_mgr_async_cancel()) at any point, by any thread,
+ * after the rte_htimer_mgr_async_add() call. An asynchronously added
+ * timer may not be canceled using rte_htimer_mgr_cancel() until
+ * after the completion callback has been executed.
+ *
+ * rte_htimer_mgr_async_add() is multi-thread safe, and may only be called
+ * from an EAL thread or a registered non-EAL thread.
+ *
+ * @param timer
+ * The chunk of memory used for managing this timer. This memory
+ * must not be read or written (or free'd) by the application until
+ * this timer has expired, or any cancellation attempts have
+ * completed.
+ * @param target_lcore_id
+ *   The lcore id of the thread whose HTW will manage this timer.
+ * @param expiration_time
+ * The expiration time (measured in nanoseconds). For periodical
+ *   timers, this time represents the first expiration time.
+ * @param period
+ * The time in between periodic timer expirations (measured in
+ *   nanoseconds). Must be set to zero unless the
+ *   RTE_HTIMER_FLAG_PERIODICAL flag is set, in which case it must be
+ *   a positive integer.
+ * @param timer_cb
+ * The timer callback to be called upon timer expiration.
+ * @param timer_cb_arg
+ * A pointer which will be supplied back to the application in the
+ * timer callback call.
+ * @param async_cb
+ *   The asynchronous operation callback to be called when the
+ *   add operation has completed.
+ * @param async_cb_arg
+ * A pointer which will be supplied back to the application in the
+ * \c async_cb callback call.
+ * @param flags
+ * RTE_HTIMER_FLAG_ABSOLUTE_TIME and/or RTE_HTIMER_FLAG_PERIODICAL.
+ * @return
+ * - 0: Success
+ * - -EBUSY: The maximum number of concurrently queued asynchronous
+ * operations has been reached.
+ */
+
+__rte_experimental
+int
+rte_htimer_mgr_async_add(struct rte_htimer *timer,
+ unsigned int target_lcore_id,
+ uint64_t expiration_time, uint64_t period,
+ rte_htimer_cb_t timer_cb, void *timer_cb_arg,
+ uint32_t flags,
+ rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg);
+
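+Correspondingly, scheduling a timer on another lcore's HTW might be
+sketched as follows (function names are hypothetical; the completion
+callback signature is assumed to match rte_htimer_mgr_async_op_cb_t as
+used elsewhere in this patch):

```c
#include <errno.h>
#include <stddef.h>

#include <rte_htimer_mgr.h>

static void
remote_timeout_cb(struct rte_htimer *timer, void *cb_arg)
{
	/* runs on the target lcore */
}

static void
add_done_cb(struct rte_htimer *timer, int result, void *cb_arg)
{
	/* runs on the source lcore, from one of its manage/process
	 * calls, once the timer is scheduled in the target HTW */
}

static int
arm_on_lcore(struct rte_htimer *timer, unsigned int target_lcore_id)
{
	int rc;

	rc = rte_htimer_mgr_async_add(timer, target_lcore_id,
				      10 * 1000 * 1000, 0,
				      remote_timeout_cb, NULL, 0,
				      add_done_cb, NULL);

	/* -EBUSY: too many queued async operations; the source thread
	 * may call rte_htimer_mgr_process() and then retry */
	return rc;
}
```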
+/**
+ * Asynchronously cancel a timer in any thread's timer wheel.
+ *
+ * This function is the equivalent of rte_htimer_mgr_cancel(), but
+ * allows the calling ("source") thread to also cancel a timer in an
+ * HTW other than its own. The operation is asynchronous.
+ *
+ * A thread may asynchronously cancel a timer scheduled on its own
+ * HTW.
+ *
+ * The \c async_cb callback is called on the source thread as a part
+ * of its rte_htimer_mgr_process(), rte_htimer_mgr_manage(), or
+ * rte_htimer_mgr_manage_time() call, when the asynchronous cancel
+ * operation has completed (i.e., the timer has been removed from the
+ * target HTW).
+ *
+ * \c async_cb may be NULL, in which case no notification is given.
+ *
+ * A timer may be asynchronously canceled at any point, by any thread,
+ * after it has been either synchronously or asynchronously added.
+ *
+ * rte_htimer_mgr_async_cancel() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ *
+ * @param timer
+ * The memory used for managing this timer. This memory must not be
+ * read or written (or free'd) by the application until this timer
+ * has expired, or any cancellation attempts have completed.
+ * @param async_cb
+ *   The asynchronous operation callback to be called when the
+ *   cancel operation has completed.
+ * @param async_cb_arg
+ * A pointer which will be supplied back to the application in the
+ * \c async_cb callback call.
+ * @return
+ * - 0: Success
+ * - -EBUSY: The maximum number of concurrently queued asynchronous
+ * operations has been reached.
+ */
+
+__rte_experimental
+int
+rte_htimer_mgr_async_cancel(struct rte_htimer *timer,
+ rte_htimer_mgr_async_op_cb_t async_cb,
+ void *async_cb_arg);
+
+/**
+ * Update HTW time and perform timer expiry and asynchronous operation
+ * processing.
+ *
+ * This function is the equivalent of retrieving the current TSC time,
+ * and calling rte_htimer_mgr_manage_time().
+ *
+ * rte_htimer_mgr_manage() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_manage(void);
+
+/**
+ * Progress HTW time, and perform timer expiry and asynchronous
+ * operation processing in the process.
+ *
+ * This function progresses the calling thread's HTW up to the point
+ * specified by \c current_time, calling the callbacks of any expired
+ * timers.
+ *
+ * The time source must be a monotonic clock, and thus each new
+ * \c current_time must be equal to or greater than the time supplied
+ * in the previous call.
+ *
+ * The precision of timers scheduled on a particular thread's HTW
+ * depends on how frequently that thread calls this function.
+ *
+ * rte_htimer_mgr_manage_time() also performs asynchronous operation
+ * processing. See rte_htimer_mgr_process() for details.
+ *
+ * rte_htimer_mgr_manage_time() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ *
+ * @param current_time
+ * The current time (in nanoseconds, by default).
+ * @param flags
+ * Either \c RTE_HTIMER_FLAG_TIME_TICK or \c RTE_HTIMER_FLAG_TIME_TSC.
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_manage_time(uint64_t current_time, uint32_t flags);
+
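+A sketch of a worker loop driving the local HTW from CLOCK_MONOTONIC
+(a monotonic source, as required above; assumes the default nanosecond
+time unit, i.e., flags == 0; not part of the patch):

```c
#include <stdint.h>
#include <time.h>

#include <rte_htimer_mgr.h>

static uint64_t
monotonic_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);

	return (uint64_t)ts.tv_sec * 1000000000 + (uint64_t)ts.tv_nsec;
}

static void
worker_loop(void)
{
	for (;;) {
		/* ... application work ... */

		/* expiry precision depends on how often this is called */
		rte_htimer_mgr_manage_time(monotonic_ns(), 0);
	}
}
```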
+/**
+ * Perform asynchronous operation processing.
+ *
+ * rte_htimer_mgr_process() serves pending asynchronous add or cancel
+ * requests, and produces the necessary responses. The timer callbacks
+ * of any added timers which have already expired are called.
+ *
+ * This function also processes any received asynchronous operation
+ * response messages, and calls the associated asynchronous callbacks,
+ * if provided by the application.
+ *
+ * rte_htimer_mgr_process() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ */
+
+__rte_experimental
+void
+rte_htimer_mgr_process(void);
+
+/**
+ * Return the current local HTW time in nanoseconds.
+ *
+ * This function returns the most recent time provided by this thread,
+ * either via rte_htimer_mgr_manage_time(), or as sampled by
+ * rte_htimer_mgr_manage().
+ *
+ * The initial time, prior to any manage-calls, is 0.
+ *
+ * rte_htimer_mgr_current_time() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ */
+
+__rte_experimental
+uint64_t
+rte_htimer_mgr_current_time(void);
+
+/**
+ * Return the current local HTW time in ticks.
+ *
+ * This function returns the current time of the calling thread's
+ * HTW, in ticks. The tick is the current time, as provided by the
+ * application (via rte_htimer_mgr_manage_time()) or as retrieved
+ * (using rte_get_tsc_cycles() in rte_htimer_mgr_manage()), divided
+ * by the tick length (as provided in rte_htimer_mgr_init()).
+ *
+ * The initial time, prior to any manage-calls, is 0.
+ *
+ * rte_htimer_mgr_current_tick() is multi-thread safe, and may only be
+ * called from an EAL thread or a registered non-EAL thread.
+ */
+
+__rte_experimental
+uint64_t
+rte_htimer_mgr_current_tick(void);
+
+#endif
diff --git a/lib/htimer/rte_htimer_msg.h b/lib/htimer/rte_htimer_msg.h
new file mode 100644
index 0000000000..ceb106e263
--- /dev/null
+++ b/lib/htimer/rte_htimer_msg.h
@@ -0,0 +1,44 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTIMER_MSG_
+#define _RTE_HTIMER_MSG_
+
+#include <rte_htimer.h>
+
+typedef void (*rte_htimer_msg_async_op_cb_t)(struct rte_htimer *timer,
+ int result, void *cb_arg);
+
+enum rte_htimer_msg_type {
+ rte_htimer_msg_type_add_request,
+ rte_htimer_msg_type_add_response,
+ rte_htimer_msg_type_cancel_request,
+ rte_htimer_msg_type_cancel_response
+};
+
+struct rte_htimer_msg_request {
+ unsigned int source_lcore_id;
+};
+
+struct rte_htimer_msg_response {
+ int result;
+};
+
+struct rte_htimer_msg {
+ enum rte_htimer_msg_type msg_type;
+
+ struct rte_htimer *timer;
+
+ rte_htimer_msg_async_op_cb_t async_cb;
+ void *async_cb_arg;
+
+ union {
+ struct rte_htimer_msg_request request;
+ struct rte_htimer_msg_response response;
+ };
+};
+
+#endif
diff --git a/lib/htimer/rte_htimer_msg_ring.c b/lib/htimer/rte_htimer_msg_ring.c
new file mode 100644
index 0000000000..4019b7819a
--- /dev/null
+++ b/lib/htimer/rte_htimer_msg_ring.c
@@ -0,0 +1,18 @@
+#include "rte_htimer_msg_ring.h"
+
+struct rte_htimer_msg_ring *
+rte_htimer_msg_ring_create(const char *name, unsigned int count, int socket_id,
+ unsigned int flags)
+{
+ struct rte_ring *ring =
+ rte_ring_create_elem(name, sizeof(struct rte_htimer_msg),
+ count, socket_id, flags);
+
+ return (struct rte_htimer_msg_ring *)ring;
+}
+
+void
+rte_htimer_msg_ring_free(struct rte_htimer_msg_ring *msg_ring)
+{
+ rte_ring_free((struct rte_ring *)msg_ring);
+}
diff --git a/lib/htimer/rte_htimer_msg_ring.h b/lib/htimer/rte_htimer_msg_ring.h
new file mode 100644
index 0000000000..48fcc99189
--- /dev/null
+++ b/lib/htimer/rte_htimer_msg_ring.h
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTIMER_MSG_RING_
+#define _RTE_HTIMER_MSG_RING_
+
+#include <rte_ring.h>
+
+#include "rte_htimer_msg.h"
+
+struct rte_htimer_msg_ring {
+ struct rte_ring ring;
+};
+
+struct rte_htimer_msg_ring *
+rte_htimer_msg_ring_create(const char *name, unsigned int count, int socket_id,
+ unsigned int flags);
+
+void
+rte_htimer_msg_ring_free(struct rte_htimer_msg_ring *msg_ring);
+
+static inline int
+rte_htimer_msg_ring_empty(struct rte_htimer_msg_ring *msg_ring)
+{
+ return rte_ring_empty(&msg_ring->ring);
+}
+
+static inline unsigned int
+rte_htimer_msg_ring_dequeue_burst(struct rte_htimer_msg_ring *msg_ring,
+ struct rte_htimer_msg *msgs,
+ unsigned int n)
+{
+ unsigned int dequeued;
+
+ dequeued = rte_ring_dequeue_burst_elem(&msg_ring->ring, msgs,
+ sizeof(struct rte_htimer_msg),
+ n, NULL);
+
+ return dequeued;
+}
+
+static inline unsigned int
+rte_htimer_msg_ring_enqueue(struct rte_htimer_msg_ring *msg_ring,
+ struct rte_htimer_msg *msg)
+{
+ int rc;
+
+ rc = rte_ring_enqueue_elem(&msg_ring->ring, msg,
+ sizeof(struct rte_htimer_msg));
+
+ return rc;
+}
+
+#endif
diff --git a/lib/htimer/rte_htw.c b/lib/htimer/rte_htw.c
new file mode 100644
index 0000000000..67fcb8c623
--- /dev/null
+++ b/lib/htimer/rte_htw.c
@@ -0,0 +1,445 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+/*
+ * This is an implementation of a hierarchical timer wheel based on
+ * Hashed and Hierarchical Timing Wheels: Data Structures
+ * for the Efficient Implementation of a Timer Facility by Varghese and
+ * Lauck.
+ *
+ * To improve efficiency when the slots are sparsely populated (i.e.,
+ * many ticks do not have any timers), each slot is represented by a
+ * bit in a separately-managed, per-wheel, bitset. This allows for
+ * very efficient scanning. The cost of managing this bitset is small.
+ */
+
+#include <rte_bitset.h>
+#include <rte_branch_prediction.h>
+#include <rte_debug.h>
+#include <rte_errno.h>
+#include <rte_malloc.h>
+
+#include "rte_htw.h"
+
+#define TICK_BITS 64
+
+#define WHEEL_BITS 8
+#define WHEEL_SLOTS (1U << WHEEL_BITS)
+#define WHEEL_LEVELS (TICK_BITS / WHEEL_BITS)
+
+struct wheel {
+ uint64_t wheel_time;
+ RTE_BITSET_DECLARE(used_slots, WHEEL_SLOTS);
+ struct rte_htimer_list slots[WHEEL_SLOTS];
+};
+
+struct rte_htw {
+ uint64_t current_time;
+
+ struct wheel wheels[WHEEL_LEVELS];
+
+ struct rte_htimer_list added;
+ struct rte_htimer_list expiring;
+
+ struct rte_htimer *running_timer;
+};
+
+static uint64_t
+time_to_wheel_time(uint64_t t, uint16_t level)
+{
+ return t >> (level * WHEEL_BITS);
+}
+
+static uint64_t
+wheel_time_to_time(uint64_t wheel_time, uint16_t level)
+{
+ return wheel_time << (level * WHEEL_BITS);
+}
+
+static void
+wheel_init(struct wheel *wheel)
+{
+ uint16_t i;
+
+ wheel->wheel_time = 0;
+
+ rte_bitset_init(wheel->used_slots, WHEEL_SLOTS);
+
+ for (i = 0; i < WHEEL_SLOTS; i++)
+ LIST_INIT(&wheel->slots[i]);
+}
+
+static uint64_t
+list_next_timeout(struct rte_htimer_list *timers)
+{
+ struct rte_htimer *timer;
+ uint64_t candidate = UINT64_MAX;
+
+ LIST_FOREACH(timer, timers, entry)
+ candidate = RTE_MIN(timer->expiration_time, candidate);
+
+ return candidate;
+}
+
+static uint16_t
+wheel_time_to_slot(uint64_t wheel_time)
+{
+ return wheel_time % WHEEL_SLOTS;
+}
+
+static uint64_t
+wheel_current_slot_time(struct wheel *wheel, uint16_t level)
+{
+ return wheel->wheel_time << (level * WHEEL_BITS);
+}
+
+static uint64_t
+wheel_next_timeout(struct wheel *wheel, uint16_t level, uint64_t upper_bound)
+{
+ uint16_t start_slot;
+ ssize_t slot;
+
+ start_slot = wheel_current_slot_time(wheel, level);
+
+ if (wheel_time_to_time(wheel->wheel_time, level) >= upper_bound)
+ return upper_bound;
+
+ RTE_BITSET_FOREACH_SET_WRAP(slot, wheel->used_slots, WHEEL_SLOTS,
+ (ssize_t)start_slot, WHEEL_SLOTS) {
+ struct rte_htimer_list *timers = &wheel->slots[slot];
+ uint64_t next_timeout;
+
+ next_timeout = list_next_timeout(timers);
+
+ if (next_timeout != UINT64_MAX)
+ return next_timeout;
+ }
+
+ return UINT64_MAX;
+}
+
+static uint16_t
+get_slot(uint64_t t, uint16_t level)
+{
+ uint64_t wheel_time;
+ uint16_t slot;
+
+ wheel_time = time_to_wheel_time(t, level);
+ slot = wheel_time_to_slot(wheel_time);
+
+ return slot;
+}
+
+struct rte_htw *
+rte_htw_create(void)
+{
+ struct rte_htw *htw;
+ uint16_t level;
+
+ RTE_BUILD_BUG_ON((TICK_BITS % WHEEL_BITS) != 0);
+ RTE_BUILD_BUG_ON(sizeof(uint16_t) * CHAR_BIT <= WHEEL_BITS);
+
+ htw = rte_malloc(NULL, sizeof(struct rte_htw), RTE_CACHE_LINE_SIZE);
+
+ if (htw == NULL) {
+ rte_errno = ENOMEM;
+ return NULL;
+ }
+
+ htw->current_time = 0;
+
+ LIST_INIT(&htw->added);
+ LIST_INIT(&htw->expiring);
+
+ for (level = 0; level < WHEEL_LEVELS; level++)
+ wheel_init(&htw->wheels[level]);
+
+ return htw;
+}
+
+void
+rte_htw_destroy(struct rte_htw *htw)
+{
+ rte_free(htw);
+}
+
+static uint16_t
+get_level(uint64_t remaining_time)
+{
+ int last_set = 64 - __builtin_clzll(remaining_time);
+
+ return (last_set - 1) / WHEEL_BITS;
+}
+
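+The level arithmetic above can be illustrated standalone (a sketch
+mirroring this file's constants, not part of the patch): level 0
+covers remaining times of 1..255 ticks, level 1 covers 256..65535
+ticks, and so on; the slot within a wheel is the corresponding 8-bit
+"digit" of the expiration time.

```c
#include <stdint.h>

#define DEMO_WHEEL_BITS 8
#define DEMO_WHEEL_SLOTS (1U << DEMO_WHEEL_BITS)

/* The level is the index of the highest-order non-zero 8-bit digit
 * of the remaining time. */
static inline uint16_t
demo_get_level(uint64_t remaining_time)
{
	int last_set = 64 - __builtin_clzll(remaining_time);

	return (last_set - 1) / DEMO_WHEEL_BITS;
}

/* The slot is the 8-bit digit of the expiration time at that level. */
static inline uint16_t
demo_get_slot(uint64_t t, uint16_t level)
{
	return (t >> (level * DEMO_WHEEL_BITS)) % DEMO_WHEEL_SLOTS;
}
```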
+static void
+mark_added(struct rte_htw *htw, struct rte_htimer *timer)
+{
+ timer->state = RTE_HTIMER_STATE_PENDING;
+ LIST_INSERT_HEAD(&htw->added, timer, entry);
+}
+
+static void
+assure_valid_add_params(uint64_t period __rte_unused,
+ uint32_t flags __rte_unused)
+{
+ RTE_ASSERT(!(flags & ~(RTE_HTIMER_FLAG_PERIODICAL |
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME)));
+ RTE_ASSERT(flags & RTE_HTIMER_FLAG_PERIODICAL ?
+ period > 0 : period == 0);
+}
+
+void
+rte_htw_add(struct rte_htw *htw, struct rte_htimer *timer,
+ uint64_t expiration_time, uint64_t period,
+ rte_htimer_cb_t timer_cb, void *timer_cb_arg, uint32_t flags)
+{
+ assure_valid_add_params(period, flags);
+
+ if (flags & RTE_HTIMER_FLAG_ABSOLUTE_TIME)
+ timer->expiration_time = expiration_time;
+ else
+ timer->expiration_time = htw->current_time + expiration_time;
+
+ timer->period = period;
+ timer->flags = flags;
+ timer->cb = timer_cb;
+ timer->cb_arg = timer_cb_arg;
+
+ mark_added(htw, timer);
+}
+
+void
+rte_htw_cancel(struct rte_htw *htw, struct rte_htimer *timer)
+{
+ /*
+ * One could consider clearing the relevant used_slots bit in
+ * case this was the last entry in the wheel's slot
+ * list. However, from a correctness point of view, a "false
+ * positive" is not an issue. From a performance perspective,
+ * checking the list head and clearing the bit is likely more
+ * expensive than just deferring a minor cost to a future
+ * rte_htw_manage() call.
+ */
+
+ RTE_ASSERT(timer->state == RTE_HTIMER_STATE_PENDING ||
+ timer->state == RTE_HTIMER_STATE_EXPIRED);
+
+ if (likely(timer->state == RTE_HTIMER_STATE_PENDING)) {
+ LIST_REMOVE(timer, entry);
+ timer->state = RTE_HTIMER_STATE_CANCELED;
+ } else if (timer == htw->running_timer) {
+ /* periodical timer being canceled by its own callback */
+ RTE_ASSERT(timer->flags & RTE_HTIMER_FLAG_PERIODICAL);
+
+ timer->state = RTE_HTIMER_STATE_CANCELED;
+
+ /* signals running timer canceled */
+ htw->running_timer = NULL;
+ }
+}
+
+static void
+mark_expiring(struct rte_htw *htw, struct rte_htimer *timer)
+{
+ LIST_INSERT_HEAD(&htw->expiring, timer, entry);
+}
+
+static void
+schedule_timer(struct rte_htw *htw, struct rte_htimer *timer)
+{
+ uint64_t remaining_time;
+ uint16_t level;
+ struct wheel *wheel;
+ uint16_t slot;
+ struct rte_htimer_list *slot_timers;
+
+ remaining_time = timer->expiration_time - htw->current_time;
+
+ level = get_level(remaining_time);
+
+ wheel = &htw->wheels[level];
+
+ slot = get_slot(timer->expiration_time, level);
+
+ slot_timers = &htw->wheels[level].slots[slot];
+
+ LIST_INSERT_HEAD(slot_timers, timer, entry);
+
+ rte_bitset_set(wheel->used_slots, slot);
+}
+
+static void
+process_added(struct rte_htw *htw)
+{
+ struct rte_htimer *timer;
+
+ while ((timer = LIST_FIRST(&htw->added)) != NULL) {
+ LIST_REMOVE(timer, entry);
+
+ if (timer->expiration_time > htw->current_time)
+ schedule_timer(htw, timer);
+ else
+ mark_expiring(htw, timer);
+ }
+}
+
+static void
+process_expiring(struct rte_htw *htw)
+{
+ struct rte_htimer *timer;
+
+ while ((timer = LIST_FIRST(&htw->expiring)) != NULL) {
+ bool is_periodical;
+ bool running_timer_canceled;
+
+ /*
+ * The timer struct cannot be safely accessed
+ * after the callback has been called (except for
+ * non-canceled periodical timers), since the callback
+ * may have free'd (or reused) the memory.
+ */
+
+ LIST_REMOVE(timer, entry);
+
+ is_periodical = timer->flags & RTE_HTIMER_FLAG_PERIODICAL;
+
+ timer->state = RTE_HTIMER_STATE_EXPIRED;
+
+ htw->running_timer = timer;
+
+ timer->cb(timer, timer->cb_arg);
+
+ running_timer_canceled = htw->running_timer == NULL;
+
+ htw->running_timer = NULL;
+
+ if (is_periodical && !running_timer_canceled) {
+ timer->expiration_time += timer->period;
+ mark_added(htw, timer);
+ }
+ }
+}
+
+uint64_t
+rte_htw_current_time(struct rte_htw *htw)
+{
+ return htw->current_time;
+}
+
+uint64_t
+rte_htw_next_timeout(struct rte_htw *htw, uint64_t upper_bound)
+{
+ uint16_t level;
+
+ /* scheduling timeouts will sort them in temporal order */
+ process_added(htw);
+
+ if (!LIST_EMPTY(&htw->expiring))
+ return 0;
+
+ for (level = 0; level < WHEEL_LEVELS; level++) {
+ uint64_t wheel_timeout;
+
+ wheel_timeout = wheel_next_timeout(&htw->wheels[level],
+ level, upper_bound);
+ if (wheel_timeout != UINT64_MAX)
+ return RTE_MIN(wheel_timeout, upper_bound);
+ }
+
+ return upper_bound;
+}
+
+static __rte_always_inline void
+process_slot(struct rte_htw *htw, uint16_t level, struct wheel *wheel,
+ uint16_t slot)
+{
+ struct rte_htimer_list *slot_timers;
+ struct rte_htimer *timer;
+
+ slot_timers = &wheel->slots[slot];
+
+ rte_bitset_clear(wheel->used_slots, slot);
+
+ while ((timer = LIST_FIRST(slot_timers)) != NULL) {
+ LIST_REMOVE(timer, entry);
+
+ if (level == 0 || timer->expiration_time <= htw->current_time)
+ mark_expiring(htw, timer);
+ else
+ schedule_timer(htw, timer);
+ }
+}
+
+static __rte_always_inline void
+process_slots(struct rte_htw *htw, uint16_t level, struct wheel *wheel,
+ uint16_t start_slot, uint16_t num_slots)
+{
+ ssize_t slot;
+
+ RTE_BITSET_FOREACH_SET_WRAP(slot, wheel->used_slots, WHEEL_SLOTS,
+ (ssize_t)start_slot, num_slots)
+ process_slot(htw, level, wheel, slot);
+}
+
+static void
+advance(struct rte_htw *htw)
+{
+ uint16_t level;
+
+ for (level = 0; level < WHEEL_LEVELS; level++) {
+ struct wheel *wheel = &htw->wheels[level];
+ uint64_t new_wheel_time;
+ uint16_t start_slot;
+ uint16_t num_slots;
+
+ new_wheel_time = time_to_wheel_time(htw->current_time, level);
+
+ if (new_wheel_time == wheel->wheel_time)
+ break;
+
+ start_slot = wheel_time_to_slot(wheel->wheel_time + 1);
+ num_slots = RTE_MIN(new_wheel_time - wheel->wheel_time,
+ WHEEL_SLOTS);
+
+ wheel->wheel_time = new_wheel_time;
+
+ process_slots(htw, level, wheel, start_slot, num_slots);
+ }
+}
+
+void
+rte_htw_manage(struct rte_htw *htw, uint64_t new_time)
+{
+ RTE_VERIFY(new_time >= htw->current_time);
+
+ /*
+ * Scheduling added timers, core timer wheel processing, and
+ * expiry callback execution are kept as separate stages, to
+ * avoid having the core wheel traversal code deal with a
+ * timeout callback re-adding the timer. This split also
+ * results in reasonable semantics for callbacks which add
+ * already-expired timeouts (e.g., with time 0): instead of
+ * creating an endless loop, with rte_htw_manage() never
+ * returning, the execution of such a timer is deferred until
+ * the next rte_htw_manage() call.
+ */
+
+ process_added(htw);
+
+ if (new_time > htw->current_time) {
+ htw->current_time = new_time;
+ advance(htw);
+ }
+
+ process_expiring(htw);
+}
+
+void
+rte_htw_process(struct rte_htw *htw)
+{
+ process_added(htw);
+ process_expiring(htw);
+}
diff --git a/lib/htimer/rte_htw.h b/lib/htimer/rte_htw.h
new file mode 100644
index 0000000000..c93358bb13
--- /dev/null
+++ b/lib/htimer/rte_htw.h
@@ -0,0 +1,49 @@
+/* SPDX-License-Identifier: BSD-3-Clause
+ * Copyright(c) 2023 Ericsson AB
+ */
+
+#ifndef _RTE_HTW_H_
+#define _RTE_HTW_H_
+
+#include <stdint.h>
+#include <sys/queue.h>
+
+#include <rte_htimer.h>
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+struct rte_htw;
+
+struct rte_htw *
+rte_htw_create(void);
+
+void
+rte_htw_destroy(struct rte_htw *htw);
+
+void
+rte_htw_add(struct rte_htw *htw, struct rte_htimer *timer,
+ uint64_t expiration_time, uint64_t period,
+ rte_htimer_cb_t cb, void *cb_arg, uint32_t flags);
+
+void
+rte_htw_cancel(struct rte_htw *htw, struct rte_htimer *timer);
+
+uint64_t
+rte_htw_current_time(struct rte_htw *htw);
+
+uint64_t
+rte_htw_next_timeout(struct rte_htw *htw, uint64_t upper_bound);
+
+void
+rte_htw_manage(struct rte_htw *htw, uint64_t new_time);
+
+void
+rte_htw_process(struct rte_htw *htw);
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* _RTE_HTW_H_ */
diff --git a/lib/htimer/version.map b/lib/htimer/version.map
new file mode 100644
index 0000000000..0e71dc7d57
--- /dev/null
+++ b/lib/htimer/version.map
@@ -0,0 +1,17 @@
+EXPERIMENTAL {
+ global:
+
+ rte_htimer_mgr_init;
+ rte_htimer_mgr_deinit;
+ rte_htimer_mgr_add;
+ rte_htimer_mgr_cancel;
+ rte_htimer_mgr_async_add;
+ rte_htimer_mgr_async_cancel;
+ rte_htimer_mgr_manage;
+ rte_htimer_mgr_manage_time;
+ rte_htimer_mgr_process;
+ rte_htimer_mgr_current_time;
+ rte_htimer_mgr_current_tick;
+
+ local: *;
+};
diff --git a/lib/meson.build b/lib/meson.build
index 2bc0932ad5..c7c0e42ae8 100644
--- a/lib/meson.build
+++ b/lib/meson.build
@@ -37,6 +37,7 @@ libraries = [
'gpudev',
'gro',
'gso',
+ 'htimer',
'ip_frag',
'jobstats',
'kni',
--
2.34.1
* Re: [RFC v2 1/2] eal: add bitset type
2023-03-15 17:03 ` [RFC v2 1/2] eal: add bitset type Mattias Rönnblom
@ 2023-03-15 17:20 ` Stephen Hemminger
2023-03-15 18:27 ` Mattias Rönnblom
0 siblings, 1 reply; 31+ messages in thread
From: Stephen Hemminger @ 2023-03-15 17:20 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Morten Brørup, Tyler Retzlaff
On Wed, 15 Mar 2023 18:03:41 +0100
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> Introduce a set of functions and macros that operate on sets of bits,
> kept in arrays of 64-bit elements.
>
> RTE bitset is designed for bitsets which are larger than what fits in
> a single machine word (i.e., 64 bits). For very large bitsets, the
> <rte_bitmap.h> API may be a more appropriate choice.
>
> RFC v2:
> * Replaced <sys/types.h> with <stddef.h> include, to properly get
> size_t typedef.
> * Add <rte_compat.h> to get __rte_experimental in <rte_bitset.h>.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
It would be good to have atomic version of these routines.
Other libraries could use that.
* Re: [RFC v2 1/2] eal: add bitset type
2023-03-15 17:20 ` Stephen Hemminger
@ 2023-03-15 18:27 ` Mattias Rönnblom
0 siblings, 0 replies; 31+ messages in thread
From: Mattias Rönnblom @ 2023-03-15 18:27 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, Erik Gabriel Carrillo, David Marchand, Maria Lingemark,
Stefan Sundkvist, Morten Brørup, Tyler Retzlaff, Joyce Kong
On 2023-03-15 18:20, Stephen Hemminger wrote:
> On Wed, 15 Mar 2023 18:03:41 +0100
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>
>> Introduce a set of functions and macros that operate on sets of bits,
>> kept in arrays of 64-bit elements.
>>
>> RTE bitset is designed for bitsets which are larger than what fits in
>> a single machine word (i.e., 64 bits). For very large bitsets, the
>> <rte_bitmap.h> API may be a more appropriate choice.
>>
>> RFC v2:
>> * Replaced <sys/types.h> with <stddef.h> include, to properly get
>> size_t typedef.
>> * Add <rte_compat.h> to get __rte_experimental in <rte_bitset.h>.
>>
>> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
>
> It would be good to have atomic version of these routines.
> Other libraries could use that.
Atomic with the option to specify memory model, like in C11? Or just
atomic, with no implications regarding memory ordering (i.e.,
__ATOMIC_RELAXED).
It seems to me that the implementation of such an API would require
inline assembler (e.g., lock bts on x86) to do properly (no loops, no
locks).
Related: reading <rte_bitops.h> left me a little confused. The
documentation and naming suggest the bit access operations are
"relaxed", which I'm sure is true, but may leave the user believing the
bit operations are atomic - which they aren't. Why else would you
mention that the operations are relaxed?
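(For reference, a loop-free, lock-free test-and-set can be expressed
with the GCC/Clang __atomic builtins; this is a sketch, not part of
the patchset. With a single-bit mask whose bit is then tested, x86
compilers can typically lower this to a single lock-prefixed
instruction, without a CAS loop.)

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Atomically test-and-set bit 'bit_num' in a multi-word bitset and
 * return the bit's previous value. __ATOMIC_RELAXED makes the
 * operation atomic but implies no memory ordering. */
static inline bool
demo_bitset_atomic_test_and_set(uint64_t *bitset, size_t bit_num)
{
	uint64_t *word = &bitset[bit_num / 64];
	uint64_t mask = UINT64_C(1) << (bit_num % 64);
	uint64_t prev = __atomic_fetch_or(word, mask, __ATOMIC_RELAXED);

	return (prev & mask) != 0;
}
```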
* Re: [RFC v2 2/2] eal: add high-performance timer facility
2023-03-15 17:03 ` [RFC v2 2/2] eal: add high-performance timer facility Mattias Rönnblom
@ 2023-03-16 3:55 ` Tyler Retzlaff
2023-03-17 1:58 ` Stephen Hemminger
` (3 subsequent siblings)
4 siblings, 0 replies; 31+ messages in thread
From: Tyler Retzlaff @ 2023-03-16 3:55 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Stephen Hemminger, Morten Brørup
On Wed, Mar 15, 2023 at 06:03:42PM +0100, Mattias Rönnblom wrote:
> The htimer library attempts at providing a timer facility with roughly
> the same functionality, but less overhead and better scalability than
> DPDK timer library.
>
> The htimer library employs per-lcore hierarchical timer wheels and a
> message-based synchronization/MT-safety scheme.
>
> RFC v2:
> * Fix spelling.
> * Fix signed/unsigned comparisons and discontinue the use of name-less
> function parameters, both of which may result in compiler warnings.
> * Undo the accidental removal of the bitset tests from the 'fast_tests'.
> * Add a number of missing include files, causing build failures
> (e.g., on AArch64 builds).
> * Add perf test attempting to compare rte_timer, rte_htimer and rte_htw.
> * Use nanoseconds (instead of TSC) as the default time unit.
> * add() and manage() has flags which allows the caller to specify the
> time unit (nanoseconds, TSC, or ticks) for the times provided.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
> app/test/meson.build | 8 +
> app/test/test_htimer_mgr.c | 674 +++++++++++++++++++++++++
> app/test/test_htimer_mgr_perf.c | 322 ++++++++++++
> app/test/test_htw.c | 478 ++++++++++++++++++
> app/test/test_htw_perf.c | 181 +++++++
> app/test/test_timer_htimer_htw_perf.c | 693 ++++++++++++++++++++++++++
> doc/api/doxy-api-index.md | 5 +-
> doc/api/doxy-api.conf.in | 1 +
> lib/htimer/meson.build | 7 +
> lib/htimer/rte_htimer.h | 68 +++
> lib/htimer/rte_htimer_mgr.c | 547 ++++++++++++++++++++
> lib/htimer/rte_htimer_mgr.h | 516 +++++++++++++++++++
> lib/htimer/rte_htimer_msg.h | 44 ++
> lib/htimer/rte_htimer_msg_ring.c | 18 +
> lib/htimer/rte_htimer_msg_ring.h | 55 ++
> lib/htimer/rte_htw.c | 445 +++++++++++++++++
> lib/htimer/rte_htw.h | 49 ++
> lib/htimer/version.map | 17 +
> lib/meson.build | 1 +
> 19 files changed, 4128 insertions(+), 1 deletion(-)
> create mode 100644 app/test/test_htimer_mgr.c
> create mode 100644 app/test/test_htimer_mgr_perf.c
> create mode 100644 app/test/test_htw.c
> create mode 100644 app/test/test_htw_perf.c
> create mode 100644 app/test/test_timer_htimer_htw_perf.c
> create mode 100644 lib/htimer/meson.build
> create mode 100644 lib/htimer/rte_htimer.h
> create mode 100644 lib/htimer/rte_htimer_mgr.c
> create mode 100644 lib/htimer/rte_htimer_mgr.h
> create mode 100644 lib/htimer/rte_htimer_msg.h
> create mode 100644 lib/htimer/rte_htimer_msg_ring.c
> create mode 100644 lib/htimer/rte_htimer_msg_ring.h
> create mode 100644 lib/htimer/rte_htw.c
> create mode 100644 lib/htimer/rte_htw.h
> create mode 100644 lib/htimer/version.map
>
> diff --git a/app/test/meson.build b/app/test/meson.build
> index 03811ff692..d0308ac09d 100644
> --- a/app/test/meson.build
> +++ b/app/test/meson.build
> @@ -140,9 +140,14 @@ test_sources = files(
> 'test_thash_perf.c',
> 'test_threads.c',
> 'test_timer.c',
> + 'test_timer_htimer_htw_perf.c',
> 'test_timer_perf.c',
> 'test_timer_racecond.c',
> 'test_timer_secondary.c',
> + 'test_htimer_mgr.c',
> + 'test_htimer_mgr_perf.c',
> + 'test_htw.c',
> + 'test_htw_perf.c',
> 'test_ticketlock.c',
> 'test_trace.c',
> 'test_trace_register.c',
> @@ -193,6 +198,7 @@ fast_tests = [
> ['fib6_autotest', true, true],
> ['func_reentrancy_autotest', false, true],
> ['hash_autotest', true, true],
> + ['htimer_mgr_autotest', true, true],
> ['interrupt_autotest', true, true],
> ['ipfrag_autotest', false, true],
> ['lcores_autotest', true, true],
> @@ -265,6 +271,8 @@ perf_test_names = [
> 'memcpy_perf_autotest',
> 'hash_perf_autotest',
> 'timer_perf_autotest',
> + 'htimer_mgr_perf_autotest',
> + 'htw_perf_autotest',
> 'reciprocal_division',
> 'reciprocal_division_perf',
> 'lpm_perf_autotest',
> diff --git a/app/test/test_htimer_mgr.c b/app/test/test_htimer_mgr.c
> new file mode 100644
> index 0000000000..9e46dec53e
> --- /dev/null
> +++ b/app/test/test_htimer_mgr.c
> @@ -0,0 +1,674 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Ericsson AB
> + */
> +
> +#include "test.h"
> +
> +#include <sys/queue.h>
> +#include <stdlib.h>
> +#include <inttypes.h>
> +
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_htimer_mgr.h>
> +#include <rte_launch.h>
> +#include <rte_lcore.h>
> +#include <rte_random.h>
> +
> +static int
> +timer_lcore(void *arg)
> +{
> + bool *stop = arg;
> +
> + while (!__atomic_load_n(stop, __ATOMIC_RELAXED))
> + rte_htimer_mgr_manage();
> +
> + return 0;
> +}
> +
> +static void
> +count_timer_cb(struct rte_htimer *timer __rte_unused, void *arg)
> +{
> + unsigned int *count = arg;
> +
> + __atomic_fetch_add(count, 1, __ATOMIC_RELAXED);
> +}
> +
> +static void
> +count_async_cb(struct rte_htimer *timer __rte_unused, int result,
> + void *cb_arg)
> +{
> + unsigned int *count = cb_arg;
> +
> + if (result == RTE_HTIMER_MGR_ASYNC_RESULT_ADDED)
> + __atomic_fetch_add(count, 1, __ATOMIC_RELAXED);
> +}
> +
> +static uint64_t
> +s_to_tsc(double s)
> +{
> + return s * rte_get_tsc_hz();
> +}
> +
> +#define ASYNC_ADD_TEST_EXPIRATION_TIME (250*1000) /* ns */
> +#define ASYNC_TEST_TICK (10*1000) /* ns */
> +
> +static int
> +test_htimer_mgr_async_add(unsigned int num_timers_per_lcore)
> +{
> + struct rte_htimer *timers;
> + unsigned int timer_idx;
> + unsigned int lcore_id;
> + bool stop = false;
> + unsigned int timeout_count = 0;
> + unsigned int async_count = 0;
> + unsigned int num_workers = 0;
> + uint64_t expiration_time;
> + unsigned int num_total_timers;
> +
> + rte_htimer_mgr_init(ASYNC_TEST_TICK);
> +
> + RTE_LCORE_FOREACH_WORKER(lcore_id) {
> + if (rte_eal_remote_launch(timer_lcore, &stop, lcore_id) != 0)
> + rte_panic("Unable to launch timer lcore\n");
> + num_workers++;
> + }
> +
> + num_total_timers = num_workers * num_timers_per_lcore;
> +
> + timers = malloc(num_total_timers * sizeof(struct rte_htimer));
> + timer_idx = 0;
> +
> + if (timers == NULL)
> + rte_panic("Unable to allocate heap memory\n");
> +
> + expiration_time = ASYNC_ADD_TEST_EXPIRATION_TIME;
> +
> + RTE_LCORE_FOREACH_WORKER(lcore_id) {
> + unsigned int i;
> +
> + for (i = 0; i < num_timers_per_lcore; i++) {
> + struct rte_htimer *timer = &timers[timer_idx++];
> +
> + for (;;) {
> + int rc;
> +
> + rc = rte_htimer_mgr_async_add(timer, lcore_id,
> + expiration_time,
> + RTE_HTIMER_FLAG_TIME_TSC,
> + count_timer_cb,
> + &timeout_count, 0,
> + count_async_cb,
> + &async_count);
> + if (unlikely(rc == -EBUSY))
> + rte_htimer_mgr_process();
> + else
> + break;
> + }
> + }
> + }
> +
> + while (__atomic_load_n(&async_count, __ATOMIC_RELAXED) !=
> + num_total_timers ||
> + __atomic_load_n(&timeout_count, __ATOMIC_RELAXED) !=
> + num_total_timers)
> + rte_htimer_mgr_manage();
> +
> + __atomic_store_n(&stop, true, __ATOMIC_RELAXED);
> +
> + rte_eal_mp_wait_lcore();
> +
> + rte_htimer_mgr_deinit();
> +
> + free(timers);
> +
> + return TEST_SUCCESS;
> +}
> +
> +struct async_recorder_state {
> + bool timer_cb_run;
> + bool async_add_cb_run;
> + bool async_cancel_cb_run;
> + bool failed;
> +};
> +
> +static void
> +record_async_add_cb(struct rte_htimer *timer __rte_unused,
> + int result, void *cb_arg)
> +{
> + struct async_recorder_state *state = cb_arg;
> +
> + if (state->failed)
> + return;
> +
> + if (state->async_add_cb_run ||
> + result != RTE_HTIMER_MGR_ASYNC_RESULT_ADDED) {
> + puts("async add run already");
> + state->failed = true;
> + }
> +
> + state->async_add_cb_run = true;
> +}
> +
> +static void
> +record_async_cancel_cb(struct rte_htimer *timer __rte_unused,
> + int result, void *cb_arg)
> +{
> + struct async_recorder_state *state = cb_arg;
> +
> + if (state->failed)
> + return;
> +
> + if (state->async_cancel_cb_run) {
> + state->failed = true;
> + return;
> + }
> +
> + switch (result) {
> + case RTE_HTIMER_MGR_ASYNC_RESULT_EXPIRED:
> + if (!state->timer_cb_run)
> + state->failed = true;
> + break;
> + case RTE_HTIMER_MGR_ASYNC_RESULT_CANCELED:
> + if (state->timer_cb_run)
> + state->failed = true;
> + break;
> + case RTE_HTIMER_MGR_ASYNC_RESULT_ALREADY_CANCELED:
> + state->failed = true;
> + }
> +
> + state->async_cancel_cb_run = true;
> +}
> +
> +static int
> +record_check_consistency(struct async_recorder_state *state)
> +{
> + if (state->failed)
> + return -1;
> +
> + return state->async_cancel_cb_run ? 1 : 0;
> +}
> +
> +static int
> +records_check_consistency(struct async_recorder_state *states,
> + unsigned int num_states)
> +{
> + unsigned int i;
> + int canceled = 0;
> +
> + for (i = 0; i < num_states; i++) {
> + int rc;
> +
> + rc = record_check_consistency(&states[i]);
> +
> + if (rc < 0)
> + return -1;
> + canceled += rc;
> + }
> +
> + return canceled;
> +}
> +
> +static void
> +log_timer_expiry_cb(struct rte_htimer *timer __rte_unused,
> + void *arg)
> +{
> + bool *timer_run = arg;
> +
> + *timer_run = true;
> +}
> +
> +
> +#define ASYNC_ADD_CANCEL_TEST_EXPIRATION_TIME_MAX 10e-3 /* s */
> +
> +static int
> +test_htimer_mgr_async_add_cancel(unsigned int num_timers_per_lcore)
> +{
> + struct rte_htimer *timers;
> + struct async_recorder_state *recorder_states;
> + unsigned int timer_idx = 0;
> + unsigned int lcore_id;
> + uint64_t now;
> + unsigned int num_workers = 0;
> + bool stop = false;
> + uint64_t max_expiration_time =
> + s_to_tsc(ASYNC_ADD_CANCEL_TEST_EXPIRATION_TIME_MAX);
> + unsigned int num_total_timers;
> + int canceled = 0;
> +
> + rte_htimer_mgr_init(ASYNC_TEST_TICK);
> +
> + RTE_LCORE_FOREACH_WORKER(lcore_id) {
> + if (rte_eal_remote_launch(timer_lcore, &stop, lcore_id) != 0)
> + rte_panic("Unable to launch timer lcore\n");
> + num_workers++;
> + }
> +
> + num_total_timers = num_workers * num_timers_per_lcore;
> +
> + timers = malloc(num_total_timers * sizeof(struct rte_htimer));
> + recorder_states =
> + malloc(num_total_timers * sizeof(struct async_recorder_state));
> +
> + if (timers == NULL || recorder_states == NULL)
> + rte_panic("Unable to allocate heap memory\n");
> +
> + now = rte_get_tsc_cycles();
> +
> + RTE_LCORE_FOREACH_WORKER(lcore_id) {
> + unsigned int i;
> +
> + for (i = 0; i < num_timers_per_lcore; i++) {
> + struct rte_htimer *timer = &timers[timer_idx];
> + struct async_recorder_state *state =
> + &recorder_states[timer_idx];
> +
> + timer_idx++;
> +
> + *state = (struct async_recorder_state) {};
> +
> + uint64_t expiration_time =
> + now + rte_rand_max(max_expiration_time);
> +
> + for (;;) {
> + int rc;
> +
> + rc = rte_htimer_mgr_async_add(timer, lcore_id,
> + expiration_time,
> + 0,
> + log_timer_expiry_cb,
> + &state->timer_cb_run,
> + 0,
> + record_async_add_cb,
> + state);
> +
> + if (unlikely(rc == -EBUSY))
> + rte_htimer_mgr_process();
> + else
> + break;
> + }
> + }
> + }
> +
> + timer_idx = 0;
> +
> + RTE_LCORE_FOREACH_WORKER(lcore_id) {
> + unsigned int i;
> +
> + for (i = 0; i < num_timers_per_lcore; i++) {
> + struct rte_htimer *timer = &timers[timer_idx];
> + struct async_recorder_state *state =
> + &recorder_states[timer_idx];
> +
> + timer_idx++;
> +
> + /* cancel roughly half of the timers */
> + if (rte_rand_max(2) == 0)
> + continue;
> +
> + for (;;) {
> + int rc;
> +
> + rc = rte_htimer_mgr_async_cancel(timer,
> + record_async_cancel_cb,
> + state);
> +
> + if (unlikely(rc == -EBUSY)) {
> + puts("busy");
> + rte_htimer_mgr_process();
> + } else
> + break;
> + }
> +
> + canceled++;
> + }
> + }
> +
> + for (;;) {
> + int cancel_completed;
> +
> + cancel_completed = records_check_consistency(recorder_states,
> + num_total_timers);
> +
> + if (cancel_completed < 0) {
> + puts("Inconsistency found");
> + return TEST_FAILED;
> + }
> +
> + if (cancel_completed == canceled)
> + break;
> +
> + rte_htimer_mgr_process();
> + }
> +
> + __atomic_store_n(&stop, true, __ATOMIC_RELAXED);
> +
> + rte_eal_mp_wait_lcore();
> +
> + rte_htimer_mgr_deinit();
> +
> + free(timers);
> + free(recorder_states);
> +
> + return TEST_SUCCESS;
> +}
> +
> +/*
> + * This is a test case where one thread asynchronously adds two timers,
> + * with the same expiration time; one on the local lcore and one on a
> + * remote lcore. This creates a tricky situation for the timer
> + * manager, and for the application as well, if the htimer struct is
> + * dynamically allocated.
> + */
> +
> +struct test_timer {
> + uint32_t ref_cnt;
> + uint64_t expiration_time; /* in TSC, not tick */
> + uint32_t *timeout_count;
> + bool *failure_occurred;
> + struct rte_htimer htimer;
> +};
> +
> +
> +static struct test_timer *
> +test_timer_create(uint64_t expiration_time, uint32_t *timeout_count,
> + bool *failure_occurred)
> +{
> + struct test_timer *timer;
> +
> + timer = malloc(sizeof(struct test_timer));
> +
> + if (timer == NULL)
> + rte_panic("Unable to allocate timer memory\n");
> +
> + timer->ref_cnt = 1;
> + timer->expiration_time = expiration_time;
> + timer->timeout_count = timeout_count;
> + timer->failure_occurred = failure_occurred;
> +
> + return timer;
> +}
> +
> +static void
> +test_timer_inc_ref_cnt(struct test_timer *timer)
> +{
> + __atomic_add_fetch(&timer->ref_cnt, 1, __ATOMIC_RELEASE);
__atomic_fetch_add instead please
there's future work to align with C11 atomics; __atomic_fetch_<op> is
preferred because the migration then mostly becomes
s/__atomic/atomic/ (well, mostly...)
> +}
> +
> +static void
> +test_timer_dec_ref_cnt(struct test_timer *timer)
> +{
> + if (timer != NULL) {
> + uint32_t cnt = __atomic_sub_fetch(&timer->ref_cnt, 1,
> + __ATOMIC_RELEASE);
same here
i'll try to get a patch up for checkpatches warning soon.
thanks!
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC v2 2/2] eal: add high-performance timer facility
2023-03-15 17:03 ` [RFC v2 2/2] eal: add high-performance timer facility Mattias Rönnblom
2023-03-16 3:55 ` Tyler Retzlaff
@ 2023-03-17 1:58 ` Stephen Hemminger
2023-03-22 12:18 ` Morten Brørup
` (2 subsequent siblings)
4 siblings, 0 replies; 31+ messages in thread
From: Stephen Hemminger @ 2023-03-17 1:58 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Morten Brørup, Tyler Retzlaff
On Wed, 15 Mar 2023 18:03:42 +0100
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> The htimer library attempts to provide a timer facility with roughly
> the same functionality as, but less overhead and better scalability
> than, the DPDK timer library.
>
> The htimer library employs per-lcore hierarchical timer wheels and a
> message-based synchronization/MT-safety scheme.
>
> RFC v2:
> * Fix spelling.
> * Fix signed/unsigned comparisons and discontinue the use of name-less
> function parameters, both of which may result in compiler warnings.
> * Undo the accidental removal of the bitset tests from the 'fast_tests'.
> * Add a number of missing include files, causing build failures
> (e.g., on AArch64 builds).
> * Add perf test attempting to compare rte_timer, rte_htimer and rte_htw.
> * Use nanoseconds (instead of TSC) as the default time unit.
> * add() and manage() have flags which allow the caller to specify the
> time unit (nanoseconds, TSC, or ticks) for the times provided.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
Initial reactions.
Good:
- timer API does need work
- the units and the API model seem good; would have to look at real applications
- tests look good as well.
Bad:
- why do we need a new timer infrastructure? Could this not be done
by extending and embracing the existing rte_timer() APIs?
- having a fast rte_timer() would make existing apps faster.
PS:
- ok to drop all the rte_alt_timer stuff; don't think any application
depends on it. My survey of GitHub projects found only one usage
(OpenDataplane).
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [RFC v2 2/2] eal: add high-performance timer facility
2023-03-15 17:03 ` [RFC v2 2/2] eal: add high-performance timer facility Mattias Rönnblom
2023-03-16 3:55 ` Tyler Retzlaff
2023-03-17 1:58 ` Stephen Hemminger
@ 2023-03-22 12:18 ` Morten Brørup
2023-04-03 12:04 ` Mattias Rönnblom
2023-03-24 16:00 ` Morten Brørup
2023-07-06 22:41 ` Stephen Hemminger
4 siblings, 1 reply; 31+ messages in thread
From: Morten Brørup @ 2023-03-22 12:18 UTC (permalink / raw)
To: Mattias Rönnblom, dev
Cc: Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Stephen Hemminger, Tyler Retzlaff
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, 15 March 2023 18.04
> +++ b/lib/htimer/rte_htimer.h
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Ericsson AB
> + */
> +
> +#ifndef _RTE_HTIMER_H_
> +#define _RTE_HTIMER_H_
> +
> +#include <stdbool.h>
> +#include <stdint.h>
> +#include <sys/queue.h>
> +
> +#include <rte_bitops.h>
> +
> +struct rte_htimer;
> +
> +typedef void (*rte_htimer_cb_t)(struct rte_htimer *, void *);
> +
> +struct rte_htimer {
> + /**
> + * Absolute timer expiration time (in ticks).
> + */
> + uint64_t expiration_time;
> + /**
> + * Time between expirations (in ticks). Zero for one-shot timers.
> + */
> + uint64_t period;
> + /**
> + * Owning lcore. May safely be read from any thread.
> + */
> + uint32_t owner_lcore_id;
> + /**
> + * The current state of the timer.
> + */
> + uint32_t state:4;
> + /**
> + * Flags set on this timer.
> + */
> + uint32_t flags:28;
> + /**
> + * User-specified callback function pointer.
> + */
> + rte_htimer_cb_t cb;
> + /**
> + * Argument for user callback.
> + */
> + void *cb_arg;
> + /**
> + * Pointers used to add timer to various internal lists.
> + */
> + LIST_ENTRY(rte_htimer) entry;
> +};
If the rte_htimer structure is supposed to be used in some other data structure, e.g. in a TCP/IP flow structure, it seems unnecessarily bloated.
Generally, if there is no significant performance benefit to the "period" feature, please remove it.
Let's say that this library is used for handling the timers of flows in an IP stack, then the vast majority of timers will be timers related to flows. I would prefer if this high-performance timer library is optimized for such high-volume use cases, rather than offering generic features for low-volume use cases.
And if one HTW instance is used for a single purpose (e.g. the IP stack state machine), both "cb" and "cb_arg" can be removed: The application can derive the pointer to the flow by using container_of() with the pointer to the rte_htimer, and the cb_arg will effectively be a shadow variable of the flow's state anyway (if not just a pointer to the flow).
Here's an idea, which will offer both: For the high-volume single-purpose use cases you could provide a struct rte_htimer_core without the generic fields, and for the generic use cases, you could provide a struct rte_htimer containing a struct rte_htimer_core and the additional fields for generic use.
> +
> +#define RTE_HTIMER_FLAG_ABSOLUTE_TIME RTE_BIT32(0)
> +#define RTE_HTIMER_FLAG_PERIODICAL RTE_BIT32(1)
> +#define RTE_HTIMER_FLAG_TIME_TICK RTE_BIT32(2)
> +#define RTE_HTIMER_FLAG_TIME_TSC RTE_BIT32(3)
> +
> +#define RTE_HTIMER_STATE_PENDING 1
> +#define RTE_HTIMER_STATE_EXPIRED 2
> +#define RTE_HTIMER_STATE_CANCELED 3
> +
> +LIST_HEAD(rte_htimer_list, rte_htimer);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_HTIMER_H_ */
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [RFC v2 2/2] eal: add high-performance timer facility
2023-03-15 17:03 ` [RFC v2 2/2] eal: add high-performance timer facility Mattias Rönnblom
` (2 preceding siblings ...)
2023-03-22 12:18 ` Morten Brørup
@ 2023-03-24 16:00 ` Morten Brørup
2023-07-06 22:41 ` Stephen Hemminger
4 siblings, 0 replies; 31+ messages in thread
From: Morten Brørup @ 2023-03-24 16:00 UTC (permalink / raw)
To: Mattias Rönnblom, dev
Cc: Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Stephen Hemminger, Tyler Retzlaff
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Wednesday, 15 March 2023 18.04
>
> The htimer library attempts to provide a timer facility with roughly
> the same functionality as, but less overhead and better scalability
> than, the DPDK timer library.
>
> The htimer library employs per-lcore hierarchical timer wheels and a
> message-based synchronization/MT-safety scheme.
>
> RFC v2:
> * Fix spelling.
> * Fix signed/unsigned comparisons and discontinue the use of name-less
> function parameters, both of which may result in compiler warnings.
> * Undo the accidental removal of the bitset tests from the 'fast_tests'.
> * Add a number of missing include files, causing build failures
> (e.g., on AArch64 builds).
> * Add perf test attempting to compare rte_timer, rte_htimer and rte_htw.
> * Use nanoseconds (instead of TSC) as the default time unit.
> * add() and manage() have flags which allow the caller to specify the
> time unit (nanoseconds, TSC, or ticks) for the times provided.
>
> Signed-off-by: Mattias Rönnblom <mattias.ronnblom@ericsson.com>
> ---
Two more series of comments, see inline below:
1. Arguing for using "tick" as the default unit of time.
2. Some bugs in the time conversion functions.
[...]
> diff --git a/lib/htimer/meson.build b/lib/htimer/meson.build
> new file mode 100644
> index 0000000000..2dd5d6a24b
> --- /dev/null
> +++ b/lib/htimer/meson.build
> @@ -0,0 +1,7 @@
> +# SPDX-License-Identifier: BSD-3-Clause
> +# Copyright(c) 2023 Ericsson AB
> +
> +sources = files('rte_htw.c', 'rte_htimer_msg_ring.c', 'rte_htimer_mgr.c')
> +headers = files('rte_htimer_mgr.h', 'rte_htimer.h')
> +
> +deps += ['ring']
> diff --git a/lib/htimer/rte_htimer.h b/lib/htimer/rte_htimer.h
> new file mode 100644
> index 0000000000..6ac86292b5
> --- /dev/null
> +++ b/lib/htimer/rte_htimer.h
> @@ -0,0 +1,68 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Ericsson AB
> + */
> +
> +#ifndef _RTE_HTIMER_H_
> +#define _RTE_HTIMER_H_
> +
> +#include <stdbool.h>
> +#include <stdint.h>
> +#include <sys/queue.h>
> +
> +#include <rte_bitops.h>
> +
> +struct rte_htimer;
> +
> +typedef void (*rte_htimer_cb_t)(struct rte_htimer *, void *);
> +
> +struct rte_htimer {
> + /**
> + * Absolute timer expiration time (in ticks).
> + */
> + uint64_t expiration_time;
> + /**
> + * Time between expirations (in ticks). Zero for one-shot timers.
> + */
> + uint64_t period;
> + /**
> + * Owning lcore. May safely be read from any thread.
> + */
> + uint32_t owner_lcore_id;
> + /**
> + * The current state of the timer.
> + */
> + uint32_t state:4;
> + /**
> + * Flags set on this timer.
> + */
> + uint32_t flags:28;
> + /**
> + * User-specified callback function pointer.
> + */
> + rte_htimer_cb_t cb;
> + /**
> + * Argument for user callback.
> + */
> + void *cb_arg;
> + /**
> + * Pointers used to add timer to various internal lists.
> + */
> + LIST_ENTRY(rte_htimer) entry;
> +};
> +
> +#define RTE_HTIMER_FLAG_ABSOLUTE_TIME RTE_BIT32(0)
> +#define RTE_HTIMER_FLAG_PERIODICAL RTE_BIT32(1)
> +#define RTE_HTIMER_FLAG_TIME_TICK RTE_BIT32(2)
> +#define RTE_HTIMER_FLAG_TIME_TSC RTE_BIT32(3)
After further consideration, and taking the time conversion functions into account, I think the default unit of time should be "tick", not nanoseconds. It seems more natural, and might offer more flexibility in the future.
So instead of:
+#define RTE_HTIMER_FLAG_TIME_TICK RTE_BIT32(2)
+#define RTE_HTIMER_FLAG_TIME_TSC RTE_BIT32(3)
then:
+#define RTE_HTIMER_FLAG_TIME_TSC RTE_BIT32(2)
+#define RTE_HTIMER_FLAG_TIME_NS RTE_BIT32(3)
and perhaps in the future:
+#define RTE_HTIMER_FLAG_TIME_US RTE_BIT32(4)
> +
> +#define RTE_HTIMER_STATE_PENDING 1
> +#define RTE_HTIMER_STATE_EXPIRED 2
> +#define RTE_HTIMER_STATE_CANCELED 3
> +
> +LIST_HEAD(rte_htimer_list, rte_htimer);
> +
> +#ifdef __cplusplus
> +}
> +#endif
> +
> +#endif /* _RTE_HTIMER_H_ */
> diff --git a/lib/htimer/rte_htimer_mgr.c b/lib/htimer/rte_htimer_mgr.c
> new file mode 100644
> index 0000000000..efdfcf0985
> --- /dev/null
> +++ b/lib/htimer/rte_htimer_mgr.c
> @@ -0,0 +1,547 @@
> +/* SPDX-License-Identifier: BSD-3-Clause
> + * Copyright(c) 2023 Ericsson AB
> + */
> +
> +#include <inttypes.h>
> +#include <math.h>
> +#include <stdbool.h>
> +#include <sys/queue.h>
> +#include <unistd.h>
> +
> +#include <rte_branch_prediction.h>
> +#include <rte_common.h>
> +#include <rte_cycles.h>
> +#include <rte_errno.h>
> +#include <rte_htw.h>
> +#include <rte_prefetch.h>
> +#include <rte_ring_elem.h>
> +
> +#include "rte_htimer_mgr.h"
> +#include "rte_htimer_msg.h"
> +#include "rte_htimer_msg_ring.h"
> +
> +#define MAX_MSG_BATCH_SIZE 16
> +
> +struct htimer_mgr {
> + struct rte_htimer_msg_ring *msg_ring;
> + struct rte_htw *htw;
> +
> + unsigned int async_msgs_idx __rte_cache_aligned;
> + unsigned int num_async_msgs;
> + struct rte_htimer_msg async_msgs[MAX_MSG_BATCH_SIZE];
> +} __rte_cache_aligned;
> +
> +static uint64_t ns_per_tick;
> +static double tsc_per_tick;
> +
> +static struct htimer_mgr mgrs[RTE_MAX_LCORE + 1];
> +
> +#define MAX_ASYNC_TRANSACTIONS 1024
> +#define MSG_RING_SIZE MAX_ASYNC_TRANSACTIONS
> +
> +static inline uint64_t
> +tsc_to_tick(uint64_t tsc)
> +{
> + return tsc / tsc_per_tick;
> +}
> +
> +static inline uint64_t
> +tsc_to_tick_round_up(uint64_t tsc)
> +{
> + uint64_t tick;
> +
> + tick = (tsc + tsc_per_tick / 2) / tsc_per_tick;
This does not round up; it rounds to nearest.
E.g. tsc_per_tick=10.0, tsc=1 becomes (1 + 5.0) / 10.0 = 0.6, which becomes 0 (when converted to integer).
E.g. tsc_per_tick=10.0, tsc=5 becomes (5 + 5.0) / 10.0 = 1.0, which becomes 1.
> +
> + return tick;
> +}
> +
> +static inline uint64_t
> +ns_to_tick(uint64_t ns)
> +{
> + return ns / ns_per_tick;
> +}
> +
> +static inline uint64_t
> +ns_to_tick_round_up(uint64_t ns)
> +{
> + uint64_t tick;
> +
> + tick = ceil(ns / ns_per_tick);
ns_per_tick is integer, not floating point, so the division is performed as integer division, and ceil() has no effect; i.e. the above is the same as:
tick = ns / ns_per_tick;
Which also means that it does not round up.
> +
> + return tick;
> +}
> +
> +static inline uint64_t
> +tick_to_ns(uint64_t tick)
> +{
> + return tick * ns_per_tick;
> +}
> +
> +static struct htimer_mgr *
> +mgr_get(unsigned int lcore_id)
> +{
> + return &mgrs[lcore_id];
> +}
> +
> +static int
> +mgr_init(unsigned int lcore_id)
> +{
> + char ring_name[RTE_RING_NAMESIZE];
> + unsigned int socket_id;
> + struct htimer_mgr *mgr = &mgrs[lcore_id];
> +
> + socket_id = rte_lcore_to_socket_id(lcore_id);
> +
> + snprintf(ring_name, sizeof(ring_name), "htimer_%d", lcore_id);
> +
> + mgr->msg_ring =
> + rte_htimer_msg_ring_create(ring_name, MSG_RING_SIZE, socket_id,
> + RING_F_SC_DEQ);
> +
> + if (mgr->msg_ring == NULL)
> + goto err;
> +
> + mgr->htw = rte_htw_create();
> +
> + if (mgr->htw == NULL)
> + goto err_free_ring;
> +
> + mgr->async_msgs_idx = 0;
> + mgr->num_async_msgs = 0;
> +
> + return 0;
> +
> +err_free_ring:
> + rte_htimer_msg_ring_free(mgr->msg_ring);
> +err:
> + return -ENOMEM;
> +}
> +
> +static void
> +mgr_deinit(unsigned int lcore_id)
> +{
> + struct htimer_mgr *mgr = &mgrs[lcore_id];
> +
> + rte_htw_destroy(mgr->htw);
> +
> + rte_htimer_msg_ring_free(mgr->msg_ring);
> +}
> +
> +static volatile bool initialized;
> +
> +static void
> +assure_initialized(void)
> +{
> + RTE_ASSERT(initialized);
> +}
> +
> +int
> +rte_htimer_mgr_init(uint64_t _ns_per_tick)
> +{
> + unsigned int lcore_id;
> +
> + RTE_VERIFY(!initialized);
> +
> + ns_per_tick = _ns_per_tick;
> +
> + tsc_per_tick = (ns_per_tick / 1e9) * rte_get_tsc_hz();
> +
> + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++) {
> + int rc;
> +
> + rc = mgr_init(lcore_id);
> +
> + if (rc < 0) {
> + unsigned int deinit_lcore_id;
> +
> + for (deinit_lcore_id = 0; deinit_lcore_id < lcore_id;
> + deinit_lcore_id++)
> + mgr_deinit(deinit_lcore_id);
> +
> + return rc;
> + }
> + }
> +
> + initialized = true;
> +
> + return 0;
> +}
> +
> +void
> +rte_htimer_mgr_deinit(void)
> +{
> + unsigned int lcore_id;
> +
> + assure_initialized();
> +
> + for (lcore_id = 0; lcore_id < RTE_MAX_LCORE; lcore_id++)
> + mgr_deinit(lcore_id);
> +
> + initialized = false;
> +}
> +
> +static void
> +assure_valid_time_conversion_flags(uint32_t flags __rte_unused)
> +{
> + RTE_ASSERT(!((flags & RTE_HTIMER_FLAG_TIME_TSC) &&
> + (flags & RTE_HTIMER_FLAG_TIME_TICK)));
With my above suggestion of using tick as default time unit, this would be changed to:
+ RTE_ASSERT(!((flags & RTE_HTIMER_FLAG_TIME_TSC) &&
+ (flags & RTE_HTIMER_FLAG_TIME_NS) &&
+ (flags & RTE_HTIMER_FLAG_TIME_US)));
> +}
> +
> +static void
> +assure_valid_add_flags(uint32_t flags)
> +{
> + assure_valid_time_conversion_flags(flags);
> +
> + RTE_ASSERT(!(flags & ~(RTE_HTIMER_FLAG_PERIODICAL |
> + RTE_HTIMER_FLAG_ABSOLUTE_TIME |
> + RTE_HTIMER_FLAG_TIME_TSC |
> + RTE_HTIMER_FLAG_TIME_TICK)));
With my above suggestion of using tick as default time unit, this would be changed to:
+ RTE_ASSERT(!(flags & ~(RTE_HTIMER_FLAG_PERIODICAL |
+ RTE_HTIMER_FLAG_ABSOLUTE_TIME |
+ RTE_HTIMER_FLAG_TIME_TSC |
+ RTE_HTIMER_FLAG_TIME_NS |
+ RTE_HTIMER_FLAG_TIME_US)));
> +}
> +
> +static uint64_t
> +convert_time(uint64_t t, uint32_t flags)
> +{
> + if (flags & RTE_HTIMER_FLAG_TIME_TSC)
> + return tsc_to_tick(t);
> + else if (flags & RTE_HTIMER_FLAG_TIME_TICK)
> + return t;
> + else
> + return ns_to_tick(t);
With my above suggestion of using tick as default time unit, this would be changed to:
+ if (flags & RTE_HTIMER_FLAG_TIME_TSC)
+ return tsc_to_tick(t);
+ else if (flags & RTE_HTIMER_FLAG_TIME_NS)
+ return ns_to_tick(t);
+ else if (flags & RTE_HTIMER_FLAG_TIME_US)
+ return us_to_tick(t);
+ else
+ return t;
> +}
> +
> +void
> +rte_htimer_mgr_add(struct rte_htimer *timer, uint64_t expiration_time,
> + uint64_t period, rte_htimer_cb_t timer_cb,
> + void *timer_cb_arg, uint32_t flags)
> +{
> + unsigned int lcore_id = rte_lcore_id();
> + struct htimer_mgr *mgr = mgr_get(lcore_id);
> + uint64_t expiration_time_tick;
> + uint64_t period_tick;
> +
> + assure_initialized();
> +
> + assure_valid_add_flags(flags);
> +
> + expiration_time_tick = convert_time(expiration_time, flags);
> +
> + period_tick = convert_time(period, flags);
> +
> + rte_htw_add(mgr->htw, timer, expiration_time_tick, period_tick,
> + timer_cb, timer_cb_arg, flags);
> +
> + timer->owner_lcore_id = lcore_id;
> +}
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC v2 2/2] eal: add high-performance timer facility
2023-03-22 12:18 ` Morten Brørup
@ 2023-04-03 12:04 ` Mattias Rönnblom
2023-04-04 7:32 ` Morten Brørup
0 siblings, 1 reply; 31+ messages in thread
From: Mattias Rönnblom @ 2023-04-03 12:04 UTC (permalink / raw)
To: Morten Brørup, dev
Cc: Erik Gabriel Carrillo, David Marchand, Maria Lingemark,
Stefan Sundkvist, Stephen Hemminger, Tyler Retzlaff
On 2023-03-22 13:18, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
>> Sent: Wednesday, 15 March 2023 18.04
>
>> +++ b/lib/htimer/rte_htimer.h
>> @@ -0,0 +1,68 @@
>> +/* SPDX-License-Identifier: BSD-3-Clause
>> + * Copyright(c) 2023 Ericsson AB
>> + */
>> +
>> +#ifndef _RTE_HTIMER_H_
>> +#define _RTE_HTIMER_H_
>> +
>> +#include <stdbool.h>
>> +#include <stdint.h>
>> +#include <sys/queue.h>
>> +
>> +#include <rte_bitops.h>
>> +
>> +struct rte_htimer;
>> +
>> +typedef void (*rte_htimer_cb_t)(struct rte_htimer *, void *);
>> +
>> +struct rte_htimer {
>> + /**
>> + * Absolute timer expiration time (in ticks).
>> + */
>> + uint64_t expiration_time;
>> + /**
>> + * Time between expirations (in ticks). Zero for one-shot timers.
>> + */
>> + uint64_t period;
>> + /**
>> + * Owning lcore. May safely be read from any thread.
>> + */
>> + uint32_t owner_lcore_id;
>> + /**
>> + * The current state of the timer.
>> + */
>> + uint32_t state:4;
>> + /**
>> + * Flags set on this timer.
>> + */
>> + uint32_t flags:28;
>> + /**
>> + * User-specified callback function pointer.
>> + */
>> + rte_htimer_cb_t cb;
>> + /**
>> + * Argument for user callback.
>> + */
>> + void *cb_arg;
>> + /**
>> + * Pointers used to add timer to various internal lists.
>> + */
>> + LIST_ENTRY(rte_htimer) entry;
>> +};
>
> If the rte_htimer structure is supposed to be used in some other data structure, e.g. in a TCP/IP flow structure, it seems unnecessarily bloated.
>
> Generally, if there is no significant performance benefit to the "period" feature, please remove it.
>
> Let's say that this library is used for handling the timers of flows in an IP stack, then the vast majority of timers will be timers related to flows. I would prefer if this high-performance timer library is optimized for such high-volume use cases, rather than offering generic features for low-volume use cases.
>
> And if one HTW instance is used for a single purpose (e.g. the IP stack state machine), both "cb" and "cb_arg" can be removed: The application can derive the pointer to the flow by using container_of() with the pointer to the rte_htimer, and the cb_arg will effectively be a shadow variable of the flow's state anyway (if not just a pointer to the flow).
>
> Here's an idea, which will offer both: For the high-volume single-purpose use cases you could provide a struct rte_htimer_core without the generic fields, and for the generic use cases, you could provide a struct rte_htimer containing a struct rte_htimer_core and the additional fields for generic use.
>
>>
Good points.
I will look into:
a) making <rte_htw.h> public
b) split rte_htimer into two timer structs (where the now-public
rte_htw_timer struct may be used from the rte_htimer_timer struct).
c) ...where the htw timer struct won't have any callbacks
d) merge rte_htimer_timer.h into rte_htimer.h.
e) remove the periodic feature, at least from the core timer wheel
>> +#define RTE_HTIMER_FLAG_ABSOLUTE_TIME RTE_BIT32(0)
>> +#define RTE_HTIMER_FLAG_PERIODICAL RTE_BIT32(1)
>> +#define RTE_HTIMER_FLAG_TIME_TICK RTE_BIT32(2)
>> +#define RTE_HTIMER_FLAG_TIME_TSC RTE_BIT32(3)
>> +
>> +#define RTE_HTIMER_STATE_PENDING 1
>> +#define RTE_HTIMER_STATE_EXPIRED 2
>> +#define RTE_HTIMER_STATE_CANCELED 3
>> +
>> +LIST_HEAD(rte_htimer_list, rte_htimer);
>> +
>> +#ifdef __cplusplus
>> +}
>> +#endif
>> +
>> +#endif /* _RTE_HTIMER_H_ */
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [RFC v2 2/2] eal: add high-performance timer facility
2023-04-03 12:04 ` Mattias Rönnblom
@ 2023-04-04 7:32 ` Morten Brørup
0 siblings, 0 replies; 31+ messages in thread
From: Morten Brørup @ 2023-04-04 7:32 UTC (permalink / raw)
To: Mattias Rönnblom, dev
Cc: Erik Gabriel Carrillo, David Marchand, Maria Lingemark,
Stefan Sundkvist, Stephen Hemminger, Tyler Retzlaff
> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> Sent: Monday, 3 April 2023 14.04
>
> On 2023-03-22 13:18, Morten Brørup wrote:
> >> From: Mattias Rönnblom [mailto:mattias.ronnblom@ericsson.com]
> >> Sent: Wednesday, 15 March 2023 18.04
> >
> >> +++ b/lib/htimer/rte_htimer.h
> >> @@ -0,0 +1,68 @@
> >> +/* SPDX-License-Identifier: BSD-3-Clause
> >> + * Copyright(c) 2023 Ericsson AB
> >> + */
> >> +
> >> +#ifndef _RTE_HTIMER_H_
> >> +#define _RTE_HTIMER_H_
> >> +
> >> +#include <stdbool.h>
> >> +#include <stdint.h>
> >> +#include <sys/queue.h>
> >> +
> >> +#include <rte_bitops.h>
> >> +
> >> +struct rte_htimer;
> >> +
> >> +typedef void (*rte_htimer_cb_t)(struct rte_htimer *, void *);
> >> +
> >> +struct rte_htimer {
> >> + /**
> >> + * Absolute timer expiration time (in ticks).
> >> + */
> >> + uint64_t expiration_time;
> >> + /**
> >> + * Time between expirations (in ticks). Zero for one-shot timers.
> >> + */
> >> + uint64_t period;
> >> + /**
> >> + * Owning lcore. May safely be read from any thread.
> >> + */
> >> + uint32_t owner_lcore_id;
> >> + /**
> >> + * The current state of the timer.
> >> + */
> >> + uint32_t state:4;
> >> + /**
> >> + * Flags set on this timer.
> >> + */
> >> + uint32_t flags:28;
> >> + /**
> >> + * User-specified callback function pointer.
> >> + */
> >> + rte_htimer_cb_t cb;
> >> + /**
> >> + * Argument for user callback.
> >> + */
> >> + void *cb_arg;
> >> + /**
> >> + * Pointers used to add timer to various internal lists.
> >> + */
> >> + LIST_ENTRY(rte_htimer) entry;
> >> +};
> >
> > If the rte_htimer structure is supposed to be used in some other data
> structure, e.g. in a TCP/IP flow structure, it seems unnecessarily
> bloated.
> >
> > Generally, if there is no significant performance benefit to the
> "period" feature, please remove it.
> >
> > Let's say that this library is used for handling the timers of flows
> in an IP stack, then the vast majority of timers will be timers related
> to flows. I would prefer if this high-performance timer library is
> optimized for such high-volume use cases, rather than offering generic
> features for low-volume use cases.
> >
> > And if one HTW instance is used for a single purpose (e.g. the IP
> stack state machine), both "cb" and "cb_arg" can be removed: The
> application can derive the pointer to the flow by the using
> container_of() with the pointer to the rte_htimer, and the cb_arg will
> effectively be a shadow variable of the flow's state anyway (if not just
> a pointer to the flow).
> >
> > Here's an idea, which will offer both: For the high-volume single-
> purpose use cases you could provide a struct rte_htimer_core without the
> generic fields, and for the generic use cases, you could provide a
> struct rte_htimer containing a struct rte_htimer_core and the additional
> fields for generic use.
> >
> >>
>
> Good points.
>
> I will look into:
> a) making <rte_htw.h> public
> b) split rte_htimer into two timer structs (where the now-public
> rte_htw_timer struct may be used from the rte_htimer_timer struct).
> c) ...where the htw timer struct won't have any callbacks
> d) merge rte_htimer_timer.h into rte_htimer.h.
> e) remove the periodic feature, at least from the core timer wheel
Sounds good, Mattias. Looking forward to reviewing the next version. :-)
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC v2 2/2] eal: add high-performance timer facility
2023-03-15 17:03 ` [RFC v2 2/2] eal: add high-performance timer facility Mattias Rönnblom
` (3 preceding siblings ...)
2023-03-24 16:00 ` Morten Brørup
@ 2023-07-06 22:41 ` Stephen Hemminger
2023-07-12 8:58 ` Mattias Rönnblom
4 siblings, 1 reply; 31+ messages in thread
From: Stephen Hemminger @ 2023-07-06 22:41 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Morten Brørup, Tyler Retzlaff
On Wed, 15 Mar 2023 18:03:42 +0100
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> The htimer library attempts to provide a timer facility with roughly
> the same functionality as, but less overhead and better scalability
> than, the DPDK timer library.
I don't understand. Why not just fix and extend the existing timers?
Sure, you will need to add some APIs and maybe drop some of the existing
experimental ones (i.e., alt_timer). Even change the ABI.
It would be better to have one high-performance, scalable timer than to
spend the next 3 years telling users which one to use and why!
So please make rte_timer work better in the 23.11 release rather
than reinventing a new variant.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC v2 2/2] eal: add high-performance timer facility
2023-07-06 22:41 ` Stephen Hemminger
@ 2023-07-12 8:58 ` Mattias Rönnblom
0 siblings, 0 replies; 31+ messages in thread
From: Mattias Rönnblom @ 2023-07-12 8:58 UTC (permalink / raw)
To: Stephen Hemminger
Cc: dev, Erik Gabriel Carrillo, David Marchand, Maria Lingemark,
Stefan Sundkvist, Morten Brørup, Tyler Retzlaff,
Mattias Rönnblom
On 2023-07-07 00:41, Stephen Hemminger wrote:
> On Wed, 15 Mar 2023 18:03:42 +0100
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>
>> The htimer library attempts to provide a timer facility with roughly
>> the same functionality as, but less overhead and better scalability
>> than, the DPDK timer library.
>
> I don't understand. Why not just fix and extend the existing timers?
> Sure, you will need to add some APIs and maybe drop some of the existing
> experimental ones (i.e., alt_timer). Even change the ABI.
>
> It would be better to have one high-performance, scalable timer than to
> spend the next 3 years telling users which one to use and why!
>
> So please make rte_timer work better in the 23.11 release rather
> than reinventing a new variant.
I wanted to explore what a data plane timer API should look like;
something like a "first principles" type approach. As it happens, it
seems like I will converge on something pretty similar to how the
rte_timer API (and most kernel timer APIs) works, for example with
regard to timer memory allocation.
Clearly, there should not be two DPDK timer APIs that provide the same
functionality. That was never the intention. Since so much DPDK code
and, more importantly, application code depends on <rte_timer.h>, it
wasn't obvious that the best option was to make extensive changes to the
rte_timer API and implementation. One option that seemed plausible (how
much so depending on the extent of the rte_timer vs. rte_htimer API
differences) was to have a new API, and deprecate <rte_timer.h> in the
release htimer was introduced.
That said, at this point it's not clear to me which is the better
option: "making extensive changes to rte_timer" or "having rte_htimer
on the side for a couple of releases".
An imaginary alternative where the <rte_timer.h> API/ABI can be
maintained, and you get all the performance and scalability and improved
API semantics of htimer, would obviously be the best option. But I don't
think that is possible. Especially not if you want to end up with a
nice, orthogonal API and a clean implementation.
I think changes in both ABI and API are inevitable, and a good thing,
considering some of the quirks for the current API.
A side note: it seems to me at this point that there should be two
public timer APIs, providing different functionality at slightly
different levels of abstraction. One is the <rte_timer.h> lookalike,
and the other is what the current patchset represents with <rte_htw.h>,
but minus the callbacks, as per Morten Brørup's suggestion. The latter
would be a low-level HTW only, with no MT safety, no lcore knowledge,
and no opinions on the time source, etc.
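To make the "low-level HTW" idea concrete, here is a toy, single-level timer wheel fronted by a 64-bit occupancy bitset, in the spirit of the cover letter's description. All names are invented for illustration; a real HTW is hierarchical, handles far-future timers with higher-level wheels instead of leaving them in shared slots, and would scan the bitset with find-first-set rather than tick by tick.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WHEEL_SLOTS 64

struct toy_timer {
	uint64_t expiry;        /* absolute expiration tick */
	struct toy_timer *next; /* intrusive singly-linked slot list */
};

struct toy_wheel {
	uint64_t now;      /* next tick to process */
	uint64_t occupied; /* bit i set <=> slot i is non-empty */
	struct toy_timer *slot[WHEEL_SLOTS];
};

static void
wheel_add(struct toy_wheel *w, struct toy_timer *t)
{
	unsigned int idx = t->expiry % WHEEL_SLOTS;

	t->next = w->slot[idx];
	w->slot[idx] = t;
	w->occupied |= UINT64_C(1) << idx;
}

/* Advance time up to and including 'now', returning the list of
 * expired timers. The occupancy bitset lets empty slots be skipped
 * with a single test. */
static struct toy_timer *
wheel_advance(struct toy_wheel *w, uint64_t now)
{
	struct toy_timer *expired = NULL;

	for (; w->now <= now; w->now++) {
		unsigned int idx = w->now % WHEEL_SLOTS;

		if (!(w->occupied & (UINT64_C(1) << idx)))
			continue; /* empty slot; nothing to do */

		struct toy_timer **prev = &w->slot[idx];

		while (*prev != NULL) {
			struct toy_timer *t = *prev;

			if (t->expiry <= now) {
				*prev = t->next; /* unlink */
				t->next = expired;
				expired = t;
			} else
				prev = &t->next; /* wrapped-around timer */
		}

		if (w->slot[idx] == NULL)
			w->occupied &= ~(UINT64_C(1) << idx);
	}

	return expired;
}
```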
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC v2 0/2] Add high-performance timer facility
2023-03-15 17:03 ` [RFC v2 " Mattias Rönnblom
2023-03-15 17:03 ` [RFC v2 1/2] eal: add bitset type Mattias Rönnblom
2023-03-15 17:03 ` [RFC v2 2/2] eal: add high-performance timer facility Mattias Rönnblom
@ 2024-10-03 18:36 ` Stephen Hemminger
2024-10-03 21:32 ` Morten Brørup
2 siblings, 1 reply; 31+ messages in thread
From: Stephen Hemminger @ 2024-10-03 18:36 UTC (permalink / raw)
To: Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Morten Brørup, Tyler Retzlaff
On Wed, 15 Mar 2023 18:03:40 +0100
Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> This patchset is an attempt to introduce a high-performance, highly
> scalable timer facility into DPDK.
>
> More specifically, the goals for the htimer library are:
>
> * Efficient handling of a handful up to hundreds of thousands of
> concurrent timers.
> * Make adding and canceling timers low-overhead, constant-time
> operations.
> * Provide a service functionally equivalent to that of
> <rte_timer.h>. API/ABI backward compatibility is secondary.
Worthwhile goals, and the problem needs to be addressed.
But this patch never got accepted.
Please fix/improve/extend existing rte_timer instead.
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [RFC v2 0/2] Add high-performance timer facility
2024-10-03 18:36 ` [RFC v2 0/2] Add " Stephen Hemminger
@ 2024-10-03 21:32 ` Morten Brørup
2024-10-06 13:02 ` Mattias Rönnblom
0 siblings, 1 reply; 31+ messages in thread
From: Morten Brørup @ 2024-10-03 21:32 UTC (permalink / raw)
To: Stephen Hemminger, Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Tyler Retzlaff
> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> Sent: Thursday, 3 October 2024 20.37
>
> On Wed, 15 Mar 2023 18:03:40 +0100
> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>
> > This patchset is an attempt to introduce a high-performance, highly
> > scalable timer facility into DPDK.
> >
> > More specifically, the goals for the htimer library are:
> >
> > * Efficient handling of a handful up to hundreds of thousands of
> > concurrent timers.
> > * Make adding and canceling timers low-overhead, constant-time
> > operations.
> > * Provide a service functionally equivalent to that of
> > <rte_timer.h>. API/ABI backward compatibility is secondary.
>
> Worthwhile goals, and the problem needs to be addressed.
> But this patch never got accepted.
I think work on it was put on hold due to the requested changes requiring a significant development effort.
I too look forward to work on this being resumed. ;-)
>
> Please fix/improve/extend existing rte_timer instead.
The rte_timer API is too "fat" for use in the fast path with millions of timers, e.g. TCP flow timers.
Shoehorning a fast path feature into a slow path API is not going to cut it. I support having a separate htimer library with its own API for high volume, high-performance fast path timers.
When striving for low latency across the internet, timing is everything. Packet pacing is the "new" hot thing in congestion control algorithms, and a simple software implementation would require a timer firing once per packet.
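As a sketch of the arithmetic behind per-packet pacing, here is the timestamp computation such an implementation might perform for each packet; in a timer-wheel-based design, every returned timestamp becomes one timer arm operation. The function and its parameters are invented for illustration, not a DPDK API.

```c
#include <assert.h>
#include <stdint.h>

/* Per-packet pacing arithmetic (illustrative, not a DPDK API): given
 * the previous transmit timestamp, compute when the next packet may be
 * sent so the flow does not exceed rate_bps bits per second. hz is the
 * tick frequency of the clock used. */
static uint64_t
pacing_next_tx(uint64_t prev_tx, uint32_t pkt_len, uint64_t rate_bps,
	       uint64_t hz)
{
	/* Ticks needed to serialize pkt_len bytes at rate_bps. */
	uint64_t delta = (uint64_t)pkt_len * 8 * hz / rate_bps;

	return prev_tx + delta;
}
```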
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC v2 0/2] Add high-performance timer facility
2024-10-03 21:32 ` Morten Brørup
@ 2024-10-06 13:02 ` Mattias Rönnblom
2024-10-06 13:43 ` Morten Brørup
0 siblings, 1 reply; 31+ messages in thread
From: Mattias Rönnblom @ 2024-10-06 13:02 UTC (permalink / raw)
To: Morten Brørup, Stephen Hemminger, Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Tyler Retzlaff
On 2024-10-03 23:32, Morten Brørup wrote:
>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>> Sent: Thursday, 3 October 2024 20.37
>>
>> On Wed, 15 Mar 2023 18:03:40 +0100
>> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>>
>>> This patchset is an attempt to introduce a high-performance, highly
>>> scalable timer facility into DPDK.
>>>
>>> More specifically, the goals for the htimer library are:
>>>
>>> * Efficient handling of a handful up to hundreds of thousands of
>>> concurrent timers.
>>> * Make adding and canceling timers low-overhead, constant-time
>>> operations.
>>> * Provide a service functionally equivalent to that of
>>> <rte_timer.h>. API/ABI backward compatibility is secondary.
>>
>> Worthwhile goals, and the problem needs to be addressed.
>> But this patch never got accepted.
>
> I think work on it was put on hold due to the requested changes requiring a significant development effort.
> I too look forward to work on this being resumed. ;-)
>
>>
>> Please fix/improve/extend existing rte_timer instead.
>
> The rte_timer API is too "fat" for use in the fast path with millions of timers, e.g. TCP flow timers.
>
> Shoehorning a fast path feature into a slow path API is not going to cut it. I support having a separate htimer library with its own API for high volume, high-performance fast path timers.
>
> When striving for low latency across the internet, timing is everything. Packet pacing is the "new" hot thing in congestion control algorithms, and a simple software implementation would require a timer firing once per packet.
>
I think DPDK should have two public APIs in the timer area. One is
just a bare-bones hierarchical timer wheel API, without callbacks,
auto-created per-lcore instances, MT safety, or any other of the
<rte_timer.h> bells and whistles. It also doesn't make any assumptions
about the time source (other than it being monotonic) or its resolution.
The other is a new variant of <rte_timer.h>, using the core HTW library
for its implementation (and being public, it may also expose this
library in its header files, which may be required for efficient
operation). The new <rte_timer.h> would provide the same kind of
functionality as the old API, but with some quirks and bugs fixed, plus
potentially some new functionality added. For example, it would be
useful to allow non-preemption-safe threads to add and remove timers
(something rte_timer and its spinlocks don't allow).
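One way to let non-preemption-safe threads arm timers without spinlocks is to have them post requests onto a lock-free MPSC list that the owning lcore drains from its manage function. The following is a minimal C11 sketch of that hand-off (invented names; this is not the proposed htimer API):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct timer_req {
	uint64_t expiry;
	struct timer_req *next;
};

struct req_inbox {
	_Atomic(struct timer_req *) head;
};

/* Any thread may post; lock-free push onto the head of the list. */
static void
inbox_post(struct req_inbox *in, struct timer_req *r)
{
	struct timer_req *old =
	    atomic_load_explicit(&in->head, memory_order_relaxed);

	do {
		r->next = old;
	} while (!atomic_compare_exchange_weak_explicit(&in->head, &old, r,
	    memory_order_release, memory_order_relaxed));
}

/* Only the owning lcore drains; it grabs the whole list in one
 * exchange and can then insert the requests into its timer wheel
 * without any further synchronization. */
static struct timer_req *
inbox_drain(struct req_inbox *in)
{
	return atomic_exchange_explicit(&in->head, NULL,
	    memory_order_acquire);
}
```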
I would consider both "fast path APIs".
In addition, there should probably also be a time source API.
Considering the lead time of relatively small contributions like the
bitops extensions and the new bitset API (which still aren't in), I
can't imagine how long it would take to get a semi-backward-compatible
rte_timer with a new implementation, plus a new timer wheel library,
into DPDK.
^ permalink raw reply [flat|nested] 31+ messages in thread
* RE: [RFC v2 0/2] Add high-performance timer facility
2024-10-06 13:02 ` Mattias Rönnblom
@ 2024-10-06 13:43 ` Morten Brørup
2024-10-06 14:43 ` Mattias Rönnblom
0 siblings, 1 reply; 31+ messages in thread
From: Morten Brørup @ 2024-10-06 13:43 UTC (permalink / raw)
To: Mattias Rönnblom, Stephen Hemminger, Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Tyler Retzlaff
> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
> Sent: Sunday, 6 October 2024 15.03
>
> On 2024-10-03 23:32, Morten Brørup wrote:
> >> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
> >> Sent: Thursday, 3 October 2024 20.37
> >>
> >> On Wed, 15 Mar 2023 18:03:40 +0100
> >> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
> >>
> >>> This patchset is an attempt to introduce a high-performance, highly
> >>> scalable timer facility into DPDK.
> >>>
> >>> More specifically, the goals for the htimer library are:
> >>>
> >>> * Efficient handling of a handful up to hundreds of thousands of
> >>> concurrent timers.
> >>> * Make adding and canceling timers low-overhead, constant-time
> >>> operations.
> >>> * Provide a service functionally equivalent to that of
> >>> <rte_timer.h>. API/ABI backward compatibility is secondary.
> >>
> >> Worthwhile goals, and the problem needs to be addressed.
> >> But this patch never got accepted.
> >
> > I think work on it was put on hold due to the requested changes
> requiring a significant development effort.
> > I too look forward to work on this being resumed. ;-)
> >
> >>
> >> Please fix/improve/extend existing rte_timer instead.
> >
> > The rte_timer API is too "fat" for use in the fast path with millions
> of timers, e.g. TCP flow timers.
> >
> > Shoehorning a fast path feature into a slow path API is not going to
> cut it. I support having a separate htimer library with its own API for
> high volume, high-performance fast path timers.
> >
> > When striving for low latency across the internet, timing is
> everything. Packet pacing is the "new" hot thing in congestion control
> algorithms, and a simple software implementation would require a timer
> firing once per packet.
> >
>
> I think DPDK should have two public APIs in the timer area.
Agree.
> One is
> just a bare-bones hierarchical timer wheel API, without callbacks,
> auto-created per-lcore instances, MT safety or any other of the
> <rte_timer.h> bells and whistles. It also doesn't make any assumptions
> about the time source (other than it being monotonic) or resolution.
The <rte_timer.h> library does not - and is never going to - provide sufficient performance for timer intensive applications, such as packet pacing and fast path TCP/QUIC/whatever congestion control. It is too "fat" for this.
We need a new library with a new API for that.
I agree with Mattias' description of the requirements for such a library.
>
> The other is a new variant of <rte_timer.h>, using the core HTW library
> for its implementation (and being public, it may also expose this
> library in its header files, which may be required for efficient
> operation). The new <rte_timer.h> would provide the same kind of
> functionality as the old API, but with some quirks and bugs fixed, plus
> potentially some new functionality added. For example, it would be
> useful to allow non-preemption-safe threads to add and remove timers
> (something rte_timer and its spinlocks don't allow).
Agree.
Until that becomes part of DPDK, we will have to stick with what <rte_timer.h> currently offers.
>
> I would consider both "fast path APIs".
>
> In addition, there should probably also be a time source API.
A third library, orthogonal to the two other timer libraries.
But I see why you mention it: It could be somewhat related to the design and implementation of the <rte_timer.h> library.
But, let's please forget about a time source API for now.
>
> Considering the lead time of relatively small contributions like the
> bitops extensions and the new bitset API (which still aren't in), I
> can't imagine how long it would take to get a semi-backward
> compatible rte_timer with a new implementation, plus a new timer wheel
> library, into DPDK.
Well said!
Instead of aiming for an unreachable target, let's instead take this approach:
- Provide the new high-performance HTW library as a stand-alone library.
- Postpone improving the <rte_timer.h> library; it can be done any time in the future, if someone cares to do it. And it can use the HTW library or not, whichever is appropriate.
Doing both simultaneously would require a substantial effort, and would cause much backpressure from the community (due to the modified <rte_timer.h> API and implementation).
Although it might be beneficial for the design of the HTW library to consider how an improved <rte_timer.h> would use it, it is not the primary use case of the HTW library, so co-design is not a requirement here.
^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: [RFC v2 0/2] Add high-performance timer facility
2024-10-06 13:43 ` Morten Brørup
@ 2024-10-06 14:43 ` Mattias Rönnblom
0 siblings, 0 replies; 31+ messages in thread
From: Mattias Rönnblom @ 2024-10-06 14:43 UTC (permalink / raw)
To: Morten Brørup, Stephen Hemminger, Mattias Rönnblom
Cc: dev, Erik Gabriel Carrillo, David Marchand, maria.lingemark,
Stefan Sundkvist, Tyler Retzlaff
On 2024-10-06 15:43, Morten Brørup wrote:
>> From: Mattias Rönnblom [mailto:hofors@lysator.liu.se]
>> Sent: Sunday, 6 October 2024 15.03
>>
>> On 2024-10-03 23:32, Morten Brørup wrote:
>>>> From: Stephen Hemminger [mailto:stephen@networkplumber.org]
>>>> Sent: Thursday, 3 October 2024 20.37
>>>>
>>>> On Wed, 15 Mar 2023 18:03:40 +0100
>>>> Mattias Rönnblom <mattias.ronnblom@ericsson.com> wrote:
>>>>
>>>>> This patchset is an attempt to introduce a high-performance, highly
>>>>> scalable timer facility into DPDK.
>>>>>
>>>>> More specifically, the goals for the htimer library are:
>>>>>
>>>>> * Efficient handling of a handful up to hundreds of thousands of
>>>>> concurrent timers.
>>>>> * Make adding and canceling timers low-overhead, constant-time
>>>>> operations.
>>>>> * Provide a service functionally equivalent to that of
>>>>> <rte_timer.h>. API/ABI backward compatibility is secondary.
>>>>
>>>> Worthwhile goals, and the problem needs to be addressed.
>>>> But this patch never got accepted.
>>>
>>> I think work on it was put on hold due to the requested changes
>> requiring a significant development effort.
>>> I too look forward to work on this being resumed. ;-)
>>>
>>>>
>>>> Please fix/improve/extend existing rte_timer instead.
>>>
>>> The rte_timer API is too "fat" for use in the fast path with millions
>> of timers, e.g. TCP flow timers.
>>>
>>> Shoehorning a fast path feature into a slow path API is not going to
>> cut it. I support having a separate htimer library with its own API for
>> high volume, high-performance fast path timers.
>>>
>>> When striving for low latency across the internet, timing is
>> everything. Packet pacing is the "new" hot thing in congestion control
>> algorithms, and a simple software implementation would require a timer
>> firing once per packet.
>>>
>>
>> I think DPDK should have two public APIs in the timer area.
>
> Agree.
>
>> One is
>> just a bare-bones hierarchical timer wheel API, without callbacks,
>> auto-created per-lcore instances, MT safety or any other of the
>> <rte_timer.h> bells and whistles. It also doesn't make any assumptions
>> about the time source (other than it being monotonic) or resolution.
>
> The <rte_timer.h> library does not - and is never going to - provide sufficient performance for timer intensive applications, such as packet pacing and fast path TCP/QUIC/whatever congestion control. It is too "fat" for this.
>
> We need a new library with a new API for that.
> I agree with Mattias' description of the requirements for such a library.
>
>>
>> The other is a new variant of <rte_timer.h>, using the core HTW library
>> for its implementation (and being public, it may also expose this
>> library in its header files, which may be required for efficient
>> operation). The new <rte_timer.h> would provide the same kind of
>> functionality as the old API, but with some quirks and bugs fixed, plus
>> potentially some new functionality added. For example, it would be
>> useful to allow non-preemption-safe threads to add and remove timers
>> (something rte_timer and its spinlocks don't allow).
>
> Agree.
>
> Until that becomes part of DPDK, we will have to stick with what <rte_timer.h> currently offers.
>
>>
>> I would consider both "fast path APIs".
>>
>> In addition, there should probably also be a time source API.
>
> A third library, orthogonal to the two other timer libraries.
> But I see why you mention it: It could be somewhat related to the design and implementation of the <rte_timer.h> library.
> But, let's please forget about a time source API for now.
>
>>
>> Considering the lead time of relatively small contributions like the
>> bitops extensions and the new bitset API (which still aren't in), I
>> can't imagine how long it would take to get a semi-backward
>> compatible rte_timer with a new implementation, plus a new timer wheel
>> library, into DPDK.
>
> Well said!
>
> Instead of aiming for an unreachable target, let's instead take this approach:
> - Provide the new high-performance HTW library as a stand-alone library.
> - Postpone improving the <rte_timer.h> library; it can be done any time in the future, if someone cares to do it. And it can use the HTW library or not, whichever is appropriate.
>
> Doing both simultaneously would require a substantial effort, and would cause much backpressure from the community (due to the modified <rte_timer.h> API and implementation).
>
> Although it might be beneficial for the design of the HTW library to consider how an improved <rte_timer.h> would use it, it is not the primary use case of the HTW library, so co-design is not a requirement here.
>
Postponing rte_timer improvements would also mean postponing most of the
benefits of the new timer wheel, in my opinion.
In most scenarios, I think you want to have all application modules
sharing timer wheel instances, preferably without having to agree on a
proprietary timer API. Here rte_timer shines.
Also, you want to get the HTW library *exactly* right for the rte_timer
use case. Making it a public API would make it painful to later change
its API to address any shortcomings you accidentally designed in. To be
on the safe side, you would need to have a new rte_timer implementation
ready when submitting an HTW library.
That in turn would require a techboard ACK on the necessity of rte_timer
API tweaks, otherwise all your work may be wasted.
^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2024-10-06 14:43 UTC | newest]
Thread overview: 31+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-02-28 9:39 [RFC 0/2] Add high-performance timer facility Mattias Rönnblom
2023-02-28 9:39 ` [RFC 1/2] eal: add bitset type Mattias Rönnblom
2023-02-28 18:46 ` Tyler Retzlaff
2023-03-02 6:31 ` Mattias Rönnblom
2023-03-02 20:39 ` Tyler Retzlaff
2023-02-28 9:39 ` [RFC 2/2] eal: add high-performance timer facility Mattias Rönnblom
2023-03-05 17:25 ` Stephen Hemminger
2023-03-09 15:20 ` Mattias Rönnblom
2023-02-28 16:01 ` [RFC 0/2] Add " Morten Brørup
2023-03-01 11:18 ` Mattias Rönnblom
2023-03-01 13:31 ` Morten Brørup
2023-03-01 15:50 ` Mattias Rönnblom
2023-03-01 17:06 ` Morten Brørup
2023-03-15 17:03 ` [RFC v2 " Mattias Rönnblom
2023-03-15 17:03 ` [RFC v2 1/2] eal: add bitset type Mattias Rönnblom
2023-03-15 17:20 ` Stephen Hemminger
2023-03-15 18:27 ` Mattias Rönnblom
2023-03-15 17:03 ` [RFC v2 2/2] eal: add high-performance timer facility Mattias Rönnblom
2023-03-16 3:55 ` Tyler Retzlaff
2023-03-17 1:58 ` Stephen Hemminger
2023-03-22 12:18 ` Morten Brørup
2023-04-03 12:04 ` Mattias Rönnblom
2023-04-04 7:32 ` Morten Brørup
2023-03-24 16:00 ` Morten Brørup
2023-07-06 22:41 ` Stephen Hemminger
2023-07-12 8:58 ` Mattias Rönnblom
2024-10-03 18:36 ` [RFC v2 0/2] Add " Stephen Hemminger
2024-10-03 21:32 ` Morten Brørup
2024-10-06 13:02 ` Mattias Rönnblom
2024-10-06 13:43 ` Morten Brørup
2024-10-06 14:43 ` Mattias Rönnblom