From: "Mattias Rönnblom" <mattias.ronnblom@ericsson.com>
To: <dev@dpdk.org>
Cc: "Erik Gabriel Carrillo" <erik.g.carrillo@intel.com>,
"David Marchand" <david.marchand@redhat.com>,
maria.lingemark@ericsson.com,
"Stefan Sundkvist" <stefan.sundkvist@ericsson.com>,
"Stephen Hemminger" <stephen@networkplumber.org>,
"Morten Brørup" <mb@smartsharesystems.com>,
"Tyler Retzlaff" <roretzla@linux.microsoft.com>,
"Mattias Rönnblom" <mattias.ronnblom@ericsson.com>
Subject: [RFC v2 0/2] Add high-performance timer facility
Date: Wed, 15 Mar 2023 18:03:40 +0100 [thread overview]
Message-ID: <20230315170342.214127-1-mattias.ronnblom@ericsson.com> (raw)
In-Reply-To: <20230228093916.87206-1-mattias.ronnblom@ericsson.com>
This patchset is an attempt to introduce a high-performance, highly
scalable timer facility into DPDK.
More specifically, the goals for the htimer library are:
* Efficient handling of a handful up to hundreds of thousands of
concurrent timers.
* Make adding and canceling timers low-overhead, constant-time
operations.
* Provide a service functionally equivalent to that of
<rte_timer.h>. API/ABI backward compatibility is secondary.
In the author's opinion, there are two main shortcomings with the
current DPDK timer library (i.e., rte_timer.[ch]).
One is the synchronization overhead, where heavy-weight full-barrier
type synchronization is used. rte_timer.c uses per-EAL/lcore skip
lists, but any thread may add or cancel (or otherwise access) timers
managed by another lcore (and thus resides in its timer skip list).
The other is an algorithmic shortcoming, with rte_timer.c's reliance
on a skip list, which is less efficient than certain alternatives.
This patchset implements a hierarchical timer wheel (HWT, in
rte_htw.c), as per the Varghese and Lauck paper "Hashed and
Hierarchical Timing Wheels: Data Structures for the Efficient
Implementation of a Timer Facility". A HWT is a data structure
purposely design for this task, and used by many operating system
kernel timer facilities.
To further improve the solution described by Varghese and Lauck, a
bitset is placed in front of each of the timer wheel in the HWT,
reducing overhead of rte_htimer_mgr_manage() (i.e., progressing time
and expiry processing).
Cycle-efficient scanning and manipulation of these bitsets are crucial
for the HWT's performance.
The htimer module keeps a per-lcore (or per-registered EAL thread) HWT
instance, much like rte_timer.c keeps a per-lcore skip list.
To avoid expensive synchronization overhead for thread-local timer
management, the HWTs are accessed only from the "owning" thread. Any
interaction any other thread does with a particular lcore's timer
wheel goes over a set of DPDK rings. A side-effect of this design is
that all operations working toward a "remote" HWT must be
asynchronous.
The <rte_htimer.h> API is available only to EAL threads and registered
non-EAL threads.
The htimer API allows the application to supply the current time,
useful in case it already has retrieved this for other purposes,
saving the cost of a rdtsc instruction (or its equivalent).
Relative htimer does not retrieve a new time, but reuse the current
time (as known via/at-the-time of the manage-call), again to shave off
some cycles of overhead.
A semantic improvement compared to the <rte_timer.h> API is that the
htimer library can give a definite answer on the question if the timer
expiry callback was called, after a timer has been canceled.
The patchset includes a performance test case
'timer_htimer_htw_perf_autotest', which compares rte_timer, rte_htimer
and rte_htw timers in the same scenario.
'timer_htimer_htw_perf_autotest' suggests that rte_htimer is ~3-5x
faster than rte_timer for timer/timeout-heavy applications, in a
scenario where the timer always fires. For a scenario with a mix of
canceled and expired timers, the performance difference is greater.
In scenarios with few timeouts, rte_timer has lower overhead than
htimer, but both variants consume very little CPU time.
In certain scenarios, rte_timer does not suffer from
non-constant-time-add and cancel operations. On such is in case the
timer added is always last in the list, where htimer is only ~2-3x
faster.
The bitset implementation which the HWT implementation depends upon
seemed generic-enough and potentially useful outside the world of
HWTs, to justify being located in the EAL.
This patchset is very much an RFC, and the author is yet to form an
opinion on many important issues.
* If deemed a suitable replacement, should the htimer replace the
current DPDK timer library in some particular (ABI-breaking)
release, or should it live side-by-side with the then-legacy
<rte_timer.h> API? A lot of things in and outside DPDK depend on
<rte_timer.h>, so coexistence may be required to facilitate a smooth
transition.
* Should the htimer and htw-related files be colocated with rte_timer.c
in the timer library?
* Would it be useful for applications using asynchronous cancel to
have the option of having the timer callback run not only in case of
timer expiration, but also cancellation (on the target lcore)? The
timer cb signature would need to include an additional parameter in
that case.
* Should the rte_htimer be a nested struct, so the htw parts be separated
from the htimer parts?
* <rte_htimer.h> is kept separate from <rte_htimer_mgr.h>, so that
<rte_htw.h> may avoid a depedency to <rte_htimer_mgr.h>. Should it
be so?
* rte_htimer struct is only supposed to be used by the application to
give an indication of how much memory it needs to allocate, and is
its member are not supposed to be directly accessed (w/ the possible
exception of the owner_lcore_id field). Should there be a dummy
struct, or a #define RTE_HTIMER_MEMSIZE or a rte_htimer_get_memsize()
function instead, serving the same purpose? Better encapsulation,
but more inconvenient for applications. Run-time dynamic sizing
would force application-level dynamic allocations.
* Asynchronous cancellation is a little tricky to use for the
application (primarily due to timer memory reclamation/race
issues). Should this functionality be removed?
* Should rte_htimer_mgr_init() also retrieve the current time? If so,
there should to be a variant which allows the user to specify the
time (to match rte_htimer_mgr_manage_time()). One pitfall with the
current proposed API is an application calling rte_htimer_mgr_init()
and then immediately adding a timer with a relative timeout, in
which case the current absolute time used is 0, which might be a
surprise.
* Would the event timer adapter be best off using <rte_htw.h>
directly, or <rte_htimer.h>? In the latter case, there needs to be a
way to instantiate more HWTs (similar to the "alt" functions of
<rte_timer.h>)?
* Should the PERIODICAL flag (and the complexity it brings) be
removed? And leave the application with only single-shot timers, and
the option to re-add them in the timer callback.
* Should the async result codes and the sync cancel error codes be merged
into one set of result codes?
* Should the rte_htimer_mgr_async_add() have a flag which allow
buffering add request messages until rte_htimer_mgr_process() is
called? Or any manage function. Would reduce ring signaling overhead
(i.e., burst enqueue operations instead of single-element
enqueue). Could also be a rte_htimer_mgr_async_add_burst() function,
solving the same "problem" a different way. (The signature of such
a function would not be pretty.)
* Does the functionality provided by the rte_htimer_mgr_process()
function match its the use cases? Should there me a more clear
separation between expiry processing and asynchronous operation
processing?
* Should the patchset be split into more commits? If so, how?
Thanks to Erik Carrillo for his assistance.
Mattias Rönnblom (2):
eal: add bitset type
eal: add high-performance timer facility
app/test/meson.build | 12 +-
app/test/test_bitset.c | 645 +++++++++++++++++++
app/test/test_htimer_mgr.c | 674 ++++++++++++++++++++
app/test/test_htimer_mgr_perf.c | 322 ++++++++++
app/test/test_htw.c | 478 ++++++++++++++
app/test/test_htw_perf.c | 181 ++++++
app/test/test_timer_htimer_htw_perf.c | 693 ++++++++++++++++++++
doc/api/doxy-api-index.md | 5 +-
doc/api/doxy-api.conf.in | 1 +
lib/eal/common/meson.build | 1 +
lib/eal/common/rte_bitset.c | 29 +
lib/eal/include/meson.build | 1 +
lib/eal/include/rte_bitset.h | 879 ++++++++++++++++++++++++++
lib/eal/version.map | 3 +
lib/htimer/meson.build | 7 +
lib/htimer/rte_htimer.h | 68 ++
lib/htimer/rte_htimer_mgr.c | 547 ++++++++++++++++
lib/htimer/rte_htimer_mgr.h | 516 +++++++++++++++
lib/htimer/rte_htimer_msg.h | 44 ++
lib/htimer/rte_htimer_msg_ring.c | 18 +
lib/htimer/rte_htimer_msg_ring.h | 55 ++
lib/htimer/rte_htw.c | 445 +++++++++++++
lib/htimer/rte_htw.h | 49 ++
lib/htimer/version.map | 17 +
lib/meson.build | 1 +
25 files changed, 5689 insertions(+), 2 deletions(-)
create mode 100644 app/test/test_bitset.c
create mode 100644 app/test/test_htimer_mgr.c
create mode 100644 app/test/test_htimer_mgr_perf.c
create mode 100644 app/test/test_htw.c
create mode 100644 app/test/test_htw_perf.c
create mode 100644 app/test/test_timer_htimer_htw_perf.c
create mode 100644 lib/eal/common/rte_bitset.c
create mode 100644 lib/eal/include/rte_bitset.h
create mode 100644 lib/htimer/meson.build
create mode 100644 lib/htimer/rte_htimer.h
create mode 100644 lib/htimer/rte_htimer_mgr.c
create mode 100644 lib/htimer/rte_htimer_mgr.h
create mode 100644 lib/htimer/rte_htimer_msg.h
create mode 100644 lib/htimer/rte_htimer_msg_ring.c
create mode 100644 lib/htimer/rte_htimer_msg_ring.h
create mode 100644 lib/htimer/rte_htw.c
create mode 100644 lib/htimer/rte_htw.h
create mode 100644 lib/htimer/version.map
--
2.34.1
next prev parent reply other threads:[~2023-03-15 17:09 UTC|newest]
Thread overview: 31+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-02-28 9:39 [RFC " Mattias Rönnblom
2023-02-28 9:39 ` [RFC 1/2] eal: add bitset type Mattias Rönnblom
2023-02-28 18:46 ` Tyler Retzlaff
2023-03-02 6:31 ` Mattias Rönnblom
2023-03-02 20:39 ` Tyler Retzlaff
2023-02-28 9:39 ` [RFC 2/2] eal: add high-performance timer facility Mattias Rönnblom
2023-03-05 17:25 ` Stephen Hemminger
2023-03-09 15:20 ` Mattias Rönnblom
2023-02-28 16:01 ` [RFC 0/2] Add " Morten Brørup
2023-03-01 11:18 ` Mattias Rönnblom
2023-03-01 13:31 ` Morten Brørup
2023-03-01 15:50 ` Mattias Rönnblom
2023-03-01 17:06 ` Morten Brørup
2023-03-15 17:03 ` Mattias Rönnblom [this message]
2023-03-15 17:03 ` [RFC v2 1/2] eal: add bitset type Mattias Rönnblom
2023-03-15 17:20 ` Stephen Hemminger
2023-03-15 18:27 ` Mattias Rönnblom
2023-03-15 17:03 ` [RFC v2 2/2] eal: add high-performance timer facility Mattias Rönnblom
2023-03-16 3:55 ` Tyler Retzlaff
2023-03-17 1:58 ` Stephen Hemminger
2023-03-22 12:18 ` Morten Brørup
2023-04-03 12:04 ` Mattias Rönnblom
2023-04-04 7:32 ` Morten Brørup
2023-03-24 16:00 ` Morten Brørup
2023-07-06 22:41 ` Stephen Hemminger
2023-07-12 8:58 ` Mattias Rönnblom
2024-10-03 18:36 ` [RFC v2 0/2] Add " Stephen Hemminger
2024-10-03 21:32 ` Morten Brørup
2024-10-06 13:02 ` Mattias Rönnblom
2024-10-06 13:43 ` Morten Brørup
2024-10-06 14:43 ` Mattias Rönnblom
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20230315170342.214127-1-mattias.ronnblom@ericsson.com \
--to=mattias.ronnblom@ericsson.com \
--cc=david.marchand@redhat.com \
--cc=dev@dpdk.org \
--cc=erik.g.carrillo@intel.com \
--cc=maria.lingemark@ericsson.com \
--cc=mb@smartsharesystems.com \
--cc=roretzla@linux.microsoft.com \
--cc=stefan.sundkvist@ericsson.com \
--cc=stephen@networkplumber.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).