From: rsanford2@gmail.com
To: dev@dpdk.org
Subject: [dpdk-dev] [PATCH 0/3] timer: fix rte_timer_manage and improve unit tests
Date: Thu, 23 Jul 2015 18:42:24 -0400 [thread overview]
Message-ID: <1437691347-58708-1-git-send-email-rsanford2@gmail.com> (raw)
From: Robert Sanford <rsanford@akamai.com>
This patchset fixes a bug in timer stress test 2, adds a new stress test
to expose a race condition bug in API rte_timer_manage(), and then fixes
the rte_timer_manage() bug.
--
Patch 1, app/test timer stress test 2: Sometimes this test fails and
seg-faults because the slave lcores get out of phase with the master.
The master uses a single int, 'ready', to synchronize multiple slave
lcores through multiple phases of the test.
To resolve, we construct simple synchronization primitives that use one
atomic-int state variable per slave. The master tells the slaves when to
start, and then waits for all of them to finish. Each slave waits for
the master to tell it to start, and then tells the master when it has
finished.
--
Description of rte_timer_manage() race condition bug: Through code
inspection, we noticed a potential problem in rte_timer_manage() that
leads to corruption of per-lcore pending-lists (implemented as
skip-lists). The race condition occurs when rte_timer_manage() expires
multiple timers on lcore A, while lcore B simultaneously invokes
rte_timer_reset() for one of the expiring timers (other than the first
one).
Lcore A splits its pending-list, creating a local list of expired timers
linked through their sl_next[0] pointers, and sets the first expired
timer to the RUNNING state, all during one list-lock round trip. Lcore A
then unlocks the list-lock to run the first callback, and that is when
lcore B can misinterpret the subsequent expired timers' true states.
Lcore B sees an expired timer still in the PENDING state, locks A's
list-lock, and reinserts the timer into A's pending-list. We are trying
to use the same pointer to belong to both lists!
One solution is to remove timers from the pending-list and set them to
the RUNNING state in one atomic step, i.e., we should perform these two
actions within one ownership of the list-lock.
--
Patch 2, new timer-manage race-condition test: We wrote a test to
confirm our suspicion that we could crash rte_timer_manage() (RTM) under
the right circumstances. We repeatedly set several timers to expire at
roughly the same time on the master core. The master lcore just delays
and runs RTM about ten times per second. The slave lcores all watch the
first timer (timer-0) to see when RTM is running on the master, i.e.,
timer-0's state is not PENDING. At this point, each slave attempts to
reset a subset of the timers to a later expiration time. The goal here
is to have the slaves moving most of the timers to a different place in
the master's pending-list, while the master is working with the same
next-pointers and running callback functions. This eventually results in
the master traversing a corrupted linked-list. In our observations, it
results in an infinite loop.
--
Patch 3, eliminate race condition in rte_timer_manage(): After splitting
the pending-list at the current time point, we try to set all expired
timers to the RUNNING state, and put back into the pending-list any
timers that we fail to set to the RUNNING state, all during one
list-lock round trip. It is then safe to run the callback functions
for all expired timers that remain on our local list, without the lock.
Robert Sanford (3):
timer: fix stress test 2 synchronization bug
timer: add timer-manage race condition test
timer: fix race condition in rte_timer_manage()
app/test/Makefile | 1 +
app/test/test_timer.c | 149 ++++++++++++++++++++++-------
app/test/test_timer_racecond.c | 209 ++++++++++++++++++++++++++++++++++++++++
lib/librte_timer/rte_timer.c | 45 ++++++---
4 files changed, 355 insertions(+), 49 deletions(-)
create mode 100644 app/test/test_timer_racecond.c
next reply other threads:[~2015-07-23 22:42 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-07-23 22:42 rsanford2 [this message]
2015-07-23 22:42 ` [dpdk-dev] [PATCH 1/3] timer: fix stress test 2 synchronization bug rsanford2
2015-07-23 22:42 ` [dpdk-dev] [PATCH 2/3] timer: add timer-manage race condition test rsanford2
2015-07-23 22:42 ` [dpdk-dev] [PATCH 3/3] timer: fix race condition in rte_timer_manage() rsanford2
2015-07-26 14:11 ` [dpdk-dev] [PATCH 0/3] timer: fix rte_timer_manage and improve unit tests Thomas Monjalon
2015-07-27 15:46 ` Sanford, Robert
2015-07-27 15:53 ` Thomas Monjalon
2015-07-27 22:46 ` [dpdk-dev] [PATCH v2 " rsanford2
2015-08-02 22:06 ` Thomas Monjalon
2015-07-27 22:46 ` [dpdk-dev] [PATCH v2 1/3] timer: fix stress test 2 synchronization bug rsanford2
2015-07-27 22:46 ` [dpdk-dev] [PATCH v2 2/3] timer: add timer-manage race condition test rsanford2
2015-07-27 22:46 ` [dpdk-dev] [PATCH v2 3/3] timer: fix race condition in rte_timer_manage() rsanford2
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=1437691347-58708-1-git-send-email-rsanford2@gmail.com \
--to=rsanford2@gmail.com \
--cc=dev@dpdk.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).