DPDK patches and discussions
 help / color / mirror / Atom feed
From: Thomas Monjalon <thomas.monjalon@6wind.com>
To: rsanford2@gmail.com
Cc: dev@dpdk.org
Subject: Re: [dpdk-dev] [PATCH 0/3] timer: fix rte_timer_manage and improve unit tests
Date: Sun, 26 Jul 2015 16:11:40 +0200	[thread overview]
Message-ID: <4708068.jtpnxJe3Ir@xps13> (raw)
In-Reply-To: <1437691347-58708-1-git-send-email-rsanford2@gmail.com>

2015-07-23 18:42, rsanford2@gmail.com:
> From: Robert Sanford <rsanford@akamai.com>
> 
> This patchset fixes a bug in timer stress test 2, adds a new stress test
> to expose a race condition bug in API rte_timer_manage(), and then fixes
> the rte_timer_manage() bug.
> --
> 
> Patch 1, app/test timer stress test 2: Sometimes this test fails and
> seg-faults because the slave lcores get out of phase with the master.
> The master uses a single int, 'ready', to synchronize multiple slave
> lcores through multiple phases of the test.
> 
> To resolve, we construct simple synchronization primitives that use one
> atomic-int state variable per slave. The master tells the slaves when to
> start, and then waits for all of them to finish. Each slave waits for
> the master to tell it to start, and then tells the master when it has
> finished.
> --
> 
> Description of rte_timer_manage() race condition bug: Through code
> inspection, we noticed a potential problem in rte_timer_manage() that
> leads to corruption of per-lcore pending-lists (implemented as
> skip-lists). The race condition occurs when rte_timer_manage() expires
> multiple timers on lcore A, while lcore B simultaneously invokes
> rte_timer_reset() for one of the expiring timers (other than the first
> one).
> 
> Lcore A splits its pending-list, creating a local list of expired timers
> linked through their sl_next[0] pointers, and sets the first expired
> timer to the RUNNING state, all during one list-lock round trip. Lcore A
> then unlocks the list-lock to run the first callback, and that is when
> lcore B can misinterpret the subsequent expired timers' true states.
> Lcore B sees an expired timer still in the PENDING state, locks A's
> list-lock, and reinserts the timer into A's pending-list. We are trying
> to use the same pointer to belong to both lists!
> 
> One solution is to remove timers from the pending-list and set them to
> the RUNNING state in one atomic step, i.e., we should perform these two
> actions within one ownership of the list-lock.
> --
> 
> Patch 2, new timer-manage race-condition test: We wrote a test to
> confirm our suspicion that we could crash rte_timer_manage() (RTM) under
> the right circumstances. We repeatedly set several timers to expire at
> roughly the same time on the master core. The master lcore just delays
> and runs RTM about ten times per second. The slave lcores all watch the
> first timer (timer-0) to see when RTM is running on the master, i.e.,
> timer-0's state is not PENDING. At this point, each slave attempts to
> reset a subset of the timers to a later expiration time. The goal here
> is to have the slaves moving most of the timers to a different place in
> the master's pending-list, while the master is working with the same
> next-pointers and running callback functions. This eventually results in
> the master traversing a corrupted linked-list. In our observations, it
> results in an infinite loop.
> --
> 
> Patch 3, eliminate race condition in rte_timer_manage(): After splitting
> the pending-list at the current time point, we try to set all expired
> timers to the RUNNING state, and put back into the pending-list any
> timers that we fail to set to the RUNNING state, all during one
> list-lock round trip. It is then safe to run the callback functions
> for all expired timers that remain on our local list, without the lock.

Please, could you re-send this serie after having added the description of
each patch in the commit messages?
It seems you fix 2 bugs in the first patch. It may be clearer to split it
in 2 patches with separate explanations.

Thanks

  parent reply	other threads:[~2015-07-26 14:12 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-07-23 22:42 rsanford2
2015-07-23 22:42 ` [dpdk-dev] [PATCH 1/3] timer: fix stress test 2 synchronization bug rsanford2
2015-07-23 22:42 ` [dpdk-dev] [PATCH 2/3] timer: add timer-manage race condition test rsanford2
2015-07-23 22:42 ` [dpdk-dev] [PATCH 3/3] timer: fix race condition in rte_timer_manage() rsanford2
2015-07-26 14:11 ` Thomas Monjalon [this message]
2015-07-27 15:46   ` [dpdk-dev] [PATCH 0/3] timer: fix rte_timer_manage and improve unit tests Sanford, Robert
2015-07-27 15:53     ` Thomas Monjalon
2015-07-27 22:46 ` [dpdk-dev] [PATCH v2 " rsanford2
2015-08-02 22:06   ` Thomas Monjalon
2015-07-27 22:46 ` [dpdk-dev] [PATCH v2 1/3] timer: fix stress test 2 synchronization bug rsanford2
2015-07-27 22:46 ` [dpdk-dev] [PATCH v2 2/3] timer: add timer-manage race condition test rsanford2
2015-07-27 22:46 ` [dpdk-dev] [PATCH v2 3/3] timer: fix race condition in rte_timer_manage() rsanford2

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4708068.jtpnxJe3Ir@xps13 \
    --to=thomas.monjalon@6wind.com \
    --cc=dev@dpdk.org \
    --cc=rsanford2@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).