From: rsanford2@gmail.com
To: dev@dpdk.org
Date: Mon, 27 Jul 2015 18:46:06 -0400
Message-Id: <1438037168-639-4-git-send-email-rsanford2@gmail.com>
X-Mailer: git-send-email 1.7.1
In-Reply-To: <1437691347-58708-1-git-send-email-rsanford2@gmail.com>
References: <1437691347-58708-1-git-send-email-rsanford2@gmail.com>
Subject: [dpdk-dev] [PATCH v2 3/3] timer: fix race condition in rte_timer_manage()

From: Robert Sanford

Eliminate a problematic race condition in rte_timer_manage() that can
lead to corruption of per-lcore pending-lists (implemented as
skip-lists).

The race condition occurs when rte_timer_manage() expires multiple
timers on lcore A, while lcore B simultaneously invokes
rte_timer_reset() for one of the expiring timers (other than the first
one).

Lcore A splits its pending-list, creating a local list of expired
timers linked through their sl_next[0] pointers, and sets the first
expired timer to the RUNNING state, all during one list-lock round
trip. Lcore A then unlocks the list-lock to run the first callback,
and that is when A and B can have different interpretations of the
subsequent expired timers' true state. Lcore B sees an expired timer
still in the PENDING state, atomically changes the timer to the CONFIG
state, locks lcore A's list-lock, and reinserts the timer into A's
pending-list. The two lcores try to use the same next-pointers to
maintain both lists!

Our solution is to remove expired timers from the pending-list and try
to set them all to the RUNNING state in one atomic step, i.e.,
rte_timer_manage() should perform these two actions within one
ownership of the list-lock. After splitting the pending-list at the
current point in time and trying to set all expired timers to the
RUNNING state, we must put back into the pending-list any timers that
we failed to set to the RUNNING state, all while still holding the
list-lock. It is then safe to release the lock and run the callback
functions for all expired timers that remain on our local run-list.
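To make the locking pattern concrete, here is a simplified,
self-contained sketch of the same idea. It is not DPDK code: it uses a
sorted singly-linked list in place of the skip-list, a pthread mutex in
place of rte_spinlock, and a C11 compare-and-swap in place of
timer_set_running_state(); every name in it (toy_timer, toy_manage,
toy_readd, TOY_PENDING, ...) is invented for illustration only, and the
post-callback bookkeeping of the real code is omitted.

/*
 * toy_timer.c -- illustrative sketch only, not DPDK code.
 *
 * Models the fixed rte_timer_manage() flow with a sorted singly-linked
 * list instead of a skip-list, a pthread mutex instead of rte_spinlock,
 * and a C11 atomic state word instead of timer_set_running_state().
 *
 * Compile check: cc -std=c11 -c toy_timer.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

enum toy_state { TOY_PENDING, TOY_RUNNING, TOY_CONFIG, TOY_STOP };

struct toy_timer {
	struct toy_timer *next;              /* list linkage (like sl_next[0]) */
	uint64_t expire;                     /* expiry time, in "ticks"        */
	atomic_int state;                    /* enum toy_state                 */
	void (*cb)(struct toy_timer *);      /* user callback                  */
};

static struct toy_timer *pending_head;       /* sorted by expire */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

/* Re-insert a timer into the sorted pending list; caller holds list_lock.
 * Plays the role of timer_add() in the patch. */
static void
toy_readd(struct toy_timer *t)
{
	struct toy_timer **pp = &pending_head;

	while (*pp != NULL && (*pp)->expire <= t->expire)
		pp = &(*pp)->next;
	t->next = *pp;
	*pp = t;
}

static void
toy_manage(uint64_t now)
{
	struct toy_timer *run_first, *t, *next, **pprev;

	pthread_mutex_lock(&list_lock);

	/* 1) Split off the expired prefix of the pending list. */
	run_first = pending_head;
	pprev = &run_first;
	while (*pprev != NULL && (*pprev)->expire <= now)
		pprev = &(*pprev)->next;
	pending_head = *pprev;     /* first not-yet-expired timer stays */
	*pprev = NULL;             /* terminate the local run-list      */

	/* 2) Try to move every detached timer from PENDING to RUNNING.
	 * If another thread won the race (it already flipped the timer
	 * to TOY_CONFIG from its own reset call), unlink the timer from
	 * the run-list and put it back on the pending list -- still
	 * under the same lock acquisition. */
	pprev = &run_first;
	for (t = run_first; t != NULL; t = next) {
		next = t->next;
		int expected = TOY_PENDING;

		if (atomic_compare_exchange_strong(&t->state, &expected,
						   TOY_RUNNING))
			pprev = &t->next;
		else {
			*pprev = next;
			toy_readd(t);
		}
	}

	pthread_mutex_unlock(&list_lock);

	/* 3) Run callbacks with the lock released.  Every timer left on
	 * run_first is RUNNING, so no concurrent reset can touch its
	 * next pointer.  (Periodic reload and the 'updated' flag
	 * handling of the real code are omitted here.) */
	for (t = run_first; t != NULL; t = next) {
		next = t->next;
		t->cb(t);
		atomic_store(&t->state, TOY_STOP);
	}
}

The property the sketch shares with the patch below is that splitting
off the expired prefix, the PENDING-to-RUNNING transitions, and the
re-insertion of any timer that loses the race to a concurrent reset all
happen under a single acquisition of the list lock; callbacks run only
after the lock is released, when no other lcore can still treat those
timers as PENDING.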
Signed-off-by: Robert Sanford
---
 lib/librte_timer/rte_timer.c |   56 ++++++++++++++++++++++++++---------------
 1 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/lib/librte_timer/rte_timer.c b/lib/librte_timer/rte_timer.c
index 8e9243a..3dcdab5 100644
--- a/lib/librte_timer/rte_timer.c
+++ b/lib/librte_timer/rte_timer.c
@@ -504,6 +504,7 @@ void rte_timer_manage(void)
 {
 	union rte_timer_status status;
 	struct rte_timer *tim, *next_tim;
+	struct rte_timer *run_first_tim, **pprev;
 	unsigned lcore_id = rte_lcore_id();
 	struct rte_timer *prev[MAX_SKIPLIST_DEPTH + 1];
 	uint64_t cur_time;
@@ -519,9 +520,9 @@ void rte_timer_manage(void)
 	cur_time = rte_get_timer_cycles();
 
 #ifdef RTE_ARCH_X86_64
-	/* on 64-bit the value cached in the pending_head.expired will be updated
-	 * atomically, so we can consult that for a quick check here outside the
-	 * lock */
+	/* on 64-bit the value cached in the pending_head.expired will be
+	 * updated atomically, so we can consult that for a quick check here
+	 * outside the lock */
 	if (likely(priv_timer[lcore_id].pending_head.expire > cur_time))
 		return;
 #endif
@@ -531,8 +532,10 @@ void rte_timer_manage(void)
 
 	/* if nothing to do just unlock and return */
 	if (priv_timer[lcore_id].pending_head.sl_next[0] == NULL ||
-	    priv_timer[lcore_id].pending_head.sl_next[0]->expire > cur_time)
-		goto done;
+	    priv_timer[lcore_id].pending_head.sl_next[0]->expire > cur_time) {
+		rte_spinlock_unlock(&priv_timer[lcore_id].list_lock);
+		return;
+	}
 
 	/* save start of list of expired timers */
 	tim = priv_timer[lcore_id].pending_head.sl_next[0];
@@ -540,30 +543,47 @@ void rte_timer_manage(void)
 	/* break the existing list at current time point */
 	timer_get_prev_entries(cur_time, lcore_id, prev);
 	for (i = priv_timer[lcore_id].curr_skiplist_depth -1; i >= 0; i--) {
-		priv_timer[lcore_id].pending_head.sl_next[i] = prev[i]->sl_next[i];
+		priv_timer[lcore_id].pending_head.sl_next[i] =
+		    prev[i]->sl_next[i];
 		if (prev[i]->sl_next[i] == NULL)
 			priv_timer[lcore_id].curr_skiplist_depth--;
 		prev[i] ->sl_next[i] = NULL;
 	}
 
-	/* now scan expired list and call callbacks */
+	/* transition run-list from PENDING to RUNNING */
+	run_first_tim = tim;
+	pprev = &run_first_tim;
+
 	for ( ; tim != NULL; tim = next_tim) {
 		next_tim = tim->sl_next[0];
 
 		ret = timer_set_running_state(tim);
+		if (likely(ret == 0)) {
+			pprev = &tim->sl_next[0];
+		} else {
+			/* another core is trying to re-config this one,
+			 * remove it from local expired list and put it
+			 * back on the priv_timer[] skip list */
+			*pprev = next_tim;
+			timer_add(tim, lcore_id, 1);
+		}
+	}
 
-		/* this timer was not pending, continue */
-		if (ret < 0)
-			continue;
+	/* update the next to expire timer value */
+	priv_timer[lcore_id].pending_head.expire =
+	    (priv_timer[lcore_id].pending_head.sl_next[0] == NULL) ? 0 :
+		priv_timer[lcore_id].pending_head.sl_next[0]->expire;
 
-		rte_spinlock_unlock(&priv_timer[lcore_id].list_lock);
+	rte_spinlock_unlock(&priv_timer[lcore_id].list_lock);
 
+	/* now scan expired list and call callbacks */
+	for (tim = run_first_tim; tim != NULL; tim = next_tim) {
+		next_tim = tim->sl_next[0];
 		priv_timer[lcore_id].updated = 0;
 
 		/* execute callback function with list unlocked */
 		tim->f(tim, tim->arg);
 
-		rte_spinlock_lock(&priv_timer[lcore_id].list_lock);
 		__TIMER_STAT_ADD(pending, -1);
 		/* the timer was stopped or reloaded by the callback
 		 * function, we have nothing to do here */
@@ -579,23 +599,17 @@ void rte_timer_manage(void)
 		}
 		else {
 			/* keep it in list and mark timer as pending */
+			rte_spinlock_lock(&priv_timer[lcore_id].list_lock);
 			status.state = RTE_TIMER_PENDING;
 			__TIMER_STAT_ADD(pending, 1);
 			status.owner = (int16_t)lcore_id;
 			rte_wmb();
 			tim->status.u32 = status.u32;
 			__rte_timer_reset(tim, cur_time + tim->period,
-				tim->period, lcore_id, tim->f, tim->arg, 1);
+					tim->period, lcore_id, tim->f, tim->arg, 1);
+			rte_spinlock_unlock(&priv_timer[lcore_id].list_lock);
 		}
 	}
-
-	/* update the next to expire timer value */
-	priv_timer[lcore_id].pending_head.expire =
-		(priv_timer[lcore_id].pending_head.sl_next[0] == NULL) ? 0 :
-			priv_timer[lcore_id].pending_head.sl_next[0]->expire;
-done:
-	/* job finished, unlock the list lock */
-	rte_spinlock_unlock(&priv_timer[lcore_id].list_lock);
 }
 
 /* dump statistics about timers */
-- 
1.7.1