From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <dev-bounces@dpdk.org>
Received: from mails.dpdk.org (mails.dpdk.org [217.70.189.124])
	by inbox.dpdk.org (Postfix) with ESMTP id 25C67A0A02;
	Thu, 14 Jan 2021 15:47:08 +0100 (CET)
Received: from [217.70.189.124] (localhost [127.0.0.1])
	by mails.dpdk.org (Postfix) with ESMTP id B8654141319;
	Thu, 14 Jan 2021 15:46:32 +0100 (CET)
Received: from mga11.intel.com (mga11.intel.com [192.55.52.93])
 by mails.dpdk.org (Postfix) with ESMTP id 58FE6141312
 for <dev@dpdk.org>; Thu, 14 Jan 2021 15:46:31 +0100 (CET)
IronPort-SDR: 4Z43UBw9YxD9Cq53u/KqLII3XEPNQvB/ZQVqP2CaZFVvIoHf33QYKkMgmX2U0qfKxkDKW7vE7G
 TyvqxTma9qSQ==
X-IronPort-AV: E=McAfee;i="6000,8403,9863"; a="174870276"
X-IronPort-AV: E=Sophos;i="5.79,347,1602572400"; d="scan'208";a="174870276"
Received: from fmsmga001.fm.intel.com ([10.253.24.23])
 by fmsmga102.fm.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384;
 14 Jan 2021 06:46:31 -0800
IronPort-SDR: OnumARjXZiuXzXr+gSeJt+OYrSWqnL3uiH+g8KHeWgHq1nm/+HfeEFcFmWJz1ZdNVHM7/IH9j3
 XiqFO3+sIPJQ==
X-ExtLoop1: 1
X-IronPort-AV: E=Sophos;i="5.79,347,1602572400"; d="scan'208";a="465271332"
Received: from silpixa00399498.ir.intel.com (HELO
 silpixa00399498.ger.corp.intel.com) ([10.237.222.179])
 by fmsmga001.fm.intel.com with ESMTP; 14 Jan 2021 06:46:28 -0800
From: Anatoly Burakov <anatoly.burakov@intel.com>
To: dev@dpdk.org
Cc: Jan Viktorin <viktorin@rehivetech.com>,
 Ruifeng Wang <ruifeng.wang@arm.com>, Jerin Jacob <jerinj@marvell.com>,
 David Christensen <drc@linux.vnet.ibm.com>, Ray Kinsella <mdr@ashroe.eu>,
 Neil Horman <nhorman@tuxdriver.com>,
 Bruce Richardson <bruce.richardson@intel.com>,
 Konstantin Ananyev <konstantin.ananyev@intel.com>, thomas@monjalon.net,
 timothy.mcdaniel@intel.com, david.hunt@intel.com, chris.macnamara@intel.com
Date: Thu, 14 Jan 2021 14:46:07 +0000
Message-Id: <dc78c3c2ca703e090416be3662ff9a0f1b7ed827.1610635488.git.anatoly.burakov@intel.com>
X-Mailer: git-send-email 2.25.1
In-Reply-To: <cover.1610635488.git.anatoly.burakov@intel.com>
References: <cover.1610473000.git.anatoly.burakov@intel.com>
 <cover.1610635488.git.anatoly.burakov@intel.com>
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Subject: [dpdk-dev] [PATCH v17 05/11] eal: add monitor wakeup function
X-BeenThere: dev@dpdk.org
X-Mailman-Version: 2.1.29
Precedence: list
List-Id: DPDK patches and discussions <dev.dpdk.org>
List-Unsubscribe: <https://mails.dpdk.org/options/dev>,
 <mailto:dev-request@dpdk.org?subject=unsubscribe>
List-Archive: <http://mails.dpdk.org/archives/dev/>
List-Post: <mailto:dev@dpdk.org>
List-Help: <mailto:dev-request@dpdk.org?subject=help>
List-Subscribe: <https://mails.dpdk.org/listinfo/dev>,
 <mailto:dev-request@dpdk.org?subject=subscribe>
Errors-To: dev-bounces@dpdk.org
Sender: "dev" <dev-bounces@dpdk.org>

Now that we have everything in a C file, we can store information about
the ongoing sleep and provide a native mechanism to wake up the sleeping
core. Note that this mechanism will only wake up a core that is sleeping
while monitoring an address - waking up from `rte_power_pause` won't work.
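
For illustration only (not part of the patch), a minimal sketch of how an
application might pair the two calls: a worker lcore sleeps on a flag via
rte_power_monitor(), while another thread wakes it with
rte_power_monitor_wakeup() without writing to the flag. The flag, the
helper names and the one second timeout are made up for the example; the
rte_power_monitor_cond fields follow the definitions used in this series.

#include <stdint.h>

#include <rte_common.h>
#include <rte_cycles.h>
#include <rte_power_intrinsics.h>

static volatile uint64_t wake_flag; /* hypothetical flag watched by the worker */

/* worker lcore: sleep until wake_flag is written to, it is already set,
 * or roughly one second of TSC time passes
 */
static int
worker_wait(void *arg)
{
	const struct rte_power_monitor_cond pmc = {
		.addr = &wake_flag,
		.val = 1,            /* abort the sleep if the flag is already 1 */
		.mask = UINT64_MAX,  /* compare the full 64-bit value */
		.data_sz = sizeof(wake_flag),
	};
	const uint64_t tsc_timeout = rte_rdtsc() + rte_get_tsc_hz();

	RTE_SET_USED(arg);
	return rte_power_monitor(&pmc, tsc_timeout);
}

/* another thread: force the worker out of the optimized state without
 * touching wake_flag; this has no effect on a core spinning in
 * rte_power_pause()
 */
static void
wake_worker(unsigned int worker_lcore_id)
{
	rte_power_monitor_wakeup(worker_lcore_id);
}

The worker would typically be launched on its lcore with
rte_eal_remote_launch(worker_wait, NULL, worker_lcore_id).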

Signed-off-by: Anatoly Burakov <anatoly.burakov@intel.com>
Acked-by: Konstantin Ananyev <konstantin.ananyev@intel.com>
---

Notes:
    v17:
    - Improve code readability with a goto (the horror!)
    - Fix the compile issues for non-x86 archs
    
    v16:
    - Improve error handling
    - Take a lock before UMONITOR
    
    v15:
    - Fix check in UMWAIT callback
    
    v14:
    - Add comments around wakeup code to explain what it does
    - Add lcore_id parameter checking to prevent buffer overrun
    
    v13:
    - Rework the synchronization mechanism to not require locking
    - Add more parameter checking
    - Rework n_rx_queues access to not go through internal PMD structures and use
      public API instead

 lib/librte_eal/arm/rte_power_intrinsics.c     | 11 +++
 .../include/generic/rte_power_intrinsics.h    | 16 ++++
 lib/librte_eal/ppc/rte_power_intrinsics.c     | 11 +++
 lib/librte_eal/version.map                    |  1 +
 lib/librte_eal/x86/rte_power_intrinsics.c     | 93 ++++++++++++++++++-
 5 files changed, 131 insertions(+), 1 deletion(-)

diff --git a/lib/librte_eal/arm/rte_power_intrinsics.c b/lib/librte_eal/arm/rte_power_intrinsics.c
index 8d271dc0c1..e83f04072a 100644
--- a/lib/librte_eal/arm/rte_power_intrinsics.c
+++ b/lib/librte_eal/arm/rte_power_intrinsics.c
@@ -27,3 +27,14 @@ rte_power_pause(const uint64_t tsc_timestamp)
 
 	return -ENOTSUP;
 }
+
+/**
+ * This function is not supported on ARM.
+ */
+int
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+
+	return -ENOTSUP;
+}
diff --git a/lib/librte_eal/include/generic/rte_power_intrinsics.h b/lib/librte_eal/include/generic/rte_power_intrinsics.h
index 85343bc9eb..6109d28faa 100644
--- a/lib/librte_eal/include/generic/rte_power_intrinsics.h
+++ b/lib/librte_eal/include/generic/rte_power_intrinsics.h
@@ -62,6 +62,22 @@ __rte_experimental
 int rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 		const uint64_t tsc_timestamp);
 
+/**
+ * @warning
+ * @b EXPERIMENTAL: this API may change without prior notice
+ *
+ * Wake up a specific lcore that is in a power optimized state and is monitoring
+ * an address.
+ *
+ * @note This function will *not* wake up a core that is in a power optimized
+ *   state due to calling `rte_power_pause`.
+ *
+ * @param lcore_id
+ *   Lcore ID of a sleeping thread.
+ */
+__rte_experimental
+int rte_power_monitor_wakeup(const unsigned int lcore_id);
+
 /**
  * @warning
  * @b EXPERIMENTAL: this API may change without prior notice
diff --git a/lib/librte_eal/ppc/rte_power_intrinsics.c b/lib/librte_eal/ppc/rte_power_intrinsics.c
index f7862ea324..7fc9586da7 100644
--- a/lib/librte_eal/ppc/rte_power_intrinsics.c
+++ b/lib/librte_eal/ppc/rte_power_intrinsics.c
@@ -27,3 +27,14 @@ rte_power_pause(const uint64_t tsc_timestamp)
 
 	return -ENOTSUP;
 }
+
+/**
+ * This function is not supported on PPC64.
+ */
+int
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	RTE_SET_USED(lcore_id);
+
+	return -ENOTSUP;
+}
diff --git a/lib/librte_eal/version.map b/lib/librte_eal/version.map
index 1fcd1d3bed..fce90a112f 100644
--- a/lib/librte_eal/version.map
+++ b/lib/librte_eal/version.map
@@ -406,6 +406,7 @@ EXPERIMENTAL {
 
 	# added in 21.02
 	rte_power_monitor;
+	rte_power_monitor_wakeup;
 	rte_power_pause;
 	rte_thread_tls_key_create;
 	rte_thread_tls_key_delete;
diff --git a/lib/librte_eal/x86/rte_power_intrinsics.c b/lib/librte_eal/x86/rte_power_intrinsics.c
index 29247d8638..af3ae3237c 100644
--- a/lib/librte_eal/x86/rte_power_intrinsics.c
+++ b/lib/librte_eal/x86/rte_power_intrinsics.c
@@ -2,8 +2,31 @@
  * Copyright(c) 2020 Intel Corporation
  */
 
+#include <rte_common.h>
+#include <rte_lcore.h>
+#include <rte_spinlock.h>
+
 #include "rte_power_intrinsics.h"
 
+/*
+ * Per-lcore structure holding current status of C0.2 sleeps.
+ */
+static struct power_wait_status {
+	rte_spinlock_t lock;
+	volatile void *monitor_addr; /**< NULL if not currently sleeping */
+} __rte_cache_aligned wait_status[RTE_MAX_LCORE];
+
+static inline void
+__umwait_wakeup(volatile void *addr)
+{
+	uint64_t val;
+
+	/* trigger a write but don't change the value */
+	val = __atomic_load_n((volatile uint64_t *)addr, __ATOMIC_RELAXED);
+	__atomic_compare_exchange_n((volatile uint64_t *)addr, &val, val, 0,
+			__ATOMIC_RELAXED, __ATOMIC_RELAXED);
+}
+
 static bool wait_supported;
 
 static inline uint64_t
@@ -51,17 +74,29 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 {
 	const uint32_t tsc_l = (uint32_t)tsc_timestamp;
 	const uint32_t tsc_h = (uint32_t)(tsc_timestamp >> 32);
+	const unsigned int lcore_id = rte_lcore_id();
+	struct power_wait_status *s;
 
 	/* prevent user from running this instruction if it's not supported */
 	if (!wait_supported)
 		return -ENOTSUP;
 
+	/* prevent non-EAL thread from using this API */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
 	if (pmc == NULL)
 		return -EINVAL;
 
 	if (__check_val_size(pmc->data_sz) < 0)
 		return -EINVAL;
 
+	s = &wait_status[lcore_id];
+
+	/* update sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = pmc->addr;
+
 	/*
 	 * we're using raw byte codes for now as only the newest compiler
 	 * versions support this instruction natively.
@@ -72,6 +107,10 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 			:
 			: "D"(pmc->addr));
 
+	/* now that we've put this address into monitor, we can unlock */
+	rte_spinlock_unlock(&s->lock);
+
+	/* if we have a comparison mask, we might not need to sleep at all */
 	if (pmc->mask) {
 		const uint64_t cur_value = __get_umwait_val(
 				pmc->addr, pmc->data_sz);
@@ -79,14 +118,21 @@ rte_power_monitor(const struct rte_power_monitor_cond *pmc,
 
 		/* if the masked value is already matching, abort */
 		if (masked == pmc->val)
-			return 0;
+			goto end;
 	}
+
 	/* execute UMWAIT */
 	asm volatile(".byte 0xf2, 0x0f, 0xae, 0xf7;"
 			: /* ignore rflags */
 			: "D"(0), /* enter C0.2 */
 			  "a"(tsc_l), "d"(tsc_h));
 
+end:
+	/* erase sleep address */
+	rte_spinlock_lock(&s->lock);
+	s->monitor_addr = NULL;
+	rte_spinlock_unlock(&s->lock);
+
 	return 0;
 }
 
@@ -122,3 +168,48 @@ RTE_INIT(rte_power_intrinsics_init) {
 	if (i.power_monitor && i.power_pause)
 		wait_supported = 1;
 }
+
+int
+rte_power_monitor_wakeup(const unsigned int lcore_id)
+{
+	struct power_wait_status *s;
+
+	/* prevent user from running this instruction if it's not supported */
+	if (!wait_supported)
+		return -ENOTSUP;
+
+	/* prevent buffer overrun */
+	if (lcore_id >= RTE_MAX_LCORE)
+		return -EINVAL;
+
+	s = &wait_status[lcore_id];
+
+	/*
+	 * There is a race condition between sleep, wakeup and locking, but we
+	 * don't need to handle it.
+	 *
+	 * Possible situations:
+	 *
+	 * 1. T1 locks, sets address, unlocks
+	 * 2. T2 locks, triggers wakeup, unlocks
+	 * 3. T1 sleeps
+	 *
+	 * In this case, because T1 has already set the address for monitoring,
+	 * we will wake up immediately even if T2 triggers wakeup before T1
+	 * goes to sleep.
+	 *
+	 * 1. T1 locks, sets address, unlocks, goes to sleep, and wakes up
+	 * 2. T2 locks, triggers wakeup, and unlocks
+	 * 3. T1 locks, erases address, and unlocks
+	 *
+	 * In this case, since we've already woken up, the "wakeup" was
+	 * unneeded, and since T1 is still waiting on T2 releasing the lock, the
+	 * wakeup address is still valid so it's perfectly safe to write it.
+	 */
+	rte_spinlock_lock(&s->lock);
+	if (s->monitor_addr != NULL)
+		__umwait_wakeup(s->monitor_addr);
+	rte_spinlock_unlock(&s->lock);
+
+	return 0;
+}
-- 
2.25.1